ADR-0006 (superseded)

Adopt Prometheus + Grafana for system observability

Superseded by [ADR-0033 — VictoriaMetrics + Grafana monitoring stack](0033-victoriametrics-grafana-monitoring-stack.md)

Context and Problem Statement

The Mac Mini runs 3 Docker containers (Inngest, Qdrant, Redis), a Bun worker (system-bus), and Caddy — all managed via Docker Compose and launchd. AGENTS.md principle #2 states “Observability over opacity — if you can’t see it, you can’t fix it.”

Today, observability is ad-hoc: the Inngest dashboard shows workflow traces, Docker healthchecks probe containers, slog logs config changes, and Qdrant exposes Prometheus metrics that nobody scrapes. There is no metrics history, no unified dashboard, no alerting, and no system-level visibility (CPU, memory, disk). Logs are scattered across Docker stdout, launchd stderr, and system-log.jsonl.

The question: How should this single-machine personal system collect, store, visualize, and alert on operational metrics — given an 8GB RAM budget and one operator?

Decision Drivers

  • 8GB RAM constraint — current Docker containers use ~411MB; every new container costs real memory
  • Single operator — Joel only; no on-call rotation, no team dashboards, no multi-tenant concerns
  • Already running Docker Compose — adding containers is operationally cheap (one docker compose up)
  • AGENTS.md principle #2 — “Observability over opacity” is a stated system value
  • Tailscale access — dashboards should be reachable from any device on the tailnet, not just SSH
  • Prototype stage — avoid over-engineering; favor reversible choices
  • Existing Prometheus surface — Qdrant already serves /metrics in Prometheus format; Redis and the host can be instrumented cheaply

Considered Options

  • Option 1: Prometheus + Grafana stack — industry-standard metrics pipeline with optional Loki for logs
  • Option 2: Lightweight custom approach — extend slog or build a Bun/TypeScript metrics collector writing to SQLite
  • Option 3: Do nothing yet — continue ad-hoc checks; revisit when complexity grows

Decision Outcome

Chosen option: “Option 1: Prometheus + Grafana stack”, because the infrastructure already exists to support it (Docker Compose, Qdrant metrics endpoint, Caddy for HTTPS), the memory cost is manageable (~150–200MB total for Prometheus + Grafana), and it provides a battle-tested foundation that grows with the system. A custom approach (Option 2) would require building and maintaining bespoke tooling for a problem that’s already well-solved. Doing nothing (Option 3) violates the stated observability principle and leaves pipeline failures silent.

However — implement in phases. Start with Prometheus + node_exporter only (Phase 1), add Grafana when dashboards are needed (Phase 2), and defer Loki/log aggregation until log volume or scatter actually causes pain (Phase 3).

Consequences

  • Good, because Qdrant metrics are immediately captured with zero code changes
  • Good, because node_exporter provides Mac Mini CPU/memory/disk/temp visibility — the biggest current blind spot
  • Good, because Grafana alerting can notify on pipeline failures, disk pressure, or container restarts
  • Good, because Caddy + Tailscale makes dashboards accessible from iPhone/iPad/laptop
  • Good, because Prometheus + Grafana are universally understood — vast ecosystem of dashboards, exporters, and docs
  • Bad, because it adds ~150–200MB RAM to an 8GB machine (bringing Docker total to ~600MB)
  • Bad, because Prometheus retention must be kept short (15 days) on the Mac Mini's limited SSD, so long-term metrics history is unavailable
  • Bad, because Grafana adds operational surface area (auth, provisioning, upgrades)
  • Neutral, because the Inngest dashboard remains the primary tool for workflow debugging — Grafana supplements but doesn’t replace it

Implementation Plan

Phase 1: Metrics Collection (Prometheus + exporters)

Add to ~/Code/system-bus/docker-compose.yml:

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=15d'
      - '--storage.tsdb.retention.size=1GB'
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:9090/-/healthy"]
      interval: 30s
      timeout: 5s
      retries: 3
 
  redis-exporter:
    image: oliver006/redis_exporter:latest
    ports:
      - "9121:9121"
    environment:
      - REDIS_ADDR=redis://redis:6379
    depends_on:
      - redis
    restart: unless-stopped
 
  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
    restart: unless-stopped

# prometheus_data is a named volume and must also be declared at the top level
# of docker-compose.yml (add to the existing volumes block if one is present):
volumes:
  prometheus_data:

Note on node_exporter: The Linux container reports on Docker Desktop's Linux VM, not on the macOS host, so it won't expose Mac Mini metrics. For full Mac Mini visibility, consider running node_exporter natively via Homebrew (brew install node_exporter) and scraping host.docker.internal:9100 instead. Alternatively, use a macOS-compatible exporter such as macos_exporter. Evaluate during implementation.

Create ~/Code/system-bus/prometheus/prometheus.yml:

global:
  scrape_interval: 30s
  evaluation_interval: 30s
 
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
 
  - job_name: 'qdrant'
    static_configs:
      - targets: ['qdrant:6333']
 
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
 
  - job_name: 'node'
    static_configs:
      - targets: ['host.docker.internal:9100']  # if running natively
      # - targets: ['node-exporter:9100']        # if running in Docker

Estimated memory: Prometheus ~80–100MB, redis_exporter ~10MB, node_exporter ~15MB.
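
To keep those estimates from silently drifting, the same compose file can cap each service. A minimal sketch using Compose's mem_limit attribute; the specific ceilings are assumptions to tune against docker stats during implementation:

  prometheus:
    # ...service definition as above...
    mem_limit: 200m        # hard ceiling; the container is OOM-killed beyond this

  redis-exporter:
    mem_limit: 32m

  node-exporter:
    mem_limit: 32m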

Phase 2: Dashboards (Grafana)

Add to docker-compose.yml:

  grafana:
    image: grafana/grafana-oss:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_AUTH_ANONYMOUS_ENABLED=false
    depends_on:
      - prometheus
    restart: unless-stopped

# grafana_data must also be added to the top-level volumes block:
volumes:
  grafana_data:

Create ~/Code/system-bus/grafana/provisioning/datasources/prometheus.yml:

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

Add Caddy route for Tailscale HTTPS access:

grafana.three-body.tail[...].ts.net {
    reverse_proxy localhost:3000
    tls /path/to/tailscale/certs/cert.pem /path/to/tailscale/certs/key.pem
}

Estimated memory: Grafana ~50–80MB.

Initial dashboards (import from Grafana marketplace):

  • Node Exporter Full (ID: 1860) — CPU, memory, disk, network
  • Redis Dashboard (ID: 763) — connections, memory, commands/sec
  • Qdrant — custom or community dashboard using the exposed metrics
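
If these imports should be version-controlled rather than live only in Grafana's database, dashboard JSON can also be loaded from a provisioned directory. A minimal sketch for ~/Code/system-bus/grafana/provisioning/dashboards/default.yml; field names follow Grafana's dashboard-provider provisioning schema, the path reuses the provisioning mount already defined above, and exported dashboard JSON files would be dropped into that directory:

apiVersion: 1
providers:
  - name: default
    orgId: 1
    type: file
    disableDeletion: false
    updateIntervalSeconds: 60
    options:
      # JSON files placed here (inside the mounted ./grafana/provisioning dir) are loaded on start
      path: /etc/grafana/provisioning/dashboards
      foldersFromFilesStructure: true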

Phase 3: Log Aggregation (Loki) — Deferred

Only add Loki when scattered logs actually cause debugging pain. When ready (see the compose sketch after this list):

  • Loki container (~40MB) + Promtail sidecar
  • Scrape Docker container logs + launchd stderr + system-log.jsonl
  • Grafana Explore for unified log search
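
A rough compose sketch for that future phase, so Phase 3 doesn't start from zero. The images are the stock Grafana ones; the Promtail mount is a placeholder, since launchd stderr and system-log.jsonl live on the macOS host and the exact paths need to be shared with Docker Desktop:

  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    command:
      - '-config.file=/etc/loki/local-config.yaml'   # default config shipped in the image
    restart: unless-stopped

  promtail:
    image: grafana/promtail:latest
    volumes:
      - ./promtail/promtail.yml:/etc/promtail/promtail.yml
      # placeholder host log path; adjust to wherever launchd stderr and system-log.jsonl land
      - ${HOME}/Library/Logs:/var/log/host:ro
    command:
      - '-config.file=/etc/promtail/promtail.yml'
    depends_on:
      - loki
    restart: unless-stopped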

Alerting Rules (Phase 2)

Start with a small set of high-signal alerts; add them as Prometheus rules or Grafana alerts (a rule-file sketch follows the table):

| Alert | Condition | Why it matters |
| --- | --- | --- |
| Disk > 80% | node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.2 | Mac Mini has limited SSD; video ingest fills disk |
| Container restart | increase(container_restart_count[5m]) > 0 | Inngest/Qdrant/Redis crashed and restarted |
| Qdrant memory > 1GB | qdrant_memory_allocated_bytes > 1e9 | Vector store growing beyond budget on 8GB machine |
| Redis memory > 200MB | redis_memory_used_bytes > 2e8 | Approaching the 256MB maxmemory limit |
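
As a concrete starting point, the disk and Redis rows can be expressed as a Prometheus rule file. A minimal sketch for a hypothetical ~/Code/system-bus/prometheus/rules/system.yml; it assumes the rules directory is mounted into the container and a rule_files entry is added to prometheus.yml. Only the two rows whose metric names come from the standard node and Redis exporters are included; the container-restart and Qdrant expressions depend on which exporter and metric names get confirmed during implementation:

# Requires in prometheus.yml:
#   rule_files:
#     - '/etc/prometheus/rules/*.yml'
# and an extra volume mount: ./prometheus/rules:/etc/prometheus/rules
groups:
  - name: system
    rules:
      - alert: DiskAlmostFull
        expr: node_filesystem_avail_bytes{fstype!~"tmpfs|devfs"} / node_filesystem_size_bytes < 0.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Less than 20% disk space left on {{ $labels.device }}"

      - alert: RedisMemoryHigh
        expr: redis_memory_used_bytes > 2e8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory above 200MB (maxmemory is 256MB)"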

Alert delivery: start with Grafana → webhook → Inngest event (system/alert), which the system-logger can write to system-log.jsonl. Add Telegram/Slack/email later.
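
One way to wire that delivery path is a provisioned Grafana contact point posting to a small adapter, since Grafana's webhook payload is not shaped like an Inngest event. A sketch for ~/Code/system-bus/grafana/provisioning/alerting/contact-points.yml; the field names follow Grafana's alerting-provisioning format, and the adapter URL is a hypothetical route on the system-bus worker that re-emits the alert as a system/alert Inngest event:

apiVersion: 1
contactPoints:
  - orgId: 1
    name: system-bus-webhook
    receivers:
      - uid: system-bus-webhook
        type: webhook
        settings:
          # hypothetical adapter endpoint on the Bun worker (port is an assumption)
          url: http://host.docker.internal:3939/alerts/grafana
          httpMethod: POST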

  • Affected paths: ~/Code/system-bus/docker-compose.yml, new ~/Code/system-bus/prometheus/ dir, new ~/Code/system-bus/grafana/ dir, ~/.local/caddy/Caddyfile
  • Dependencies: Docker images prom/prometheus, oliver006/redis_exporter, prom/node-exporter, grafana/grafana-oss. Optionally brew install node_exporter for native macOS metrics.
  • Patterns to follow: All new services go in the existing docker-compose.yml. Caddy routes follow existing pattern in Caddyfile. Credentials via agent-secrets.
  • Patterns to avoid: Do NOT expose Grafana or Prometheus to the public internet — Tailscale only. Do NOT set Prometheus retention beyond 15 days on this machine. Do NOT add Loki in Phase 1 or 2.
  • Configuration: Store GRAFANA_ADMIN_PASSWORD in agent-secrets. Prometheus config is file-based (no env vars needed).
  • Migration steps: None — this is additive. Existing services are unchanged.

Verification

Phase 1

  • docker compose up starts Prometheus, redis-exporter, and node-exporter without errors
  • curl -s http://localhost:9090/api/v1/targets shows Qdrant, Redis, and node targets as "health": "up"
  • curl -s http://localhost:9090/api/v1/query?query=up returns results for all scrape targets
  • docker stats --no-stream shows Prometheus using < 120MB
  • Total Docker memory (all containers) stays under 700MB

Phase 2

  • Grafana accessible at https://grafana.<tailscale-hostname>/ from another tailnet device
  • Prometheus datasource shows “Data source is working” in Grafana → Settings → Data Sources
  • At least one dashboard shows live Qdrant and Redis metrics
  • Disk usage alert fires when manually tested (e.g., lower threshold temporarily)

Pros and Cons of the Options

Option 1: Prometheus + Grafana stack

Industry-standard observability stack. Prometheus scrapes metrics endpoints on a schedule and stores time-series data. Grafana visualizes and alerts.

  • Good, because Qdrant /metrics works out of the box — zero code changes
  • Good, because redis_exporter and node_exporter are mature, single-binary, low-overhead
  • Good, because Grafana has thousands of pre-built dashboards — no custom UI work
  • Good, because alerting is built into Grafana — webhook, email, Slack, Telegram
  • Good, because universal knowledge — any future collaborator (human or agent) knows this stack
  • Neutral, because 30s scrape interval is plenty for a personal system (not real-time)
  • Bad, because adds ~150–200MB RAM across 3–4 new containers
  • Bad, because node_exporter in Docker on macOS has limited host visibility — may need native install
  • Bad, because Grafana provisioning (dashboards-as-code) has a learning curve
  • Bad, because it’s arguably over-built for 3 containers on one machine

Option 2: Lightweight custom approach

Build a Bun/TypeScript service that periodically polls docker stats, Qdrant /metrics, Redis INFO, and writes to SQLite. Query via CLI (slog-style) or a minimal web UI.

  • Good, because zero new containers — runs as another Inngest function or launchd service
  • Good, because full control over what’s collected and how it’s stored
  • Good, because SQLite + Bun is extremely lightweight (~10MB)
  • Good, because fits the “build tools you understand” philosophy
  • Bad, because building a metrics collector, storage layer, query API, and alerting from scratch is significant work
  • Bad, because no pre-built dashboards — every visualization is custom
  • Bad, because alerting would need to be built from scratch
  • Bad, because not portable — future collaborators can’t leverage existing knowledge
  • Bad, because it reinvents what Prometheus already does well

Option 3: Do nothing yet

Continue using docker stats, curl, Inngest dashboard, and slog tail for ad-hoc checks. Revisit when the system grows.

  • Good, because zero additional resource usage
  • Good, because no new operational surface area to maintain
  • Good, because the system is young — unclear what metrics actually matter yet
  • Bad, because Qdrant metrics are generated and immediately discarded — wasted signal
  • Bad, because pipeline failures are silent — you discover them hours or days later
  • Bad, because “do nothing” contradicts AGENTS.md principle #2 (“Observability over opacity”)
  • Bad, because as more Inngest functions are added, the blind spots compound

More Information

  • ADR-0002: Established vault as single source of truth and “observability over opacity” as a core principle
  • ADR-0005: Multi-agent Inngest pipelines that produce the workflow traces needing observability

Memory Budget

| Service | Current | After Phase 1 | After Phase 2 |
| --- | --- | --- | --- |
| Inngest | 78 MB | 78 MB | 78 MB |
| Qdrant | 312 MB | 312 MB | 312 MB |
| Redis | 21 MB | 21 MB | 21 MB |
| Prometheus | | ~100 MB | ~100 MB |
| redis_exporter | | ~10 MB | ~10 MB |
| node_exporter | | ~15 MB | ~15 MB |
| Grafana | | | ~70 MB |
| Total | 411 MB | ~536 MB | ~606 MB |

Leaves ~7GB for macOS, the Bun worker, pi, Claude Code, and other processes. Acceptable.

Revisit Triggers

  • If total Docker memory exceeds 1GB → re-evaluate what’s running or increase to 16GB RAM
  • If Prometheus disk exceeds 1GB → reduce retention or add downsampling
  • If Grafana goes unused for 30 days → remove it and keep Prometheus + CLI queries
  • If a better macOS-native monitoring solution emerges → evaluate replacing node_exporter
  • If Inngest adds native Prometheus metrics → add as scrape target

Open Questions

  • Inngest /metrics auth: The endpoint returned “Authentication failed.” Need to investigate whether the signing key or event key unlocks it. If so, add as a Prometheus scrape target.
  • macOS node_exporter: The Docker image mounts /proc and /sys which don’t exist on macOS. Need to test whether Homebrew node_exporter or a macOS-specific exporter is needed.
  • Alert delivery: Webhook to Inngest is elegant but circular (monitoring depends on the thing being monitored). May need an independent notification path (e.g., direct Telegram bot).