# Adopt Prometheus + Grafana for system observability

Superseded by [ADR-0033 — VictoriaMetrics + Grafana monitoring stack](0033-victoriametrics-grafana-monitoring-stack.md)

## Context and Problem Statement
The Mac Mini runs 3 Docker containers (Inngest, Qdrant, Redis), a Bun worker (system-bus), and Caddy — all managed via Docker Compose and launchd. AGENTS.md principle #2 states “Observability over opacity — if you can’t see it, you can’t fix it.”
Today, observability is ad-hoc: the Inngest dashboard shows workflow traces, Docker healthchecks probe containers, slog logs config changes, and Qdrant exposes Prometheus metrics that nobody scrapes. There is no metrics history, no unified dashboard, no alerting, and no system-level visibility (CPU, memory, disk). Logs are scattered across Docker stdout, launchd stderr, and system-log.jsonl.
The question: How should this single-machine personal system collect, store, visualize, and alert on operational metrics — given an 8GB RAM budget and one operator?
## Decision Drivers
- 8GB RAM constraint — current Docker containers use ~411MB; every new container costs real memory
- Single operator — Joel only; no on-call rotation, no team dashboards, no multi-tenant concerns
- Already running Docker Compose — adding containers is operationally cheap (one `docker compose up`)
- AGENTS.md principle #2 — “Observability over opacity” is a stated system value
- Tailscale access — dashboards should be reachable from any device on the tailnet, not just SSH
- Prototype stage — avoid over-engineering; favor reversible choices
- Existing Prometheus surface — Qdrant already serves `/metrics` in Prometheus format; Redis and the host can be instrumented cheaply
## Considered Options
- Option 1: Prometheus + Grafana stack — industry-standard metrics pipeline with optional Loki for logs
- Option 2: Lightweight custom approach — extend `slog` or build a Bun/TypeScript metrics collector writing to SQLite
- Option 3: Do nothing yet — continue ad-hoc checks; revisit when complexity grows
## Decision Outcome
Chosen option: “Option 1: Prometheus + Grafana stack”, because the infrastructure already exists to support it (Docker Compose, Qdrant metrics endpoint, Caddy for HTTPS), the memory cost is manageable (~150–200MB total for Prometheus + Grafana), and it provides a battle-tested foundation that grows with the system. A custom approach (Option 2) would require building and maintaining bespoke tooling for a problem that’s already well-solved. Doing nothing (Option 3) violates the stated observability principle and leaves pipeline failures silent.
However — implement in phases. Start with Prometheus + node_exporter only (Phase 1), add Grafana when dashboards are needed (Phase 2), and defer Loki/log aggregation until log volume or scatter actually causes pain (Phase 3).
## Consequences
- Good, because Qdrant metrics are immediately captured with zero code changes
- Good, because `node_exporter` provides Mac Mini CPU/memory/disk/temp visibility — the biggest current blind spot
- Good, because Grafana alerting can notify on pipeline failures, disk pressure, or container restarts
- Good, because Caddy + Tailscale makes dashboards accessible from iPhone/iPad/laptop
- Good, because Prometheus + Grafana are universally understood — vast ecosystem of dashboards, exporters, and docs
- Bad, because it adds ~150–200MB RAM to an 8GB machine (bringing Docker total to ~600MB)
- Bad, because Prometheus retention on a Mac Mini should be short (15 days) to avoid disk bloat
- Bad, because Grafana adds operational surface area (auth, provisioning, upgrades)
- Neutral, because the Inngest dashboard remains the primary tool for workflow debugging — Grafana supplements but doesn’t replace it
## Implementation Plan

### Phase 1: Metrics Collection (Prometheus + exporters)

Add to `~/Code/system-bus/docker-compose.yml`:
```yaml
prometheus:
  image: prom/prometheus:latest
  ports:
    - "9090:9090"
  volumes:
    - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
    - prometheus_data:/prometheus
  command:
    - '--config.file=/etc/prometheus/prometheus.yml'
    - '--storage.tsdb.retention.time=15d'
    - '--storage.tsdb.retention.size=1GB'
  restart: unless-stopped
  healthcheck:
    test: ["CMD", "wget", "--spider", "-q", "http://localhost:9090/-/healthy"]
    interval: 30s
    timeout: 5s
    retries: 3

redis-exporter:
  image: oliver006/redis_exporter:latest
  ports:
    - "9121:9121"
  environment:
    - REDIS_ADDR=redis://redis:6379
  depends_on:
    - redis
  restart: unless-stopped

node-exporter:
  image: prom/node-exporter:latest
  ports:
    - "9100:9100"
  volumes:
    - /proc:/host/proc:ro
    - /sys:/host/sys:ro
  command:
    - '--path.procfs=/host/proc'
    - '--path.sysfs=/host/sys'
  restart: unless-stopped
```

**Note on node_exporter:** The Linux container image won’t expose macOS-specific metrics. For full Mac Mini visibility, consider running `node_exporter` natively via Homebrew (`brew install node_exporter`) and scraping `host.docker.internal:9100` instead. Alternatively, use a macOS-compatible exporter like `macos_exporter`. Evaluate during implementation.
Create `~/Code/system-bus/prometheus/prometheus.yml`:
```yaml
global:
  scrape_interval: 30s
  evaluation_interval: 30s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'qdrant'
    static_configs:
      - targets: ['qdrant:6333']

  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']

  - job_name: 'node'
    static_configs:
      - targets: ['host.docker.internal:9100'] # if running natively
      # - targets: ['node-exporter:9100']      # if running in Docker
```

Estimated memory: Prometheus ~80–100MB, redis_exporter ~10MB, node_exporter ~15MB.
### Phase 2: Dashboards (Grafana)

Add to `docker-compose.yml`:
```yaml
grafana:
  image: grafana/grafana-oss:latest
  ports:
    - "3000:3000"
  volumes:
    - grafana_data:/var/lib/grafana
    - ./grafana/provisioning:/etc/grafana/provisioning
  environment:
    - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}
    - GF_USERS_ALLOW_SIGN_UP=false
    - GF_AUTH_ANONYMOUS_ENABLED=false
  depends_on:
    - prometheus
  restart: unless-stopped
```

Create `~/Code/system-bus/grafana/provisioning/datasources/prometheus.yml`:
```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```

Add Caddy route for Tailscale HTTPS access:
```
grafana.three-body.tail[...].ts.net {
    reverse_proxy localhost:3000
    tls /path/to/tailscale/certs/cert.pem /path/to/tailscale/certs/key.pem
}
```

Estimated memory: Grafana ~50–80MB.
Initial dashboards (import from Grafana marketplace):
- Node Exporter Full (ID: 1860) — CPU, memory, disk, network
- Redis Dashboard (ID: 763) — connections, memory, commands/sec
- Qdrant — custom or community dashboard using the exposed metrics
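If these dashboards should later be provisioned as code rather than imported by hand, Grafana can also load dashboard JSON from a directory declared in a provider file. The following is a minimal sketch; the file name and the `/var/lib/grafana/dashboards` path are assumptions and would need a matching volume mount in `docker-compose.yml`.

```yaml
# Sketch: grafana/provisioning/dashboards/dashboards.yml (hypothetical file name)
apiVersion: 1

providers:
  - name: default              # provider label shown in Grafana
    orgId: 1
    type: file                 # read dashboard JSON files from disk
    disableDeletion: false
    updateIntervalSeconds: 60  # how often Grafana re-scans the directory
    options:
      path: /var/lib/grafana/dashboards  # assumed directory for exported dashboard JSON
```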
### Phase 3: Log Aggregation (Loki) — Deferred
Only add Loki when scattered logs actually cause debugging pain. When ready:
- Loki container (~40MB) + Promtail sidecar
- Scrape Docker container logs + launchd stderr + system-log.jsonl
- Grafana Explore for unified log search
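For reference when that point arrives, the compose addition would look roughly like the sketch below. This is an assumption-laden outline: image tags, the Promtail config path, and how Promtail reads Docker logs on macOS (Docker socket discovery vs. file mounts) all need evaluation at implementation time.

```yaml
# Sketch only: do not add in Phase 1 or 2
loki:
  image: grafana/loki:latest
  ports:
    - "3100:3100"
  volumes:
    - loki_data:/loki
  restart: unless-stopped

promtail:
  image: grafana/promtail:latest
  volumes:
    - ./promtail/promtail.yml:/etc/promtail/config.yml  # scrape targets defined here (to be written)
    - /var/run/docker.sock:/var/run/docker.sock:ro      # one option: Docker service discovery
  command:
    - '-config.file=/etc/promtail/config.yml'
  depends_on:
    - loki
  restart: unless-stopped
```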
### Alerting Rules (Phase 2)
Start with a small set of high-signal alerts. Add to Prometheus rules or Grafana alerts:
| Alert | Condition | Why it matters |
|---|---|---|
| Disk > 80% | node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.2 | Mac Mini has limited SSD; video ingest fills disk |
| Container restart | increase(container_restart_count[5m]) > 0 | Inngest/Qdrant/Redis crashed and restarted |
| Qdrant memory > 1GB | qdrant_memory_allocated_bytes > 1e9 | Vector store growing beyond budget on 8GB machine |
| Redis memory > 200MB | redis_memory_used_bytes > 2e8 | Approaching the 256MB maxmemory limit |
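If the Prometheus side is used, the table translates into a rules file along the lines of the sketch below (the same expressions can instead be created as Grafana alert rules). The file name, `for:` durations, and labels are assumptions; the file would also need a `rule_files:` entry in `prometheus.yml`.

```yaml
# Sketch: prometheus/alerts.yml (hypothetical file, referenced from rule_files: in prometheus.yml)
groups:
  - name: system-bus
    rules:
      - alert: DiskSpaceLow
        # Disk > 80% used; in practice, filter by mountpoint/fstype to skip tmpfs-style mounts
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Less than 20% disk space available on the Mac Mini"

      - alert: RedisMemoryHigh
        # Approaching the 256MB maxmemory limit
        expr: redis_memory_used_bytes > 2e8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory above 200MB"
```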
Alert delivery: start with Grafana → webhook → Inngest event (system/alert), which the system-logger can write to system-log.jsonl. Add Telegram/Slack/email later.
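One way to wire that delivery path without UI clicks is Grafana’s alerting provisioning (available since Grafana 9.1). The sketch below assumes a hypothetical `/alerts` HTTP endpoint on the system-bus worker that translates Grafana’s webhook payload into a `system/alert` Inngest event; the URL and port are placeholders, not an existing endpoint.

```yaml
# Sketch: grafana/provisioning/alerting/contact-points.yml (hypothetical file)
apiVersion: 1

contactPoints:
  - orgId: 1
    name: inngest-webhook
    receivers:
      - uid: inngest-webhook
        type: webhook
        settings:
          # Hypothetical system-bus endpoint that forwards alerts to Inngest as system/alert events
          url: http://host.docker.internal:3939/alerts
          httpMethod: POST
```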
- Affected paths: `~/Code/system-bus/docker-compose.yml`, new `~/Code/system-bus/prometheus/` dir, new `~/Code/system-bus/grafana/` dir, `~/.local/caddy/Caddyfile`
- Dependencies: Docker images `prom/prometheus`, `oliver006/redis_exporter`, `prom/node-exporter`, `grafana/grafana-oss`. Optionally `brew install node_exporter` for native macOS metrics.
- Patterns to follow: All new services go in the existing `docker-compose.yml`. Caddy routes follow existing pattern in Caddyfile. Credentials via `agent-secrets`.
- Patterns to avoid: Do NOT expose Grafana or Prometheus to the public internet — Tailscale only. Do NOT set Prometheus retention beyond 15 days on this machine. Do NOT add Loki in Phase 1 or 2.
- Configuration: Store `GRAFANA_ADMIN_PASSWORD` in agent-secrets. Prometheus config is file-based (no env vars needed).
- Migration steps: None — this is additive. Existing services are unchanged.
## Verification

### Phase 1

- `docker compose up` starts Prometheus, redis-exporter, and node-exporter without errors
- `curl -s http://localhost:9090/api/v1/targets` shows Qdrant, Redis, and node targets as `"health": "up"`
- `curl -s "http://localhost:9090/api/v1/query?query=up"` returns results for all scrape targets
- `docker stats --no-stream` shows Prometheus using < 120MB
- Total Docker memory (all containers) stays under 700MB
### Phase 2

- Grafana accessible at `https://grafana.<tailscale-hostname>/` from another tailnet device
- Prometheus datasource shows “Data source is working” in Grafana → Settings → Data Sources
- At least one dashboard shows live Qdrant and Redis metrics
- Disk usage alert fires when manually tested (e.g., lower threshold temporarily)
## Pros and Cons of the Options

### Option 1: Prometheus + Grafana stack
Industry-standard observability stack. Prometheus scrapes metrics endpoints on a schedule and stores time-series data. Grafana visualizes and alerts.
- Good, because Qdrant `/metrics` works out of the box — zero code changes
- Good, because `redis_exporter` and `node_exporter` are mature, single-binary, low-overhead
- Good, because Grafana has thousands of pre-built dashboards — no custom UI work
- Good, because alerting is built into Grafana — webhook, email, Slack, Telegram
- Good, because universal knowledge — any future collaborator (human or agent) knows this stack
- Neutral, because 30s scrape interval is plenty for a personal system (not real-time)
- Bad, because adds ~150–200MB RAM across 3–4 new containers
- Bad, because `node_exporter` in Docker on macOS has limited host visibility — may need native install
- Bad, because Grafana provisioning (dashboards-as-code) has a learning curve
- Bad, because it’s arguably over-built for 3 containers on one machine
### Option 2: Lightweight custom approach
Build a Bun/TypeScript service that periodically polls docker stats, Qdrant /metrics, Redis INFO, and writes to SQLite. Query via CLI (slog-style) or a minimal web UI.
- Good, because zero new containers — runs as another Inngest function or launchd service
- Good, because full control over what’s collected and how it’s stored
- Good, because SQLite + Bun is extremely lightweight (~10MB)
- Good, because fits the “build tools you understand” philosophy
- Bad, because building a metrics collector, storage layer, query API, and alerting from scratch is significant work
- Bad, because no pre-built dashboards — every visualization is custom
- Bad, because alerting would need to be built from scratch
- Bad, because not portable — future collaborators can’t leverage existing knowledge
- Bad, because it reinvents what Prometheus already does well
### Option 3: Do nothing yet
Continue using docker stats, curl, Inngest dashboard, and slog tail for ad-hoc checks. Revisit when the system grows.
- Good, because zero additional resource usage
- Good, because no new operational surface area to maintain
- Good, because the system is young — unclear what metrics actually matter yet
- Bad, because Qdrant metrics are generated and immediately discarded — wasted signal
- Bad, because pipeline failures are silent — you discover them hours or days later
- Bad, because “do nothing” contradicts AGENTS.md principle #2 (“Observability over opacity”)
- Bad, because as more Inngest functions are added, the blind spots compound
## More Information

### Related ADRs
- ADR-0002: Established vault as single source of truth and “observability over opacity” as a core principle
- ADR-0005: Multi-agent Inngest pipelines that produce the workflow traces needing observability
### Memory Budget
| Service | Current | After Phase 1 | After Phase 2 |
|---|---|---|---|
| Inngest | 78 MB | 78 MB | 78 MB |
| Qdrant | 312 MB | 312 MB | 312 MB |
| Redis | 21 MB | 21 MB | 21 MB |
| Prometheus | — | ~100 MB | ~100 MB |
| redis_exporter | — | ~10 MB | ~10 MB |
| node_exporter | — | ~15 MB | ~15 MB |
| Grafana | — | — | ~70 MB |
| Total | 411 MB | ~536 MB | ~606 MB |
Leaves ~7GB for macOS, the Bun worker, pi, Claude Code, and other processes. Acceptable.
### Revisit Triggers
- If total Docker memory exceeds 1GB → re-evaluate what’s running or increase to 16GB RAM
- If Prometheus disk exceeds 1GB → reduce retention or add downsampling
- If Grafana goes unused for 30 days → remove it and keep Prometheus + CLI queries
- If a better macOS-native monitoring solution emerges → evaluate replacing node_exporter
- If Inngest adds native Prometheus metrics → add as scrape target
### Open Questions

- Inngest `/metrics` auth: The endpoint returned “Authentication failed.” Need to investigate whether the signing key or event key unlocks it. If so, add as a Prometheus scrape target.
- macOS node_exporter: The Docker image mounts `/proc` and `/sys` which don’t exist on macOS. Need to test whether Homebrew `node_exporter` or a macOS-specific exporter is needed.
- Alert delivery: Webhook to Inngest is elegant but circular (monitoring depends on the thing being monitored). May need an independent notification path (e.g., direct Telegram bot).