Consolidated Observability — Slog Provenance, Typesense Service, and ClickHouse
Context
Every slog entry and OTEL event is anonymous. There is no way to trace an entry back to the session that produced it or the machine it ran on. With multiple agents (pi sessions, codex workers, restate-worker pods, gateway daemon) writing concurrently, provenance is mandatory for meaningful observability.
Current state:
- slog: JSONL file at ~/Vault/system/system-log.jsonl. Fields: timestamp, action, tool, detail, reason. No provenance. Typesense system_log collection exists (1929 docs) but is a stale sync target with no session/system attribution.
- OTEL: Typesense otel_events collection (306K+ docs). Fields include source, component, action, but no sessionId or systemId. Can't distinguish which pi session or which machine emitted an event.
- No consolidated query surface: slog is file-only, OTEL is Typesense-only, Langfuse is separate. No single place to ask "what happened in session X on machine Y in the last hour?"
- No time-series analytics: Typesense is a search engine, not an analytics database. Aggregations over 300K+ OTEL events are expensive and limited.
Decision
Phase 1: Slog Provenance (immediate)
Add sessionId and systemId as required fields on slog write.
Schema change (LogEntry, WriteInput):
sessionId: Schema.NonEmptyString // e.g. "SleepyMagpie", "codex-task-42", "gateway"
systemId: Schema.NonEmptyString // e.g. "panda", "restate-worker", "vercel"
CLI change (slog write):
- Add --session and --system required flags
- Env var fallback: SLOG_SESSION_ID, SLOG_SYSTEM_ID
- Error if neither flag nor env var is present; anonymous writes are rejected
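The flag, env var, and error rules above could be sketched roughly as follows. This is a minimal illustration, not the slog implementation: resolveProvenance, ProvenanceFlags, and the flag-before-env precedence are assumptions made for the example. Only the flag and variable names come from this ADR.

```typescript
// Hypothetical helper for `slog write`; names and precedence are assumptions.
type Env = Record<string, string | undefined>;

interface Provenance {
  sessionId: string;
  systemId: string;
}

interface ProvenanceFlags {
  session?: string; // value of --session, if given
  system?: string; // value of --system, if given
}

// Returns the first non-empty candidate: the flag wins over the env var.
function pick(flag: string | undefined, envValue: string | undefined): string | undefined {
  for (const candidate of [flag, envValue]) {
    if (candidate !== undefined && candidate !== "") return candidate;
  }
  return undefined;
}

function resolveProvenance(flags: ProvenanceFlags, env: Env): Provenance {
  const sessionId = pick(flags.session, env["SLOG_SESSION_ID"]);
  const systemId = pick(flags.system, env["SLOG_SYSTEM_ID"]);

  // Collect every missing source so the error names all of them at once.
  const missing: string[] = [];
  if (sessionId === undefined) missing.push("--session / SLOG_SESSION_ID");
  if (systemId === undefined) missing.push("--system / SLOG_SYSTEM_ID");

  if (sessionId === undefined || systemId === undefined) {
    throw new Error(`slog write: missing provenance: ${missing.join(", ")}`);
  }
  return { sessionId, systemId };
}
```

A caller would pass the parsed flags and process.env. Taking the environment as a parameter keeps the helper testable without touching the real process state.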
SDK adapter change (@joelclaw/sdk slog-cli adapter):
- Add sessionId and systemId to LogWriteArgsSchema
- Pass through to slog write --session ... --system ...
Migration: Existing JSONL entries without provenance remain readable. normalizeRaw fills "unknown" for missing fields. Typesense system_log schema gets the new fields; old docs get backfilled with "unknown".
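As an illustration of the read-side migration, the sketch below fills "unknown" when a legacy line lacks provenance. The real normalizeRaw signature is not shown in this ADR, so the shape here is assumed, as is the empty-string default for the older fields.

```typescript
interface LogEntry {
  timestamp: string;
  action: string;
  tool: string;
  detail: string;
  reason: string;
  sessionId: string;
  systemId: string;
}

const UNKNOWN = "unknown";

// Assumed shape of normalizeRaw: takes one parsed JSONL object, returns a full entry.
function normalizeRaw(raw: Record<string, unknown>): LogEntry {
  // Reads a string field, falling back when it is absent, empty, or not a string.
  const str = (key: string, fallback: string): string => {
    const value = raw[key];
    return typeof value === "string" && value !== "" ? value : fallback;
  };
  return {
    timestamp: str("timestamp", ""), // "" default for legacy fields is an assumption
    action: str("action", ""),
    tool: str("tool", ""),
    detail: str("detail", ""),
    reason: str("reason", ""),
    sessionId: str("sessionId", UNKNOWN), // pre-provenance entries land here
    systemId: str("systemId", UNKNOWN),
  };
}

// Parses one line of the JSONL file into a normalized entry.
function parseLine(line: string): LogEntry {
  return normalizeRaw(JSON.parse(line) as Record<string, unknown>);
}
```

Because the default is applied on read, the JSONL file itself never has to be rewritten for old entries to stay usable.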
Phase 2: Slog → Typesense Dual-Write
Slog write dual-writes to both JSONL (audit trail, Vault sync) and Typesense system_log collection (search, filtering, network access).
- Update system_log Typesense schema to include sessionId, systemId as filterable/searchable fields
- Slog CLI writes to Typesense directly (or fires Inngest event for async indexing)
- Backfill existing JSONL entries with provenance defaults
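One way to picture the dual-write is that both targets receive payloads derived from the same entry. The sketch below assumes the timestamp is an ISO 8601 string. It also adds a hypothetical numeric timestampMs field, because Typesense sorts and range-filters on numeric fields. The field definitions use the standard Typesense schema keys (name, type, facet), but the choice of facet: true is an assumption.

```typescript
interface LogEntry {
  timestamp: string; // assumed ISO 8601
  action: string;
  tool: string;
  detail: string;
  reason: string;
  sessionId: string;
  systemId: string;
}

// Typesense document; timestampMs is a hypothetical field added for sorting.
interface SystemLogDoc extends LogEntry {
  timestampMs: number;
}

// Field definitions that a schema update for system_log might add.
const provenanceFields = [
  { name: "sessionId", type: "string", facet: true },
  { name: "systemId", type: "string", facet: true },
  { name: "timestampMs", type: "int64" },
];

// Audit-trail payload: one compact JSON object per line.
function toJsonlLine(entry: LogEntry): string {
  return JSON.stringify(entry) + "\n";
}

// Search-index payload: the same entry plus a numeric timestamp.
function toTypesenseDoc(entry: LogEntry): SystemLogDoc {
  const ms = Date.parse(entry.timestamp);
  if (isNaN(ms)) {
    throw new Error(`unparseable timestamp: ${entry.timestamp}`);
  }
  return { ...entry, timestampMs: ms };
}
```

Whether the Typesense payload is sent directly from the CLI or handed to an Inngest event, the JSONL line and the document stay consistent because they come from one LogEntry.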
Phase 3: OTEL Provenance
Add sessionId and systemId to otel_events collection schema. Update all OTEL emission sites:
- joelclaw otel emit CLI
- @joelclaw/telemetry package
- System-bus observe functions
- Gateway middleware
Same env var convention: SLOG_SESSION_ID / SLOG_SYSTEM_ID (shared across slog and OTEL — one identity per process).
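The "one identity per process" idea might look like the sketch below: read the two variables once, then stamp every outgoing event with the same pair. createStamper and the OtelEvent shape are hypothetical. Throwing when a variable is missing mirrors the slog rule and is an assumption for the OTEL path.

```typescript
type Env = Record<string, string | undefined>;

interface Provenance {
  sessionId: string;
  systemId: string;
}

// Minimal event shape; source, component, and action are the fields named in this ADR.
interface OtelEvent {
  source: string;
  component: string;
  action: string;
}

// Resolves identity once and returns a function that stamps it onto events.
function createStamper(env: Env) {
  const sessionId = env["SLOG_SESSION_ID"];
  const systemId = env["SLOG_SYSTEM_ID"];
  if (!sessionId || !systemId) {
    // Assumption: OTEL emission rejects anonymous events just like slog write.
    throw new Error("SLOG_SESSION_ID and SLOG_SYSTEM_ID must both be set");
  }
  const identity: Provenance = { sessionId, systemId };
  return <E extends OtelEvent>(otelEvent: E): E & Provenance => ({
    ...otelEvent,
    ...identity,
  });
}
```

Each emission site would create the stamper once at startup, typically from process.env, and pass every event through it before sending.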
Phase 4: Network Query Surface
Typesense is already accessible over Tailscale (https://panda.tail7af24.ts.net/typesense). Formalize:
- joelclaw o11y query: unified CLI command that searches across otel_events and system_log with session/system filters
- joelclaw o11y session <sessionId>: all events from a session, chronological
- joelclaw o11y system <systemId>: all events from a machine
- Network agents (restate-worker, future remote agents) can query Typesense directly
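To make the session and system filters concrete, the sketch below builds the parameters for a Typesense document search. The endpoint path and the q, filter_by, sort_by, and per_page parameters are standard Typesense search API. buildSearch, O11yFilter, and a numeric field named timestamp are assumptions for the example, and filter values are assumed to contain no characters that need escaping.

```typescript
interface O11yFilter {
  sessionId?: string;
  systemId?: string;
  sinceMs?: number; // lower bound on the time field, epoch milliseconds
}

interface SearchRequest {
  path: string;
  params: Record<string, string>;
}

// Assumed name of a numeric, sortable time field in both collections.
const TIME_FIELD = "timestamp";

function buildSearch(collection: string, filter: O11yFilter): SearchRequest {
  // Each clause is an exact match (:=) or a numeric comparison; && means AND.
  const clauses: string[] = [];
  if (filter.sessionId !== undefined) clauses.push(`sessionId:=${filter.sessionId}`);
  if (filter.systemId !== undefined) clauses.push(`systemId:=${filter.systemId}`);
  if (filter.sinceMs !== undefined) clauses.push(`${TIME_FIELD}:>=${filter.sinceMs}`);

  const params: Record<string, string> = {
    q: "*", // match all documents; the filters do the narrowing
    sort_by: `${TIME_FIELD}:asc`, // chronological order
    per_page: "250", // Typesense's maximum page size
  };
  if (clauses.length > 0) params["filter_by"] = clauses.join(" && ");

  return { path: `/collections/${collection}/documents/search`, params };
}
```

A command such as joelclaw o11y session could call this once per collection and merge the two result lists by time. The params object can be passed to URLSearchParams to form the query string.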
Phase 5: ClickHouse Evaluation
Typesense handles search well but is not designed for:
- Time-series aggregations (events per hour, error rate trends)
- Long-term retention with efficient compression
- OLAP-style queries across millions of events
- Native OTEL Collector protocol support
Evaluate ClickHouse as the consolidated analytics backend:
- Run as k8s StatefulSet alongside Typesense
- OTEL Collector support through the collector's ClickHouse exporter (receive spans/logs/metrics without a custom ingestion path)
- Slog entries written to ClickHouse for time-series analytics
- Langfuse traces optionally forwarded
- Typesense remains the search layer; ClickHouse is the analytics layer
- joelclaw o11y routes to the appropriate backend based on query type (search → Typesense, aggregation → ClickHouse)
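The routing rule in the last bullet is small enough to sketch directly. routeQuery and the clickhouseEnabled switch are hypothetical. Falling back to Typesense when ClickHouse has not been adopted is an assumption, chosen so the CLI keeps working before Phase 5 lands.

```typescript
type QueryKind = "search" | "aggregation";
type Backend = "typesense" | "clickhouse";

// Picks the backend for a query; aggregations go to ClickHouse only if it is available.
function routeQuery(kind: QueryKind, clickhouseEnabled: boolean): Backend {
  if (kind === "aggregation" && clickhouseEnabled) return "clickhouse";
  return "typesense";
}
```

With this shape, adopting ClickHouse later changes one flag rather than every call site of joelclaw o11y.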
Decision criteria for proceeding: If Phase 4 queries reveal that Typesense aggregation performance degrades past 500K events, or if cross-source correlation (slog + OTEL + Langfuse) becomes a frequent need, proceed with ClickHouse. Otherwise, Typesense alone is sufficient.
Consequences
- Every slog and OTEL entry traceable to session + machine
- Agents that shell to slog write without provenance will error, which forces discipline
- Typesense becomes the networked observability query surface
- ClickHouse (if adopted) adds a k8s pod but consolidates three separate data stores
- Migration: existing entries get "unknown" provenance (lossy but honest)
- All callers of joelclaw otel emit and slog write must be updated (system-bus functions, CLI, extensions, gateway)
Implementation Order
- ✅ ADR written
- → Phase 1: slog schema + CLI + SDK (this session)
- Phase 2: Typesense dual-write
- Phase 3: OTEL provenance
- Phase 4: Unified query CLI
- Phase 5: ClickHouse evaluation (gated on data volume / query needs)