ADR-0233 (accepted)

Consolidated Observability — Slog Provenance, Typesense Service, and ClickHouse

2026-03-17

Context

Every slog entry and OTEL event is anonymous. There is no way to trace an entry back to the session that produced it or the machine it ran on. With multiple agents (pi sessions, codex workers, restate-worker pods, gateway daemon) writing concurrently, provenance is mandatory for meaningful observability.

Current state:

slog: JSONL file at ~/Vault/system/system-log.jsonl. Fields: timestamp, action, tool, detail, reason. No provenance. Typesense system_log collection exists (1929 docs) but is a stale sync target with no session/system attribution.
OTEL: Typesense otel_events collection (306K+ docs). Fields include source, component, action, but no sessionId or systemId. Can’t distinguish which pi session or which machine emitted an event.
No consolidated query surface: slog is file-only, OTEL is Typesense-only, Langfuse is separate. No single place to ask “what happened in session X on machine Y in the last hour?”
No time-series analytics: Typesense is a search engine, not an analytics database. Aggregations over 300K+ OTEL events are expensive and limited.

Decision

Phase 1: Slog Provenance (immediate)

Add sessionId and systemId as required fields on slog write.

Schema change (LogEntry, WriteInput):

sessionId: Schema.NonEmptyString  // e.g. "SleepyMagpie", "codex-task-42", "gateway"
systemId: Schema.NonEmptyString   // e.g. "panda", "restate-worker", "vercel"

CLI change (slog write):

Add --session and --system flags
Env var fallback when a flag is omitted: SLOG_SESSION_ID, SLOG_SYSTEM_ID
Error if either value is supplied by neither its flag nor its env var; anonymous writes are rejected
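The flag-then-env resolution described above could be sketched as follows. This is a minimal illustration, not the slog CLI's actual code: resolveProvenance, firstNonEmpty, and the shape of the flags argument are hypothetical names, and the environment is passed in as a plain object so the function stays pure and easy to test.

```typescript
// Hypothetical sketch of strict provenance resolution for `slog write`.
interface Provenance {
  sessionId: string;
  systemId: string;
}

type Env = Record<string, string | undefined>;

// Returns the first candidate that is non-empty after trimming.
function firstNonEmpty(...candidates: Array<string | undefined>): string | undefined {
  for (const candidate of candidates) {
    const trimmed = candidate?.trim();
    if (trimmed) return trimmed;
  }
  return undefined;
}

// Flag wins over env var; a value missing from both sources is an error.
function resolveProvenance(
  flags: { session?: string; system?: string },
  env: Env,
): Provenance {
  const sessionId = firstNonEmpty(flags.session, env["SLOG_SESSION_ID"]);
  const systemId = firstNonEmpty(flags.system, env["SLOG_SYSTEM_ID"]);

  const missing: string[] = [];
  if (!sessionId) missing.push("--session or SLOG_SESSION_ID");
  if (!systemId) missing.push("--system or SLOG_SYSTEM_ID");
  if (!sessionId || !systemId) {
    throw new Error(`slog write: missing provenance (${missing.join("; ")})`);
  }
  return { sessionId, systemId };
}
```

Each field is resolved independently, so a caller may pass --session on the command line while the machine identity comes from SLOG_SYSTEM_ID.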

SDK adapter change (@joelclaw/sdk slog-cli adapter):

Add sessionId and systemId to LogWriteArgsSchema
Pass through to slog write --session ... --system ...

Migration: Existing JSONL entries without provenance remain readable. normalizeRaw fills "unknown" for missing fields. Typesense system_log schema gets the new fields; old docs get backfilled with "unknown".
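As an illustration of the read-side default, the sketch below fills "unknown" for entries written before provenance existed. The ADR names normalizeRaw but does not show its signature, so the line-in, object-out shape here is an assumption, as are the helper and type names.

```typescript
// Hypothetical sketch: the real normalizeRaw signature is not shown in this ADR.
const UNKNOWN_PROVENANCE = "unknown";

type RawEntry = Record<string, unknown>;
type NormalizedEntry = RawEntry & { sessionId: string; systemId: string };

// Keeps a non-empty string as-is; anything else becomes "unknown".
function provenanceOrUnknown(value: unknown): string {
  return typeof value === "string" && value.trim() !== "" ? value : UNKNOWN_PROVENANCE;
}

// Parses one JSONL line and fills provenance defaults without touching other fields.
function normalizeRaw(line: string): NormalizedEntry {
  const raw = JSON.parse(line) as RawEntry;
  return {
    ...raw,
    sessionId: provenanceOrUnknown(raw["sessionId"]),
    systemId: provenanceOrUnknown(raw["systemId"]),
  };
}
```

Defaulting on read keeps the JSONL file itself unmodified, so the audit trail never claims provenance it did not record.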

Phase 2: Slog → Typesense Dual-Write

Slog write dual-writes to both JSONL (audit trail, Vault sync) and Typesense system_log collection (search, filtering, network access).

Update system_log Typesense schema to include sessionId, systemId as filterable/searchable fields
Slog CLI writes to Typesense directly (or fires Inngest event for async indexing)
Backfill existing JSONL entries with provenance defaults
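One way to picture the direct-write variant is a function that prepares both targets for a single entry. The sketch below performs no I/O: it returns the JSONL line and a description of the HTTP request for Typesense's document-create endpoint (POST /collections/system_log/documents with the X-TYPESENSE-API-KEY header). planDualWrite, the LogEntry shape, and indexing the entry verbatim are assumptions for illustration; the real collection schema may require different field types.

```typescript
// Hypothetical sketch: builds both write targets for one entry without performing I/O.
interface LogEntry {
  timestamp: string;
  action: string;
  tool: string;
  detail: string;
  reason?: string;
  sessionId: string;
  systemId: string;
}

interface HttpRequest {
  method: "POST";
  url: string;
  headers: Record<string, string>;
  body: string;
}

interface DualWritePlan {
  jsonlLine: string; // appended to the JSONL audit trail
  typesenseRequest: HttpRequest; // sent to the Typesense documents endpoint
}

function planDualWrite(entry: LogEntry, typesenseBaseUrl: string, apiKey: string): DualWritePlan {
  const serialized = JSON.stringify(entry);
  const base = typesenseBaseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return {
    jsonlLine: serialized + "\n",
    typesenseRequest: {
      method: "POST",
      url: `${base}/collections/system_log/documents`,
      headers: {
        "Content-Type": "application/json",
        "X-TYPESENSE-API-KEY": apiKey,
      },
      body: serialized,
    },
  };
}
```

Separating planning from execution leaves the failure policy open: the caller can append the JSONL line first and treat the Typesense request as best-effort, or hand it to an async indexer instead.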

Phase 3: OTEL Provenance

Add sessionId and systemId to otel_events collection schema. Update all OTEL emission sites:

joelclaw otel emit CLI
@joelclaw/telemetry package
System-bus observe functions
Gateway middleware

Same env var convention: SLOG_SESSION_ID / SLOG_SYSTEM_ID (shared across slog and OTEL — one identity per process).
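Unlike slog write, OTEL emission is recorded in the implementation notes as resolving from explicit input, then the shared env vars, then "unknown". A minimal sketch of that lenient chain follows; resolveOtelProvenance and pickOrUnknown are hypothetical names, and the environment is again passed in as a plain object.

```typescript
// Hypothetical sketch of lenient provenance resolution for OTEL events.
type OtelEnv = Record<string, string | undefined>;

// Explicit input wins, then the env var, then the literal "unknown".
function pickOrUnknown(explicit: string | undefined, fromEnv: string | undefined): string {
  const fromInput = explicit?.trim();
  if (fromInput) return fromInput;
  const fromEnvironment = fromEnv?.trim();
  if (fromEnvironment) return fromEnvironment;
  return "unknown";
}

function resolveOtelProvenance(
  input: { sessionId?: string; systemId?: string },
  env: OtelEnv,
): { sessionId: string; systemId: string } {
  return {
    sessionId: pickOrUnknown(input.sessionId, env["SLOG_SESSION_ID"]),
    systemId: pickOrUnknown(input.systemId, env["SLOG_SYSTEM_ID"]),
  };
}
```

This resolver never throws, so a misconfigured process still emits telemetry, attributed to "unknown".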

Phase 4: Network Query Surface

Typesense is already accessible over Tailscale (https://panda.tail7af24.ts.net/typesense). Formalize:

joelclaw o11y query — unified CLI command that searches across otel_events and system_log with session/system filters
joelclaw o11y session <sessionId> — all events from a session, chronological
joelclaw o11y system <systemId> — all events from a machine
Network agents (restate-worker, future remote agents) can query Typesense directly
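For agents that query Typesense directly, the session and system filters map onto Typesense's filter_by syntax, and a single multi_search request can cover both collections. The sketch below only builds the request body for POST /multi_search. buildProvenanceSearch and exactMatch are hypothetical names, and sorting by a field called timestamp assumes both collections expose a sortable field with that name.

```typescript
// Hypothetical sketch: builds a Typesense multi_search body for provenance queries.
interface SearchRequest {
  collection: string;
  q: string;
  filter_by: string;
  sort_by: string;
}

interface MultiSearchBody {
  searches: SearchRequest[];
}

const O11Y_COLLECTIONS = ["otel_events", "system_log"] as const;

// Exact-match clause; the value is wrapped in backticks so special characters are not parsed.
function exactMatch(field: string, value: string): string {
  if (value.includes("`")) {
    throw new Error(`${field} must not contain a backtick`);
  }
  return field + ":=`" + value + "`";
}

function buildProvenanceSearch(filter: { sessionId?: string; systemId?: string }): MultiSearchBody {
  const clauses: string[] = [];
  if (filter.sessionId) clauses.push(exactMatch("sessionId", filter.sessionId));
  if (filter.systemId) clauses.push(exactMatch("systemId", filter.systemId));
  if (clauses.length === 0) {
    throw new Error("at least one of sessionId or systemId is required");
  }
  const filterBy = clauses.join(" && ");
  return {
    searches: O11Y_COLLECTIONS.map((collection) => ({
      collection,
      q: "*", // match everything; the filter does the narrowing
      filter_by: filterBy,
      sort_by: "timestamp:asc", // assumption: a sortable "timestamp" field exists in both collections
    })),
  };
}
```

Because results come back per collection, a chronological session view still needs a client-side merge of the two result lists by timestamp.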

Phase 5: ClickHouse Evaluation

Typesense handles search well but is not designed for:

Time-series aggregations (events per hour, error rate trends)
Long-term retention with efficient compression
OLAP-style queries across millions of events
Native OTEL Collector protocol support

Evaluate ClickHouse as the consolidated analytics backend:

Run as k8s StatefulSet alongside Typesense
Native OTEL Collector support (receive spans/logs/metrics directly)
Slog entries written to ClickHouse for time-series analytics
Langfuse traces optionally forwarded
Typesense remains the search layer; ClickHouse is the analytics layer
joelclaw o11y routes to appropriate backend based on query type (search → Typesense, aggregation → ClickHouse)
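The routing rule in the last item could be as small as the sketch below. routeQuery and the clickhouseEnabled switch are hypothetical; falling back to Typesense for aggregations while ClickHouse is not deployed is an assumption, chosen so the CLI keeps working before Phase 5 is decided.

```typescript
// Hypothetical sketch of backend selection for `joelclaw o11y`.
type QueryKind = "search" | "aggregation";
type Backend = "typesense" | "clickhouse";

function routeQuery(kind: QueryKind, clickhouseEnabled: boolean): Backend {
  // Aggregations go to ClickHouse only when it has been adopted.
  if (kind === "aggregation" && clickhouseEnabled) return "clickhouse";
  // Search always stays on Typesense; so do aggregations until ClickHouse exists.
  return "typesense";
}
```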

Decision criteria for proceeding: If Phase 4 queries reveal that Typesense aggregation performance degrades past 500K events, or if cross-source correlation (slog + OTEL + Langfuse) becomes a frequent need, proceed with ClickHouse. Otherwise, Typesense alone is sufficient.

Consequences

Every slog and OTEL entry traceable to session + machine
Agents that shell to slog write without provenance will error — forces discipline
Typesense becomes the networked observability query surface
ClickHouse (if adopted) adds a k8s pod but gives slog, OTEL, and Langfuse data a single analytics backend
Migration: existing entries get "unknown" provenance — lossy but honest
All callers of joelclaw otel emit and slog write must be updated (system-bus functions, CLI, extensions, gateway)

Implementation Order

✅ ADR written
✅ Phase 1: slog schema + CLI + SDK — sessionId/systemId required, backfill 1902 entries
✅ Phase 2: Typesense sync — system_log collection has sessionId/systemId (filterable, facetable), system-logger writes provenance from events, indexSystemLog() includes provenance, 2487 docs re-indexed
✅ Phase 3: OTEL provenance — OtelEvent type has sessionId/systemId, resolves from input → SLOG_SESSION_ID/SLOG_SYSTEM_ID env → "unknown". otel_events collection patched with fields on first emit. Worker env set (system-bus/panda). CLI emit passes provenance through SDK adapter.
✅ Phase 4: Unified query CLI — joelclaw o11y session <id> and joelclaw o11y system <id> multi_search across otel_events + system_log. joelclaw otel list/search accept --session and --system filters. Gateway plist updated with provenance env vars.
Phase 5: ClickHouse evaluation (gated on data volume / query needs)