ADR-0233 (accepted)

Consolidated Observability — Slog Provenance, Typesense Service, and ClickHouse

Context

Every slog entry and OTEL event is currently anonymous: nothing ties an entry back to the session that produced it or the machine it ran on. With multiple agents (pi sessions, codex workers, restate-worker pods, gateway daemon) writing concurrently, provenance is mandatory for meaningful observability.

Current state:

  • slog: JSONL file at ~/Vault/system/system-log.jsonl. Fields: timestamp, action, tool, detail, reason. No provenance. Typesense system_log collection exists (1929 docs) but is a stale sync target with no session/system attribution.
  • OTEL: Typesense otel_events collection (306K+ docs). Fields include source, component, action, but no sessionId or systemId. Can’t distinguish which pi session or which machine emitted an event.
  • No consolidated query surface: slog is file-only, OTEL is Typesense-only, Langfuse is separate. No single place to ask “what happened in session X on machine Y in the last hour?”
  • No time-series analytics: Typesense is a search engine, not an analytics database. Aggregations over 300K+ OTEL events are expensive and limited.

Decision

Phase 1: Slog Provenance (immediate)

Add sessionId and systemId as required fields on slog write.

Schema change (LogEntry, WriteInput):

sessionId: Schema.NonEmptyString  // e.g. "SleepyMagpie", "codex-task-42", "gateway"
systemId: Schema.NonEmptyString   // e.g. "panda", "restate-worker", "vercel"

CLI change (slog write):

  • Add --session and --system required flags
  • Env var fallback: SLOG_SESSION_ID, SLOG_SYSTEM_ID
  • Error if either value is missing from both its flag and its env var; no anonymous writes
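The resolution rule above can be sketched as a small pure function. This is a minimal illustration, not the actual slog source: the helper names (resolveProvenance, firstNonEmpty) and the error text are hypothetical, and flag-over-env precedence is assumed from the word "fallback".

```typescript
// Hypothetical helper; names and error text are illustrative only.
interface Provenance {
  sessionId: string;
  systemId: string;
}

interface ProvenanceFlags {
  session?: string; // value of --session, if given
  system?: string; // value of --system, if given
}

type Env = Record<string, string | undefined>;

// Returns the first candidate that is non-empty after trimming, else undefined.
function firstNonEmpty(...candidates: Array<string | undefined>): string | undefined {
  for (const candidate of candidates) {
    const trimmed = candidate?.trim();
    if (trimmed) return trimmed;
  }
  return undefined;
}

// Assumption: a flag wins over its env var; the env var is only a fallback.
function resolveProvenance(flags: ProvenanceFlags, env: Env): Provenance {
  const sessionId = firstNonEmpty(flags.session, env["SLOG_SESSION_ID"]);
  const systemId = firstNonEmpty(flags.system, env["SLOG_SYSTEM_ID"]);

  const missing: string[] = [];
  if (!sessionId) missing.push("--session or SLOG_SESSION_ID");
  if (!systemId) missing.push("--system or SLOG_SYSTEM_ID");

  if (!sessionId || !systemId) {
    // No anonymous writes: refuse instead of defaulting to "unknown".
    throw new Error(`slog write: missing provenance (${missing.join("; ")})`);
  }
  return { sessionId, systemId };
}
```

Taking the environment as a parameter rather than reading it globally keeps the rule testable; the CLI entry point would pass its real environment.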

SDK adapter change (@joelclaw/sdk slog-cli adapter):

  • Add sessionId and systemId to LogWriteArgsSchema
  • Pass through to slog write --session ... --system ...
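As a rough sketch of the pass-through, the adapter could append the two flags to whatever argument list it already builds for slog write. The function name withProvenanceFlags is hypothetical; only the --session and --system flags come from this ADR.

```typescript
interface Provenance {
  sessionId: string;
  systemId: string;
}

// Hypothetical adapter helper: appends provenance flags to an argv array
// the adapter has already assembled. Values are passed as separate argv
// elements, so no shell quoting is involved.
function withProvenanceFlags(baseArgs: readonly string[], p: Provenance): string[] {
  return [...baseArgs, "--session", p.sessionId, "--system", p.systemId];
}
```

For example, withProvenanceFlags(["write"], { sessionId: "codex-task-42", systemId: "panda" }) yields ["write", "--session", "codex-task-42", "--system", "panda"].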

Migration: Existing JSONL entries without provenance remain readable. On read, normalizeRaw fills "unknown" for missing provenance fields. The Typesense system_log schema gets the new fields; old docs get backfilled with "unknown".
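The read-side default might look like the following. The real normalizeRaw signature is not shown in this ADR, so the shape below (one JSONL line in, one LogEntry out, string-typed fields) is an assumption made for illustration.

```typescript
// Assumed entry shape: the five existing fields plus the two new ones.
interface LogEntry {
  timestamp: string;
  action: string;
  tool: string;
  detail: string;
  reason: string;
  sessionId: string;
  systemId: string;
}

const UNKNOWN = "unknown";

// Keeps a non-empty string as-is; anything else becomes "unknown".
function provenanceOrUnknown(value: unknown): string {
  return typeof value === "string" && value.trim() !== "" ? value : UNKNOWN;
}

// Sketch only: parses one JSONL line and fills provenance defaults.
function normalizeRaw(line: string): LogEntry {
  const raw = JSON.parse(line) as Record<string, unknown>;
  const str = (key: string): string => {
    const value = raw[key];
    return typeof value === "string" ? value : "";
  };
  return {
    timestamp: str("timestamp"),
    action: str("action"),
    tool: str("tool"),
    detail: str("detail"),
    reason: str("reason"),
    sessionId: provenanceOrUnknown(raw["sessionId"]),
    systemId: provenanceOrUnknown(raw["systemId"]),
  };
}
```

Defaulting happens only on read. The write path still rejects missing provenance, so "unknown" marks legacy entries rather than becoming a loophole for new ones.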

Phase 2: Slog → Typesense Dual-Write

Slog write dual-writes to both JSONL (audit trail, Vault sync) and Typesense system_log collection (search, filtering, network access).

  • Update system_log Typesense schema to include sessionId, systemId as filterable/searchable fields
  • Slog CLI writes to Typesense directly (or fires Inngest event for async indexing)
  • Backfill existing JSONL entries into system_log with the "unknown" provenance defaults
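One possible shape for the dual-write is sketched below. The ordering and failure policy are assumptions that this ADR does not settle: JSONL is written first as the audit trail and its failure aborts the write, while a Typesense failure is reported but not fatal. Both sinks are injected, so the sketch stays neutral on direct indexing versus an Inngest event.

```typescript
interface LogEntry {
  timestamp: string;
  action: string;
  sessionId: string;
  systemId: string;
}

// A sink persists one entry; real sinks would append to the JSONL file
// or index into the system_log collection (or fire an event that does).
type Sink = (entry: LogEntry) => Promise<void>;

interface DualWriteResult {
  jsonl: "ok";
  typesense: "ok" | "failed";
  error?: string;
}

async function dualWrite(
  entry: LogEntry,
  appendJsonl: Sink,
  indexTypesense: Sink,
): Promise<DualWriteResult> {
  // Assumption: the audit trail must succeed, so this error propagates.
  await appendJsonl(entry);
  try {
    await indexTypesense(entry);
    return { jsonl: "ok", typesense: "ok" };
  } catch (err) {
    // Assumption: search indexing is best-effort and can be re-synced
    // from JSONL later, so the failure is reported instead of thrown.
    const error = err instanceof Error ? err.message : String(err);
    return { jsonl: "ok", typesense: "failed", error };
  }
}
```

If strict consistency between the two stores matters more than write availability, the catch branch could rethrow instead.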

Phase 3: OTEL Provenance

Add sessionId and systemId to otel_events collection schema. Update all OTEL emission sites:

  • joelclaw otel emit CLI
  • @joelclaw/telemetry package
  • System-bus observe functions
  • Gateway middleware

Same env var convention: SLOG_SESSION_ID / SLOG_SYSTEM_ID (shared across slog and OTEL — one identity per process).
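To illustrate "one identity per process", an emission site could bind the identity once and stamp every event with it. The names makeEmitter and StampedEvent are hypothetical; the event fields source, component, and action are the ones listed under Current state.

```typescript
interface Provenance {
  sessionId: string;
  systemId: string;
}

interface OtelEvent {
  source: string;
  component: string;
  action: string;
}

type StampedEvent = OtelEvent & Provenance;

// Hypothetical factory: binds the process identity once and returns an
// emit function that attaches it to every event before sending.
function makeEmitter(
  identity: Provenance,
  send: (event: StampedEvent) => void,
): (event: OtelEvent) => StampedEvent {
  const frozen = Object.freeze({ ...identity });
  return (event) => {
    // Identity is spread last so a caller cannot override it per event.
    const stamped: StampedEvent = { ...event, ...frozen };
    send(stamped);
    return stamped;
  };
}
```

The identity passed in would come from the same flag and env var resolution that slog write uses, so both streams carry matching sessionId and systemId values.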

Phase 4: Network Query Surface

Typesense is already accessible over Tailscale (https://panda.tail7af24.ts.net/typesense). Formalize:

  • joelclaw o11y query — unified CLI command that searches across otel_events and system_log with session/system filters
  • joelclaw o11y session <sessionId> — all events from a session, chronological
  • joelclaw o11y system <systemId> — all events from a machine
  • Network agents (restate-worker, future remote agents) can query Typesense directly
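A sketch of how joelclaw o11y query might build one Typesense multi_search request covering both collections. It assumes, without confirmation from this ADR, that both collections expose a numeric, sortable timestamp field and that IDs match a conservative character set; depending on the Typesense version, a query_by parameter may also be needed.

```typescript
interface O11yFilter {
  sessionId?: string;
  systemId?: string;
  sinceEpochMs?: number;
}

// Assumption: IDs are restricted to a conservative set so they can be
// placed in filter_by without any escaping.
const SAFE_ID = /^[A-Za-z0-9._-]+$/;

function idClause(field: string, value: string | undefined): string[] {
  if (value === undefined) return [];
  if (!SAFE_ID.test(value)) throw new Error(`unsafe ${field}: ${value}`);
  return [`${field}:=${value}`]; // ":=" is Typesense's exact-match operator
}

function buildFilterBy(f: O11yFilter): string {
  const clauses = [
    ...idClause("sessionId", f.sessionId),
    ...idClause("systemId", f.systemId),
  ];
  // Assumption: "timestamp" is stored as epoch milliseconds.
  if (f.sinceEpochMs !== undefined) clauses.push(`timestamp:>=${f.sinceEpochMs}`);
  return clauses.join(" && ");
}

// Request body for POST /multi_search: one search per collection.
function buildMultiSearch(f: O11yFilter) {
  const filter_by = buildFilterBy(f);
  return {
    searches: ["otel_events", "system_log"].map((collection) => ({
      collection,
      q: "*",
      filter_by,
      sort_by: "timestamp:asc",
    })),
  };
}
```

Here buildFilterBy({ sessionId: "SleepyMagpie", systemId: "panda" }) produces "sessionId:=SleepyMagpie && systemId:=panda". Typesense returns one result set per collection, so the CLI would still merge the two by timestamp to present a single chronological view.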

Phase 5: ClickHouse Evaluation

Typesense handles search well but is not designed for:

  • Time-series aggregations (events per hour, error rate trends)
  • Long-term retention with efficient compression
  • OLAP-style queries across millions of events
  • Native OTEL Collector protocol support

Evaluate ClickHouse as the consolidated analytics backend:

  • Run as k8s StatefulSet alongside Typesense
  • Native OTEL Collector support (receive spans/logs/metrics directly)
  • Slog entries written to ClickHouse for time-series analytics
  • Langfuse traces optionally forwarded
  • Typesense remains the search layer; ClickHouse is the analytics layer
  • joelclaw o11y routes to appropriate backend based on query type (search → Typesense, aggregation → ClickHouse)
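The routing rule in the last bullet could be as small as the function below. The fallback to Typesense when ClickHouse has not been adopted is an assumption, and the table and column names in the sample query are hypothetical; toStartOfHour and the {name:Type} parameter syntax are standard ClickHouse features.

```typescript
type QueryKind = "search" | "aggregation";
type Backend = "typesense" | "clickhouse";

// Assumption: until ClickHouse is adopted, aggregations fall back to
// Typesense rather than failing.
function routeQuery(kind: QueryKind, clickhouseEnabled: boolean): Backend {
  if (kind === "aggregation" && clickhouseEnabled) return "clickhouse";
  return "typesense";
}

// Example of the kind of aggregation that would be routed to ClickHouse:
// events per hour for one session. Table and column names are hypothetical.
const EVENTS_PER_HOUR_SQL = `
  SELECT toStartOfHour(timestamp) AS hour, count() AS events
  FROM otel_events
  WHERE sessionId = {sessionId:String}
  GROUP BY hour
  ORDER BY hour
`;
```

Keeping the decision in one function means the Phase 5 outcome only changes the clickhouseEnabled input, not the callers.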

Decision criteria for proceeding: If Phase 4 queries reveal that Typesense aggregation performance degrades past 500K events, or if cross-source correlation (slog + OTEL + Langfuse) becomes a frequent need, proceed with ClickHouse. Otherwise, Typesense alone is sufficient.

Consequences

  • Every slog and OTEL entry traceable to session + machine
  • Agents that shell to slog write without provenance will error — forces discipline
  • Typesense becomes the networked observability query surface
  • ClickHouse (if adopted) adds a k8s pod but gives the three separate sources (slog, OTEL, Langfuse) a single analytics backend
  • Migration: existing entries get "unknown" provenance — lossy but honest
  • All callers of joelclaw otel emit and slog write must be updated (system-bus functions, CLI, extensions, gateway)

Implementation Order

  1. ✅ ADR written
  2. → Phase 1: slog schema + CLI + SDK (this session)
  3. Phase 2: Typesense dual-write
  4. Phase 3: OTEL provenance
  5. Phase 4: Unified query CLI
  6. Phase 5: ClickHouse evaluation (gated on data volume / query needs)