ADR-0194

Inngest Runtime SQLite Forensics and Stale-Run Sweep Contract

  • Status: shipped
  • Date: 2026-03-02
  • Deciders: Joel, Panda
  • Relates to: ADR-0089, ADR-0139, ADR-0182, ADR-0191

Context

During a runtime restart window, repeated SDK dispatch failures ("Unable to reach SDK URL", "EOF writing request to SDK") left historical Inngest runs in an inconsistent state:

  • joelclaw runs --status RUNNING surfaced old health jobs,
  • cancelRun returned not found for those IDs,
  • some run details resolved as terminal while list metadata stayed stale.

This is a mask-layer mismatch: list-layer metadata kept reporting runs as active even though terminal execution records already existed.

Decision

1) Establish runtime truth hierarchy

For stale-run incidents, truth order is:

  1. joelclaw run <run-id> trace + errors
  2. Inngest terminal history rows (FunctionCompleted|FunctionFailed|FunctionCancelled)
  3. function_finishes
  4. runs list metadata

List output is not authoritative when these disagree.
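The disagreement between layers 2–3 and layer 4 can be probed with a single query. A minimal sketch on a throwaway mock DB, assuming simplified history and trace_runs tables; the real Inngest schema must be confirmed with .tables before running anything like this against /data/main.db:

```shell
# Mock DB illustrating the mask-layer mismatch; table/column names here are
# assumptions drawn from this ADR, not the verified Inngest schema.
db=$(mktemp -u).sqlite
sqlite3 "$db" <<'SQL'
CREATE TABLE history   (run_id TEXT, event_type TEXT);
CREATE TABLE trace_runs(run_id TEXT, status INTEGER);
-- r1 is terminal per history, but the list layer still says RUNNING (1 here)
INSERT INTO history    VALUES ('r1', 'FunctionCancelled');
INSERT INTO trace_runs VALUES ('r1', 1);
SQL
# Flag runs whose list status disagrees with a terminal history row.
stale=$(sqlite3 "$db" "SELECT t.run_id FROM trace_runs t \
  JOIN history h ON h.run_id = t.run_id \
  WHERE h.event_type IN ('FunctionCompleted','FunctionFailed','FunctionCancelled') \
    AND t.status <> 500;")
rm -f "$db"
echo "$stale"
```

Any run id this query returns is list metadata masking a terminal record, per the hierarchy above.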

2) Codify raw DB access path

Inngest runtime state is SQLite in the k8s StatefulSet pod/PVC:

  • pod: inngest-0
  • db: /data/main.db

This DB is the canonical recovery surface when GraphQL cancel endpoints cannot locate stale executions.
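Exploratory queries should never touch the live /data/main.db read-write. A sketch, assuming a snapshot was first pulled out of the pod with kubectl -n joelclaw cp inngest-0:/data/main.db ./snapshot; a mock file stands in for that snapshot here:

```shell
# Mock file stands in for a snapshot copied out of inngest-0.
snap=$(mktemp -u).sqlite
sqlite3 "$snap" 'CREATE TABLE trace_runs(run_id TEXT, status INTEGER);'
# -readonly guarantees the inspection session cannot mutate the file.
tables=$(sqlite3 -readonly "$snap" "SELECT name FROM sqlite_master WHERE type='table';")
rm -f "$snap"
echo "$tables"
```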

3) Require backup-first mutation discipline

Before any manual mutation:

kubectl -n joelclaw exec inngest-0 -- sqlite3 /data/main.db '.backup /data/main.db.pre-sweep-<ts>.sqlite'

No exceptions.
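The backup-then-verify step can be rehearsed locally before touching the pod; a sketch on a mock DB (in production the identical sqlite3 invocations run via kubectl -n joelclaw exec inngest-0 --):

```shell
# Rehearsal of the backup-first step on a local mock DB.
db=$(mktemp -u).sqlite
sqlite3 "$db" 'CREATE TABLE t(x); INSERT INTO t VALUES (1);'
ts=$(date -u +%Y%m%dT%H%M%S)
sqlite3 "$db" ".backup ${db}.pre-sweep-${ts}.sqlite"
# A backup that fails integrity_check is not a backup; gate mutations on "ok".
check=$(sqlite3 "${db}.pre-sweep-${ts}.sqlite" 'PRAGMA integrity_check;')
rm -f "$db" "${db}.pre-sweep-${ts}.sqlite"
echo "$check"
```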

4) Define stale-run terminalization contract

For candidates that are stale/non-cancellable:

  1. insert missing terminal history row: FunctionCancelled
  2. ensure function_finishes row exists
  3. then set trace_runs.status=500 (cancelled)

Never update trace_runs.status alone.
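The ordered contract above can be sketched as one transaction on a mock schema. Table and column names (history, function_finishes, trace_runs.status) are taken from this ADR's wording, not a verified Inngest schema:

```shell
db=$(mktemp -u).sqlite
sqlite3 "$db" <<'SQL'
CREATE TABLE history          (run_id TEXT, event_type TEXT);
CREATE TABLE function_finishes(run_id TEXT PRIMARY KEY);
CREATE TABLE trace_runs       (run_id TEXT PRIMARY KEY, status INTEGER);
INSERT INTO trace_runs VALUES ('r1', 1);  -- stale run, list layer says RUNNING
BEGIN;
INSERT INTO history VALUES ('r1', 'FunctionCancelled');   -- 1) terminal history
INSERT OR IGNORE INTO function_finishes VALUES ('r1');    -- 2) finish row
UPDATE trace_runs SET status = 500 WHERE run_id = 'r1';   -- 3) status, last
COMMIT;
SQL
status=$(sqlite3 "$db" "SELECT status FROM trace_runs WHERE run_id = 'r1';")
rm -f "$db"
echo "$status"
```

Wrapping the three steps in one transaction makes the ordering atomic: if any step fails, trace_runs.status is never left updated alone.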

5) Operator verification gate

After any sweep:

  • verify sample IDs with joelclaw run <run-id>
  • verify joelclaw runs --status RUNNING narrows to genuinely active runs
  • verify fresh canary events complete normally

Consequences

Good

  • deterministic recovery path for orphaned run metadata,
  • less operator thrash when cancelRun cannot resolve stale IDs,
  • explicit guardrails for invasive runtime repair.

Tradeoffs

  • manual DB surgery is high-risk and requires discipline,
  • list/detail mismatch may still occur until CLI/server-side sweep tooling is automated.

Non-goals

  • direct ad-hoc edits to unrelated Inngest tables,
  • replacing standard cancel/retry paths during normal operation,
  • skipping backup due to time pressure.

Required Skills (Preflight)

  • joelclaw
  • system-bus
  • k8s
  • o11y-logging

Implementation Plan (vector clock)

  1. V1: capture incident and runbook in skills/docs.
  2. V2: add explicit CLI stale-run diagnostics (RUN_STALE_SDK_UNREACHABLE).
  3. V3: ship joelclaw inngest sweep-stale-runs with dry-run + backup + apply modes.

Verification Checklist

  • stale candidate query identifies only old non-terminal health runs (10 found: 7 o11y-triage, 3 system-health)
  • backup artifact exists before mutation (/data/main.db.pre-sweep-20260304T192943.sqlite, verified=true)
  • terminal history + finish rows inserted for all candidates (10/10)
  • run detail for sampled IDs resolves terminal status (RUNNING list: 0)
  • fresh system/health.requested run completes successfully (COMPLETED)

Shipped Notes (2026-03-04)

The CLI command joelclaw inngest sweep-stale-runs was already fully implemented. The only missing piece was the sqlite3 CLI in the Inngest container (the base image ships libsqlite3-0 but not the CLI tool). Fixed by:

  1. apt-get install sqlite3 in running pod (immediate)
  2. Custom Dockerfile at k8s/inngest/Dockerfile extending inngest/inngest (permanent)
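A sketch of what that Dockerfile might contain; only the base image name and the sqlite3 package come from the note above, while the tag and cleanup step are assumptions:

```dockerfile
# Assumed tag; pin to the deployed Inngest version in practice.
FROM inngest/inngest:latest
# Base image ships libsqlite3-0 but not the CLI; add it for forensics.
RUN apt-get update \
 && apt-get install -y --no-install-recommends sqlite3 \
 && rm -rf /var/lib/apt/lists/*
```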

Operational Note (2026-03-04)

A separate loop execution (loop-mmanwnzv-b66xus) targeting ADR-0194 stories was found in a CHAIN_BROKEN state before story 1 work began (lost judge→plan handoff). It was cancelled during backlog cleanup. This did not change ADR-0194 status because the sweep command had already shipped; the cancelled loop is treated as stale execution metadata, not incomplete architecture work.