Inngest Runtime SQLite Forensics and Stale-Run Sweep Contract
- Status: shipped
- Date: 2026-03-02
- Deciders: Joel, Panda
- Relates to: ADR-0089, ADR-0139, ADR-0182, ADR-0191
Context
During a runtime restart window, repeated SDK dispatch failures (Unable to reach SDK URL, EOF writing request to SDK) left historical Inngest runs in an inconsistent state:
- `joelclaw runs --status RUNNING` surfaced old health jobs,
- `cancelRun` returned `not found` for those IDs,
- some run details resolved as terminal while list metadata stayed stale.
This is a mask-layer mismatch between list metadata and terminal execution records.
Decision
1) Establish runtime truth hierarchy
For stale-run incidents, truth order is:
- `joelclaw run <run-id>` trace + errors
- Inngest terminal history rows (`FunctionCompleted|FunctionFailed|FunctionCancelled`)
- `function_finishes`
- `runs` list metadata
List output is not authoritative when these disagree.
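The hierarchy above can be sketched as a small resolver. This is a minimal illustration, not part of `joelclaw`: the source keys (`run_detail`, `terminal_history`, `function_finishes`, `runs_list`) are hypothetical labels for the four sources, and it assumes the caller has already normalized each source's status into one shared vocabulary (for example `RUNNING`, `CANCELLED`).

```python
from typing import Optional

# Sources ordered from most to least authoritative, mirroring the truth order.
# These keys are hypothetical labels, not joelclaw or Inngest identifiers.
TRUTH_ORDER = ("run_detail", "terminal_history", "function_finishes", "runs_list")


def resolve_status(observed: dict) -> tuple:
    """Return (status, source) from the most authoritative source that reported one."""
    for source in TRUTH_ORDER:
        status: Optional[str] = observed.get(source)
        if status is not None:
            return status, source
    return None, None


def list_is_stale(observed: dict) -> bool:
    """True when list metadata disagrees with a more authoritative source."""
    status, source = resolve_status(observed)
    listed = observed.get("runs_list")
    return source not in (None, "runs_list") and listed is not None and listed != status
```

With `{"terminal_history": "CANCELLED", "runs_list": "RUNNING"}` the resolver returns the history status and flags the listing as stale. With only `runs_list` present, the listing is the sole evidence and is not flagged.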
2) Codify raw DB access path
Inngest runtime state is SQLite in the k8s StatefulSet pod/PVC:
- pod: `inngest-0`
- db: `/data/main.db`
This DB is the canonical recovery surface when GraphQL cancel endpoints cannot locate stale executions.
3) Require backup-first mutation discipline
Before any manual mutation:
`kubectl -n joelclaw exec inngest-0 -- sqlite3 /data/main.db '.backup /data/main.db.pre-sweep-<ts>.sqlite'`
No exceptions.
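One way to avoid hand-typing the `<ts>` placeholder is to build the command programmatically. The sketch below only constructs the argument list and does not execute anything. The helper name `backup_command` is hypothetical, and the timestamp format (`%Y%m%dT%H%M%S`, taken in UTC) is an assumption inferred from the artifact name recorded in the verification checklist.

```python
from datetime import datetime, timezone


def backup_command(now=None, namespace="joelclaw", pod="inngest-0", db="/data/main.db"):
    """Build the kubectl argv for a pre-sweep backup and return (argv, target_path)."""
    now = now or datetime.now(timezone.utc)
    ts = now.strftime("%Y%m%dT%H%M%S")  # assumed format, e.g. 20260304T192943
    target = f"{db}.pre-sweep-{ts}.sqlite"
    # Passed as an argv list, so the dot-command needs no shell quoting.
    argv = ["kubectl", "-n", namespace, "exec", pod, "--",
            "sqlite3", db, f".backup {target}"]
    return argv, target
```

The returned list can be handed to `subprocess.run(argv, check=True)`, so a failed backup raises before any mutation is attempted.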
4) Define stale-run terminalization contract
For candidates that are stale/non-cancellable:
- insert missing terminal history row: `FunctionCancelled`
- ensure `function_finishes` row exists
- then set `trace_runs.status=500` (cancelled)
Never update trace_runs.status alone.
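The ordering in this contract can be illustrated against a toy database. The schema below is a deliberately simplified, hypothetical stand-in: the real Inngest tables have more columns, and the `history` table name and the `function_finishes.status` value are assumptions. The point of the sketch is that all three writes happen in one transaction, so a failure midway leaves nothing half-applied.

```python
import sqlite3

CANCELLED = 500  # trace_runs.status value for cancelled, per the contract
TERMINAL_TYPES = ("FunctionCompleted", "FunctionFailed", "FunctionCancelled")

# Hypothetical minimal schema for illustration only.
SCHEMA = """
CREATE TABLE history (run_id TEXT NOT NULL, type TEXT NOT NULL);
CREATE TABLE function_finishes (run_id TEXT PRIMARY KEY, status TEXT);
CREATE TABLE trace_runs (run_id TEXT PRIMARY KEY, status INTEGER);
"""


def terminalize(conn: sqlite3.Connection, run_id: str) -> bool:
    """Apply the three-step contract atomically; skip runs that are already terminal."""
    with conn:  # commits on success, rolls back if any statement raises
        already_terminal = conn.execute(
            "SELECT 1 FROM history WHERE run_id = ? AND type IN (?, ?, ?)",
            (run_id, *TERMINAL_TYPES),
        ).fetchone()
        if already_terminal:
            return False  # not a stale candidate; leave it untouched
        # 1. insert the missing terminal history row
        conn.execute(
            "INSERT INTO history (run_id, type) VALUES (?, 'FunctionCancelled')",
            (run_id,),
        )
        # 2. ensure a function_finishes row exists (status text is a placeholder)
        conn.execute(
            "INSERT OR IGNORE INTO function_finishes (run_id, status) VALUES (?, 'Cancelled')",
            (run_id,),
        )
        # 3. only then flip the trace_runs status
        conn.execute(
            "UPDATE trace_runs SET status = ? WHERE run_id = ?",
            (CANCELLED, run_id),
        )
    return True
```

Calling `terminalize` a second time for the same run returns `False` and writes nothing, which keeps a repeated sweep from inserting duplicate history rows.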
5) Operator verification gate
After any sweep:
- verify sample IDs with `joelclaw run <run-id>`
- verify `joelclaw runs --status RUNNING` narrows to genuinely active runs
- verify fresh canary events complete normally
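The three checks above could be combined into a single pass/fail gate, roughly as follows. Everything here is illustrative: the function name, the input shapes, the status vocabulary beyond `RUNNING` and `COMPLETED`, and the one-hour age threshold are assumptions, and the inputs would have to be collected from the CLI output by the operator or a wrapper script.

```python
from datetime import datetime, timedelta, timezone

TERMINAL = {"COMPLETED", "FAILED", "CANCELLED"}  # assumed status vocabulary


def sweep_gate(sampled, running, canary_status, now, max_age=timedelta(hours=1)):
    """Return the list of failed checks; an empty list means the gate passes.

    sampled: run ID -> status from run detail, for the swept sample
    running: run ID -> start time (timezone-aware) for runs listed as RUNNING
    canary_status: final status of a fresh canary run
    """
    failures = []
    if any(status not in TERMINAL for status in sampled.values()):
        failures.append("sampled run not terminal")
    if any(now - started > max_age for started in running.values()):
        failures.append("old run still listed as RUNNING")
    if canary_status != "COMPLETED":
        failures.append("canary did not complete")
    return failures
```

Returning the failed checks rather than a bare boolean lets the operator see which part of the gate needs attention.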
Consequences
Good
- deterministic recovery path for orphaned run metadata,
- less operator thrash when `cancelRun` cannot resolve stale IDs,
- explicit guardrails for invasive runtime repair.
Tradeoffs
- manual DB surgery is high-risk and requires discipline,
- list/detail mismatch may still occur until CLI/server-side sweep tooling is automated.
Non-goals
- direct ad-hoc edits to unrelated Inngest tables,
- replacing standard cancel/retry paths during normal operation,
- skipping backup due to time pressure.
Required Skills (Preflight)
- `joelclaw`
- `system-bus`
- `k8s`
- `o11y-logging`
Implementation Plan (vector clock)
- V1: capture incident and runbook in skills/docs.
- V2: add explicit CLI stale-run diagnostics (`RUN_STALE_SDK_UNREACHABLE`).
- V3: ship `joelclaw inngest sweep-stale-runs` with dry-run + backup + apply modes.
Verification Checklist
- stale candidate query identifies only old non-terminal health runs (10 found: 7 o11y-triage, 3 system-health)
- backup artifact exists before mutation (`/data/main.db.pre-sweep-20260304T192943.sqlite`, verified=true)
- terminal history + finish rows inserted for all candidates (10/10)
- run detail for sampled IDs resolves terminal status (RUNNING list: 0)
- fresh `system/health.requested` run completes successfully (COMPLETED)
Shipped Notes (2026-03-04)
The CLI command `joelclaw inngest sweep-stale-runs` was already fully implemented. The only
missing piece was the `sqlite3` CLI in the Inngest container (the base image ships `libsqlite3-0`
but not the CLI tool). Fixed by:
- `apt-get install sqlite3` in running pod (immediate)
- Custom Dockerfile at `k8s/inngest/Dockerfile` extending `inngest/inngest` (permanent)
Operational Note (2026-03-04)
A separate loop execution (loop-mmanwnzv-b66xus) targeting ADR-0194 stories was found in a CHAIN_BROKEN state before story 1 work began (lost judge→plan handoff). It was cancelled during backlog cleanup. This did not change ADR-0194 status because the sweep command had already shipped; the cancelled loop is treated as stale execution metadata, not incomplete architecture work.