ADR-0194

Inngest Runtime SQLite Forensics and Stale-Run Sweep Contract

  • Status: shipped
  • Date: 2026-03-02
  • Deciders: Joel, Panda
  • Relates to: ADR-0089, ADR-0139, ADR-0182, ADR-0191

Context

During a runtime restart window, repeated SDK dispatch failures ("Unable to reach SDK URL", "EOF writing request to SDK") left historical Inngest runs in an inconsistent state:

  • joelclaw runs --status RUNNING surfaced old health jobs,
  • cancelRun returned not found for those IDs,
  • some run details resolved as terminal while list metadata stayed stale.

This is a mask-layer mismatch: list-layer metadata kept reporting runs as active even though terminal execution records already existed.

Decision

1) Establish runtime truth hierarchy

For stale-run incidents, truth order is:

  1. joelclaw run <run-id> trace + errors
  2. Inngest terminal history rows (FunctionCompleted|FunctionFailed|FunctionCancelled)
  3. function_finishes
  4. runs list metadata

List output is not authoritative when these disagree.
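The disagreement between layers 2–3 and layer 4 can be probed with a single query. A minimal sketch on a throwaway mock DB, assuming simplified history and trace_runs tables; the real Inngest schema must be confirmed with .tables before running anything like this against /data/main.db:

```shell
# Mock DB illustrating the mask-layer mismatch; table/column names here are
# assumptions drawn from this ADR, not the verified Inngest schema.
db=$(mktemp -u).sqlite
sqlite3 "$db" <<'SQL'
CREATE TABLE history   (run_id TEXT, event_type TEXT);
CREATE TABLE trace_runs(run_id TEXT, status INTEGER);
-- r1 is terminal per history, but the list layer still says RUNNING (1 here)
INSERT INTO history    VALUES ('r1', 'FunctionCancelled');
INSERT INTO trace_runs VALUES ('r1', 1);
SQL
# Flag runs whose list status disagrees with a terminal history row.
stale=$(sqlite3 "$db" "SELECT t.run_id FROM trace_runs t \
  JOIN history h ON h.run_id = t.run_id \
  WHERE h.event_type IN ('FunctionCompleted','FunctionFailed','FunctionCancelled') \
    AND t.status <> 500;")
rm -f "$db"
echo "$stale"
```

Any run id this query returns is list metadata masking a terminal record, per the hierarchy above.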

2) Codify raw DB access path

Inngest runtime state is SQLite in the k8s StatefulSet pod/PVC:

  • pod: inngest-0
  • db: /data/main.db

This DB is the canonical recovery surface when GraphQL cancel endpoints cannot locate stale executions.
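Exploratory queries should never touch the live /data/main.db read-write. A sketch, assuming a snapshot was first pulled out of the pod with kubectl -n joelclaw cp inngest-0:/data/main.db ./snapshot; a mock file stands in for that snapshot here:

```shell
# Mock file stands in for a snapshot copied out of inngest-0.
snap=$(mktemp -u).sqlite
sqlite3 "$snap" 'CREATE TABLE trace_runs(run_id TEXT, status INTEGER);'
# -readonly guarantees the inspection session cannot mutate the file.
tables=$(sqlite3 -readonly "$snap" "SELECT name FROM sqlite_master WHERE type='table';")
rm -f "$snap"
echo "$tables"
```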

3) Require backup-first mutation discipline

Before any manual mutation:

kubectl -n joelclaw exec inngest-0 -- sqlite3 /data/main.db '.backup /data/main.db.pre-sweep-<ts>.sqlite'

No exceptions.
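The backup-then-verify step can be rehearsed locally before touching the pod; a sketch on a mock DB (in production the identical sqlite3 invocations run via kubectl -n joelclaw exec inngest-0 --):

```shell
# Rehearsal of the backup-first step on a local mock DB.
db=$(mktemp -u).sqlite
sqlite3 "$db" 'CREATE TABLE t(x); INSERT INTO t VALUES (1);'
ts=$(date -u +%Y%m%dT%H%M%S)
sqlite3 "$db" ".backup ${db}.pre-sweep-${ts}.sqlite"
# A backup that fails integrity_check is not a backup; gate mutations on "ok".
check=$(sqlite3 "${db}.pre-sweep-${ts}.sqlite" 'PRAGMA integrity_check;')
rm -f "$db" "${db}.pre-sweep-${ts}.sqlite"
echo "$check"
```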

4) Define stale-run terminalization contract

For candidates that are stale/non-cancellable:

  1. insert missing terminal history row: FunctionCancelled
  2. ensure function_finishes row exists
  3. then set trace_runs.status=500 (cancelled)

Never update trace_runs.status alone.
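The ordered contract above can be sketched as one transaction on a mock schema. Table and column names (history, function_finishes, trace_runs.status) are taken from this ADR's wording, not a verified Inngest schema:

```shell
db=$(mktemp -u).sqlite
sqlite3 "$db" <<'SQL'
CREATE TABLE history          (run_id TEXT, event_type TEXT);
CREATE TABLE function_finishes(run_id TEXT PRIMARY KEY);
CREATE TABLE trace_runs       (run_id TEXT PRIMARY KEY, status INTEGER);
INSERT INTO trace_runs VALUES ('r1', 1);  -- stale run, list layer says RUNNING
BEGIN;
INSERT INTO history VALUES ('r1', 'FunctionCancelled');   -- 1) terminal history
INSERT OR IGNORE INTO function_finishes VALUES ('r1');    -- 2) finish row
UPDATE trace_runs SET status = 500 WHERE run_id = 'r1';   -- 3) status, last
COMMIT;
SQL
status=$(sqlite3 "$db" "SELECT status FROM trace_runs WHERE run_id = 'r1';")
rm -f "$db"
echo "$status"
```

Wrapping the three steps in one transaction makes the ordering atomic: if any step fails, trace_runs.status is never left updated alone.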

5) Operator verification gate

After any sweep:

  • verify sample IDs with joelclaw run <run-id>
  • verify joelclaw runs --status RUNNING narrows to genuinely active runs
  • verify fresh canary events complete normally

Consequences

Good

  • deterministic recovery path for orphaned run metadata,
  • less operator thrash when cancelRun cannot resolve stale IDs,
  • explicit guardrails for invasive runtime repair.

Tradeoffs

  • manual DB surgery is high-risk and requires discipline,
  • list/detail mismatch may still occur until CLI/server-side sweep tooling is automated.

Non-goals

  • direct ad-hoc edits to unrelated Inngest tables,
  • replacing standard cancel/retry paths during normal operation,
  • skipping backup due to time pressure.

Required Skills (Preflight)

  • joelclaw
  • system-bus
  • k8s
  • o11y-logging

Implementation Plan (vector clock)

  1. V1: capture incident and runbook in skills/docs.
  2. V2: add explicit CLI stale-run diagnostics (RUN_STALE_SDK_UNREACHABLE).
  3. V3: ship joelclaw inngest sweep-stale-runs with dry-run + backup + apply modes.

Verification Checklist

  • stale candidate query identifies only old non-terminal health runs (10 found: 7 o11y-triage, 3 system-health)
  • backup artifact exists before mutation (/data/main.db.pre-sweep-20260304T192943.sqlite, verified=true)
  • terminal history + finish rows inserted for all candidates (10/10)
  • run detail for sampled IDs resolves terminal status (RUNNING list: 0)
  • fresh system/health.requested run completes successfully (COMPLETED)

Shipped Notes (2026-03-04)

The CLI command joelclaw inngest sweep-stale-runs was already fully implemented. The only missing piece was the sqlite3 CLI in the Inngest container (the base image ships libsqlite3-0 but not the CLI tool). Fixed by:

  1. apt-get install sqlite3 in running pod (immediate)
  2. Custom Dockerfile at k8s/inngest/Dockerfile extending inngest/inngest (permanent)
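A sketch of what that Dockerfile might contain; only the base image name and the sqlite3 package come from the note above, while the tag and cleanup step are assumptions:

```dockerfile
# Assumed tag; pin to the deployed Inngest version in practice.
FROM inngest/inngest:latest
# Base image ships libsqlite3-0 but not the CLI; add it for forensics.
RUN apt-get update \
 && apt-get install -y --no-install-recommends sqlite3 \
 && rm -rf /var/lib/apt/lists/*
```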

Operational Note (2026-03-04)

A separate loop execution (loop-mmanwnzv-b66xus) targeting ADR-0194 stories was found in a CHAIN_BROKEN state before story 1 work began (lost judge→plan handoff). It was cancelled during backlog cleanup. This did not change ADR-0194 status because the sweep command had already shipped; the cancelled loop is treated as stale execution metadata, not incomplete architecture work.