ADR-0201

Workflow Runtime Deployment Strategy + Runtime Alternatives Research Plan

Status

Proposed

Context

We have recurrent loop stalls around runtime start/restart windows. Current failure signatures:

  • CHAIN_BROKEN and ORPHANED_CLAIM in joelclaw loop diagnose all
  • agent-loop-test-writer runs stuck in RUNNING
  • Inngest trace finalization error: Unable to reach SDK URL
  • Recovery currently requires active repair (loop diagnose --fix, stale sweep)

This is orchestration/finalization fragility during startup or deployment transitions, not a story-logic failure.

Background summary

joelclaw’s Inngest layer is the durable execution backbone (agent loops, gateway fan-out, cron checks, memory/discovery/content pipelines). Recent incidents showed restart/cutover fragility concentrated around finalization callback reachability and stale run state, not business-logic correctness. This ADR therefore prioritizes an 80/20 reliability uplift first (tight deploy/runtime gates on current Inngest path), while running a parallel OSS alternatives analysis for a possible combo-runtime future.

  • ADR-0156 — Graceful worker restart (accepted baseline for safer worker transitions)
  • ADR-0195 — Mandatory memory participation contract
  • ADR-0196 — Cancel-on-new-message behavior
  • ADR-0197 — Sub-agent execution via step.invoke
  • ADR-0198 — Adaptive heartbeat behavior
  • ADR-0199 — Close the loop / reflect-brain operational contract
  • ADR-0200 — Force-Enforce-Verify pattern (policy backbone for mandatory gates)

Decision

The decision is to execute two explicit parallel tracks (A + B):

  1. Track A (80/20 now, Inngest-only): deploy hardening gates with minimal added complexity, focused on preventing startup/finalization gaps.
  2. Track B (in parallel): evaluate OSS runtime alternatives with a scored matrix and cluster-fit analysis to decide a combo architecture, not a forced one-engine migration.

Track A ships now. Track B runs concurrently and informs a follow-up recommendation ADR.


Current joelclaw workload profile (ground truth)

Scan of packages/system-bus/src/inngest/functions (134 files):

  • step.run: 609
  • step.sendEvent: 97
  • step.sleep: 11
  • step.invoke: 3
  • cancelOn: 10
  • concurrency: 81
  • throttle: 9
  • debounce: 2
  • cron: 39
  • step.waitForEvent: 0
  • step.waitForSignal: 0
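
Counts like the above can be regenerated with a small scan script. The sketch below is a hypothetical helper, not the actual scan tooling; the directory path is taken from the scan description:

```typescript
// Hedged sketch: regenerate primitive-usage counts over the functions dir.
// countOccurrences / gatherSources are illustrative names, not joelclaw APIs.
import { readFileSync, readdirSync, statSync } from "node:fs";
import { join } from "node:path";

// Count non-overlapping occurrences of each pattern in a source string.
// Note: plain substring matching, so "concurrency" also matches config keys.
export function countOccurrences(
  source: string,
  patterns: string[],
): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const p of patterns) {
    counts[p] = source.split(p).length - 1;
  }
  return counts;
}

// Recursively concatenate .ts sources under a directory.
export function gatherSources(dir: string): string {
  let out = "";
  for (const entry of readdirSync(dir)) {
    const full = join(dir, entry);
    if (statSync(full).isDirectory()) out += gatherSources(full);
    else if (full.endsWith(".ts")) out += readFileSync(full, "utf8");
  }
  return out;
}

// Usage against the scanned directory:
// countOccurrences(gatherSources("packages/system-bus/src/inngest/functions"),
//   ["step.run", "step.sendEvent", "step.sleep", "cancelOn", "concurrency"]);
```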

What this implies

Our runtime needs are mostly:

  1. Durable step execution + retries
  2. Event-driven fan-out / chaining
  3. Strong concurrency controls + cancellation semantics
  4. Cron/scheduled workflows
  5. Long-running task finalization reliability
  6. Good run introspection for operations

We are not currently dependent on waitForEvent/waitForSignal semantics in core loops.
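
The dominant workload shape (needs 1 and 2 above) can be sketched as durable steps plus a fan-out event. The `Step` interface below is a local stand-in for illustration, not the Inngest SDK, and all names are hypothetical:

```typescript
// Hedged sketch of the dominant workload shape: retried durable steps
// followed by an event emission. Step is a stub, not the real SDK type.
interface Step {
  run<T>(name: string, fn: () => T | Promise<T>): Promise<T>;
  sendEvent(id: string, event: { name: string; data: unknown }): Promise<void>;
}

// Typical function body: a couple of steps, then event-driven chaining.
async function processItem(step: Step, itemId: string): Promise<string> {
  const fetched = await step.run("fetch", () => `payload:${itemId}`);
  const result = await step.run("transform", () => fetched.toUpperCase());
  await step.sendEvent("emit-done", {
    name: "item/processed",
    data: { itemId, result },
  });
  return result;
}

// Minimal in-memory Step for local reasoning/tests (no durability, no retries).
function memoryStep(events: { name: string; data: unknown }[]): Step {
  return {
    run: async (_name, fn) => fn(),
    sendEvent: async (_id, event) => {
      events.push(event);
    },
  };
}
```

Any candidate runtime must make this shape durable and retryable; the in-memory stub only exists so the shape itself is runnable.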


Track A (80/20 now): Inngest deploy hardening gates (minimal complexity)

A1. Immediate must-have gate set

  • Warm — pass condition: new worker registers and stays healthy for the warm window; failure action: block cutover.
  • Callback probe — pass condition: synthetic finalization callback succeeds end-to-end; failure action: block release.
  • Drain — pass condition: old worker enters no-new-work mode and in-flight work drops below threshold; failure action: hold shutdown.
  • Bake — pass condition: post-cutover bake window shows no stale-run/chain-break regression; failure action: mark deploy unhealthy.
  • Rollback-on-breach — pass condition: any guardrail breach auto-reactivates the prior worker; failure action: automatic rollback.
  • Single CLI verify surface — pass condition: joelclaw deploy verify-runtime returns all gate statuses + final verdict; failure action: deploy fails if the command fails.

A2. Immediate implementation set (scope lock)

  1. Add joelclaw deploy verify-runtime as the single verification command (JSON output with per-gate status).
  2. Wire deploy scripts to require, in order: warm → callback probe → drain → cutover → bake.
  3. Add rollback hook that re-enables prior worker on any gate breach.
  4. Gate health on existing loop integrity checks (loop diagnose + stale-run thresholds).
  5. Emit OTEL stage events (deploy.runtime.warm, deploy.runtime.callback_probe, deploy.runtime.drain, deploy.runtime.bake, deploy.runtime.rollback).
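
The ordered gate evaluation behind verify-runtime can be sketched as follows. Gate names and the JSON result shape are assumptions for illustration, not the shipped CLI contract:

```typescript
// Hedged sketch: run gates in the required order (warm → callback probe →
// drain → cutover → bake), stop at the first breach so the rollback hook can
// fire early, and return a JSON-serializable per-gate status plus verdict.
type GateResult = { gate: string; passed: boolean };

interface Gate {
  name: string;
  check: () => Promise<boolean>;
}

async function verifyRuntime(gates: Gate[]): Promise<{
  gates: GateResult[];
  verdict: "healthy" | "breached";
}> {
  const results: GateResult[] = [];
  for (const gate of gates) {
    const passed = await gate.check();
    results.push({ gate: gate.name, passed });
    // First breach short-circuits: later gates are meaningless once one fails.
    if (!passed) return { gates: results, verdict: "breached" };
  }
  return { gates: results, verdict: "healthy" };
}
```

A deploy script wrapping this would exit non-zero on a "breached" verdict and invoke the rollback hook from item 3.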

A3. Deferred hardening (not in immediate set)

  • Progressive traffic cutover percentages (10/50/100)
  • Artifact promotion attestations and hash gates
  • Expanded chaos automation beyond callback/finalization class

Track B (in parallel): OSS runtime alternatives matrix + combo architecture

B1. Candidate set

  • Temporal (temporalio/sdk-typescript)
  • Trigger.dev (triggerdotdev/trigger.dev)
  • Hatchet (hatchet-dev/hatchet)
  • Restate (restatedev/restate)

B2. Scored matrix (joelclaw fit + translation risk)

Scoring rubric: 1 = weak, 5 = strong. Translation risk is inverse (1 = low rewrite risk, 5 = high rewrite risk).

  • Hatchet — workload fit 4, cluster-fit 4, translation risk 3. Strong first pilot candidate for queue/event-heavy shadow workloads.
  • Temporal — workload fit 5, cluster-fit 3, translation risk 5. Highest durability ceiling; best kept for later targeted critical flows.
  • Trigger.dev — workload fit 4, cluster-fit 3, translation risk 4. Good TS ergonomics; validate self-host SLO behavior before broad use.
  • Restate — workload fit 3, cluster-fit 3, translation risk 5. Interesting exactly-once model; higher conceptual translation cost.

B3. Combo recommendation framing (explicit)

  • Do not force a one-engine migration.
  • Keep Inngest as the default engine for current production event fan-out and cron-heavy workflows.
  • Evaluate a second engine per-workload where it materially improves reliability or operability.
  • Use adapter boundaries and shadow mode first; migrate only proven workload slices.
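
The adapter-boundary idea above can be sketched as explicit per-workload routing, defaulting everything to Inngest. Engine names, the route shape, and the example workload id are illustrative assumptions:

```typescript
// Hedged sketch: route only explicitly promoted workload slices to a second
// engine; everything else stays on the default Inngest path.
type Engine = "inngest" | "pilot";

interface WorkloadRoute {
  workload: string; // e.g. a hypothetical "github-webhook-dispatch"
  engine: Engine;
  shadowOnly: boolean; // pilot engines start in shadow mode, per B3
}

function pickEngine(routes: WorkloadRoute[], workload: string): WorkloadRoute {
  return (
    routes.find((r) => r.workload === workload) ?? {
      workload,
      engine: "inngest",
      shadowOnly: false,
    }
  );
}
```

The design choice this encodes: migration is opt-in per slice, and the absence of a route is always safe (default engine, no shadow).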

B4. Runtime notes (current evidence)

  • Temporal: strongest maturity and durability model, heavier operational and translation burden.
  • Trigger.dev: TS-native DX and durable task model, but cluster reliability must be proven under our restart/finalization failure class.
  • Hatchet: pragmatic self-host posture with flow control that maps well to current workload shape.
  • Restate: compelling exactly-once semantics, but requires the largest programming-model translation.

B5. Required research packages (in parallel with Track A)

  1. Feature compatibility mapping for top-25 critical joelclaw workflows.
  2. Restart/finalization failure-mode bakeoff per candidate.
  3. Operational cost model (runtime footprint, deploy complexity, observability parity).
  4. Migration blast-radius analysis with rollback path for each slice.

B6. Exit criteria for follow-up recommendation ADR

  • Scored matrix updated with measured evidence from pilot runs
  • At least one finalization-chaos test per candidate class
  • Incremental migration/combination plan with rollback by workload slice

Pilot recommendation (low-risk in-cluster shadow workload)

Shadow the githubWorkflowRunCompleted → webhook subscription dispatch path in-cluster.

Why this workload

  • Already cluster-scoped and operationally isolated
  • Low customer blast radius compared to core loop/story execution
  • Event-driven shape closely matches the orchestration behaviors we need to validate

Pilot constraints

  1. Mirror source events to a shadow stream; no user-facing side effects.
  2. Candidate runtime writes only shadow outputs (runtime.shadow.*) for parity comparison.
  3. Run 14-day bake with parity, latency, stale-run, and recovery metrics against Inngest baseline.

Pilot success criteria

  • ≥99% event/output parity with baseline
  • No unbounded RUNNING ghosts in pilot runtime
  • Clean rollback and disable path validated during bake
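
The parity criterion can be made concrete with a small comparison sketch. Keying by event id and the exact output representation are assumptions; only the ≥99% threshold comes from the criteria above:

```typescript
// Hedged sketch: event/output parity between baseline (Inngest) and shadow
// (candidate runtime) runs, keyed by a hypothetical per-event id.
function parityRatio(
  baseline: Map<string, string>,
  shadow: Map<string, string>,
): number {
  if (baseline.size === 0) return 1; // vacuous parity on an empty window
  let matched = 0;
  for (const [id, output] of baseline) {
    if (shadow.get(id) === output) matched++;
  }
  return matched / baseline.size;
}

// Success criterion from the pilot: ≥99% event/output parity with baseline.
function meetsParityCriterion(ratio: number): boolean {
  return ratio >= 0.99;
}
```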

Consequences

Positive

  • Immediate reliability gains come from a narrow, enforceable Track A gate set.
  • Runtime exploration continues in parallel without destabilizing current production.
  • Decision quality improves via scored fit/risk comparison and shadow pilot evidence.

Negative

  • Deploy latency increases due to warm/probe/bake gates.
  • Operating a combo architecture can add tooling and observability overhead.
  • Runtime migration remains incremental, not instant.

Risks and unknowns

  1. Self-host reliability claims from OSS projects must be validated under our exact restart/finalization failure mode.
  2. Cancellation/concurrency semantics may require non-trivial adapter translation in alternatives.
  3. Running two runtimes in parallel increases operational surface area.

Next actions (immediate)

  1. Implement Track A must-have gates and joelclaw deploy verify-runtime.
  2. Start Track B pilot prep for the selected in-cluster shadow workload.
  3. Run shadow pilot and publish scorecard evidence.
  4. Draft follow-up recommendation ADR with confidence level and migration slices.