ADR-0201 (proposed)

Workflow Runtime Deployment Strategy + Runtime Alternatives Research Plan

2026-03-03T00:00:00.000Z

Context

We have recurrent loop stalls around runtime start/restart windows. Current failure signatures:

`CHAIN_BROKEN` and `ORPHANED_CLAIM` reported by `joelclaw loop diagnose all`
`agent-loop-test-writer` runs stuck in `RUNNING`
Inngest trace finalization error: `Unable to reach SDK URL`
Recovery currently requires active repair (`loop diagnose --fix`, stale sweep)

This is orchestration/finalization fragility during startup or deployment transitions, not story logic failure.

joelclaw’s Inngest layer is the durable execution backbone (agent loops, gateway fan-out, cron checks, memory/discovery/content pipelines). Recent incidents showed that restart/cutover fragility is concentrated in finalization callback reachability and stale run state, not in business-logic correctness. This ADR therefore prioritizes an 80/20 reliability uplift first (tight deploy/runtime gates on the current Inngest path), while a parallel OSS alternatives analysis evaluates a possible combo-runtime future.

Related ADRs

ADR-0156: Graceful worker restart (accepted baseline for safer worker transitions)
ADR-0195: Mandatory memory participation contract
ADR-0196: Cancel-on-new-message behavior
ADR-0197: Sub-agent execution via step.invoke
ADR-0198: Adaptive heartbeat behavior
ADR-0199: Close the loop / reflect-brain operational contract
ADR-0200: Force-Enforce-Verify pattern (policy backbone for mandatory gates)

Decision

The approved execution plan has two explicit parallel tracks (A+B):

Track A (80/20 now, Inngest-only): deploy hardening gates with minimal added complexity, focused on preventing startup/finalization gaps.
Track B (in parallel): evaluate OSS runtime alternatives with a scored matrix and cluster-fit analysis to decide a combo architecture, not a forced one-engine migration.

Track A ships now. Track B runs concurrently and informs a follow-up recommendation ADR.

Current joelclaw workload profile (ground truth)

Scan of packages/system-bus/src/inngest/functions (134 files):

step.run: 609
step.sendEvent: 97
step.sleep: 11
step.invoke: 3
cancelOn: 10
concurrency: 81
throttle: 9
debounce: 2
cron: 39
step.waitForEvent: 0
step.waitForSignal: 0
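
A profile like the one above can be approximated with a small text scan. The following is a minimal sketch, assuming the scan counts call sites per primitive in each file's source text. `countPrimitives` and the `PRIMITIVES` list are hypothetical names, and this ADR does not state how the figures above were actually produced.

```typescript
// Hypothetical scan helper: counts call sites of selected Inngest step
// primitives in source text. Reading files from disk is left to the caller.
const PRIMITIVES = [
  "step.run",
  "step.sendEvent",
  "step.sleep",
  "step.invoke",
  "step.waitForEvent",
] as const;

type Primitive = (typeof PRIMITIVES)[number];
type PrimitiveCounts = Record<Primitive, number>;

// Escape regex metacharacters so "step.run" matches a literal dot.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function countPrimitives(sources: string[]): PrimitiveCounts {
  const counts: PrimitiveCounts = {
    "step.run": 0,
    "step.sendEvent": 0,
    "step.sleep": 0,
    "step.invoke": 0,
    "step.waitForEvent": 0,
  };
  for (const source of sources) {
    for (const primitive of PRIMITIVES) {
      // Match the primitive name followed by an opening parenthesis,
      // so "step.sleep(" is counted but "step.sleepUntil(" is not.
      const pattern = new RegExp(`\\b${escapeRegExp(primitive)}\\s*\\(`, "g");
      counts[primitive] += (source.match(pattern) ?? []).length;
    }
  }
  return counts;
}
```

Configuration keys such as `cancelOn`, `concurrency`, and `cron` are not function calls, so a scan for them would need a different pattern than the call-site match used here.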

What this implies

Our runtime needs are mostly:

Durable step execution + retries
Event-driven fan-out / chaining
Strong concurrency controls + cancellation semantics
Cron/scheduled workflows
Long-running task finalization reliability
Good run introspection for operations

We are not currently dependent on waitForEvent/waitForSignal semantics in core loops.

Track A (80/20 now): Inngest deploy hardening gates (minimal complexity)

A1. Immediate must-have gate set

Gate	Pass condition	Failure action
Warm	New worker registers and stays healthy for warm window	Block cutover
Callback probe	Synthetic finalization callback succeeds end-to-end	Block release
Drain	Old worker enters no-new-work mode and in-flight work drops below threshold	Hold shutdown
Bake	Post-cutover bake window shows no stale-run/chain-break regression	Mark deploy unhealthy
Rollback-on-breach	Any guardrail breach auto-reactivates prior worker	Automatic rollback
Single CLI verify surface	`joelclaw deploy verify-runtime` returns all gate statuses + final verdict	Deploy fails if command fails

A2. Immediate implementation set (scope lock)

Add `joelclaw deploy verify-runtime` as the single verification command (JSON output with per-gate status).
Wire deploy scripts to require, in order: warm → callback probe → drain → cutover → bake.
Add a rollback hook that re-enables the prior worker on any gate breach.
Gate health on existing loop integrity checks (`loop diagnose` + stale-run thresholds).
Emit OTEL stage events (`deploy.runtime.warm`, `deploy.runtime.callback_probe`, `deploy.runtime.drain`, `deploy.runtime.bake`, `deploy.runtime.rollback`).
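
The ordering and rollback rules in this implementation set can be sketched as a small sequencer. This is an illustration under stated assumptions, not the actual `verify-runtime` implementation: the function name `verifyRuntime`, the injected `checks`, `rollback`, and `emit` callbacks, and the verdict shape are all hypothetical. Checks are synchronous here to keep the sketch short; real probes would be asynchronous. Cutover is modeled as a stage whose check reports whether the cutover succeeded.

```typescript
// Sketch of gate sequencing for a verify-runtime style command.
type Stage = "warm" | "callback_probe" | "drain" | "cutover" | "bake";
type StageStatus = "pass" | "fail" | "skipped";

interface StageResult {
  stage: Stage;
  status: StageStatus;
}

interface RuntimeVerdict {
  ok: boolean;
  rolledBack: boolean;
  stages: StageResult[];
}

// Required order: warm → callback probe → drain → cutover → bake.
const STAGE_ORDER: Stage[] = ["warm", "callback_probe", "drain", "cutover", "bake"];

// OTEL stage event names listed in this ADR; no cutover event is listed.
const STAGE_EVENTS: Partial<Record<Stage, string>> = {
  warm: "deploy.runtime.warm",
  callback_probe: "deploy.runtime.callback_probe",
  drain: "deploy.runtime.drain",
  bake: "deploy.runtime.bake",
};

function verifyRuntime(
  checks: Record<Stage, () => boolean>,
  rollback: () => void,
  emit: (eventName: string) => void,
): RuntimeVerdict {
  const stages: StageResult[] = [];
  let breached = false;
  for (const stage of STAGE_ORDER) {
    if (breached) {
      // Later stages never run once a gate has been breached.
      stages.push({ stage, status: "skipped" });
      continue;
    }
    const eventName = STAGE_EVENTS[stage];
    if (eventName !== undefined) emit(eventName);
    let passed = false;
    try {
      passed = checks[stage]();
    } catch {
      passed = false; // a throwing probe counts as a breach
    }
    stages.push({ stage, status: passed ? "pass" : "fail" });
    breached = !passed;
  }
  if (breached) {
    emit("deploy.runtime.rollback");
    rollback(); // assumed to re-enable the prior worker
  }
  return { ok: !breached, rolledBack: breached, stages };
}
```

Serializing the returned verdict with `JSON.stringify` would give the per-gate JSON output described above, and a deploy script could fail the deploy whenever `ok` is false.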

A3. Deferred hardening (not in immediate set)

Progressive traffic cutover percentages (10/50/100)
Artifact promotion attestations and hash gates
Expanded chaos automation beyond callback/finalization class

Track B (in parallel): OSS runtime alternatives matrix + combo architecture

B1. Candidate set

Temporal (temporalio/sdk-typescript)
Trigger.dev (triggerdotdev/trigger.dev)
Hatchet (hatchet-dev/hatchet)
Restate (restatedev/restate)

B2. Scored matrix (joelclaw fit + translation risk)

Scoring rubric: workload fit and cluster-fit run from 1 (weak) to 5 (strong). Translation risk runs the other way (1 = low rewrite risk, 5 = high rewrite risk), so a lower number is better.

Runtime	Workload fit (1-5)	Cluster-fit (1-5)	Translation risk (1-5, high=hard)	Notes for combo architecture
Hatchet	4	4	3	Strong first pilot candidate for queue/event-heavy shadow workloads
Temporal	5	3	5	Highest durability ceiling; best kept for later targeted critical flows
Trigger.dev	4	3	4	Good TS ergonomics; validate self-host SLO behavior before broad use
Restate	3	3	5	Interesting exactly-once model; higher conceptual translation cost
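
Because translation risk is scored in the opposite direction from the two fit columns, it helps to make the arithmetic explicit when comparing rows. The sketch below encodes the matrix above and ranks it with an unweighted composite (workload fit + cluster-fit − translation risk). That composite is an assumption made for illustration only: this ADR does not define a combined score, and `compositeScore` and `rankRuntimes` are hypothetical names.

```typescript
interface RuntimeScore {
  runtime: string;
  workloadFit: number; // 1-5, higher is better
  clusterFit: number; // 1-5, higher is better
  translationRisk: number; // 1-5, higher is worse
}

// Scores copied from the matrix above.
const MATRIX: RuntimeScore[] = [
  { runtime: "Hatchet", workloadFit: 4, clusterFit: 4, translationRisk: 3 },
  { runtime: "Temporal", workloadFit: 5, clusterFit: 3, translationRisk: 5 },
  { runtime: "Trigger.dev", workloadFit: 4, clusterFit: 3, translationRisk: 4 },
  { runtime: "Restate", workloadFit: 3, clusterFit: 3, translationRisk: 5 },
];

// ASSUMPTION: unweighted composite; the ADR does not define a combined score.
function compositeScore(score: RuntimeScore): number {
  return score.workloadFit + score.clusterFit - score.translationRisk;
}

function rankRuntimes(matrix: RuntimeScore[]): RuntimeScore[] {
  // Copy before sorting so the input matrix is not mutated.
  return [...matrix].sort((a, b) => compositeScore(b) - compositeScore(a));
}
```

Under this assumed composite the rows score Hatchet 5, Temporal 3, Trigger.dev 3, and Restate 1, which agrees with the note naming Hatchet as the first pilot candidate. Any weighting adopted for the follow-up ADR could change the order of the middle rows.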

B3. Combo recommendation framing (explicit)

Do not force a one-engine migration.
Keep Inngest as the default engine for current production event fan-out and cron-heavy workflows.
Evaluate a second engine per-workload where it materially improves reliability or operability.
Use adapter boundaries and shadow mode first; migrate only proven workload slices.

B4. Runtime notes (current evidence)

Temporal: strongest maturity and durability model, heavier operational and translation burden.
Trigger.dev: TS-native DX and durable task model, but cluster reliability must be proven under our restart/finalization failure class.
Hatchet: pragmatic self-host posture with flow control that maps well to current workload shape.
Restate: compelling exactly-once semantics, but requires the largest programming-model translation.

B5. Required research packages (in parallel with Track A)

Feature compatibility mapping for top-25 critical joelclaw workflows.
Restart/finalization failure-mode bakeoff per candidate.
Operational cost model (runtime footprint, deploy complexity, observability parity).
Migration blast-radius analysis with rollback path for each slice.

B6. Exit criteria for follow-up recommendation ADR

Scored matrix updated with measured evidence from pilot runs
At least one finalization-chaos test per candidate class
Incremental migration/combination plan with rollback by workload slice

Pilot workload selection rationale

Already cluster-scoped and operationally isolated
Low customer blast radius compared to core loop/story execution
Event-driven shape closely matches the orchestration behaviors we need to validate

Pilot constraints

Mirror source events to a shadow stream; no user-facing side effects.
Candidate runtime writes only shadow outputs (runtime.shadow.*) for parity comparison.
Run a 14-day bake that compares parity, latency, stale-run, and recovery metrics against the Inngest baseline.
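
The mirroring constraints above can be sketched as a thin dispatch wrapper. This is a minimal illustration, assuming a hypothetical `RuntimeAdapter` boundary and an in-memory `sink`; only the `runtime.shadow.` prefix comes from this ADR, and the `runtime.shadow.error` topic is an invented placeholder.

```typescript
interface SourceEvent {
  id: string;
  name: string;
  data: unknown;
}

interface OutputRecord {
  eventId: string;
  topic: string;
  payload: string;
}

// Hypothetical adapter boundary: each runtime turns an event into a payload.
interface RuntimeAdapter {
  handle(event: SourceEvent): string;
}

const SHADOW_PREFIX = "runtime.shadow.";

function dispatchWithShadow(
  event: SourceEvent,
  primary: RuntimeAdapter,
  candidate: RuntimeAdapter,
  sink: OutputRecord[],
): void {
  // The baseline output is written under the real event name.
  sink.push({ eventId: event.id, topic: event.name, payload: primary.handle(event) });
  try {
    // The candidate writes only into the shadow namespace.
    sink.push({
      eventId: event.id,
      topic: SHADOW_PREFIX + event.name,
      payload: candidate.handle(event),
    });
  } catch {
    // A failing candidate must never affect the production path.
    // The "error" topic is a placeholder, not a defined joelclaw stream.
    sink.push({ eventId: event.id, topic: SHADOW_PREFIX + "error", payload: "candidate failed" });
  }
}
```

Keeping the candidate behind the same adapter interface as the baseline means a workload slice can later be switched or rolled back by changing which adapter is treated as primary.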

Pilot success criteria

≥99% event/output parity with baseline
No unbounded RUNNING ghosts in pilot runtime
Clean rollback and disable path validated during bake
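
Two of the success criteria above are directly computable. The sketch below assumes parity means an exact payload match per baseline event and that a "ghost" is a run still `RUNNING` after a caller-supplied age limit. Both definitions, and all names in the block, are assumptions for illustration. The rollback criterion is a procedural check and is not modeled here.

```typescript
interface RunOutput {
  eventId: string;
  payload: string;
}

interface PilotRun {
  id: string;
  status: "RUNNING" | "COMPLETED" | "FAILED";
  startedAtMs: number;
}

const PARITY_THRESHOLD = 0.99; // from the success criteria above

// Fraction of baseline outputs that the candidate reproduced exactly.
function parityRatio(baseline: RunOutput[], shadow: RunOutput[]): number {
  if (baseline.length === 0) return 1; // nothing to compare
  const shadowByEvent = new Map<string, string>();
  for (const output of shadow) shadowByEvent.set(output.eventId, output.payload);
  let matched = 0;
  for (const output of baseline) {
    if (shadowByEvent.get(output.eventId) === output.payload) matched += 1;
  }
  return matched / baseline.length;
}

// ASSUMPTION: a ghost is a run still RUNNING for longer than maxRunMs.
function findGhostRuns(runs: PilotRun[], nowMs: number, maxRunMs: number): PilotRun[] {
  return runs.filter((run) => run.status === "RUNNING" && nowMs - run.startedAtMs > maxRunMs);
}

function pilotPasses(parity: number, ghosts: PilotRun[]): boolean {
  return parity >= PARITY_THRESHOLD && ghosts.length === 0;
}
```

A missing shadow output counts against parity in this sketch, since an event the candidate never finished is a parity failure from the baseline's point of view.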

References (content-bearing)

Michael T. Nygard, Release It!
- Chapter 14: Start-up and Shutdown
- Chapter 17: Releases Shouldn’t Hurt
Site Reliability Engineering: How Google Runs Production
- Chapter 24: Distributed Periodic Scheduling with Cron
Sam Newman, Building Microservices (2nd Edition)
- Chapter 7: Build Pipelines and Continuous Delivery
- Chapter 8: Progressive Delivery
Temporal TypeScript SDK repository: <https://github.com/temporalio/sdk-typescript>
Trigger.dev repository: <https://github.com/triggerdotdev/trigger.dev>
Hatchet repository: <https://github.com/hatchet-dev/hatchet>
Restate repository: <https://github.com/restatedev/restate>

Consequences

Positive

Immediate reliability gains come from a narrow, enforceable Track A gate set.
Runtime exploration continues in parallel without destabilizing current production.
Decision quality improves via scored fit/risk comparison and shadow pilot evidence.

Negative

Deploy latency increases due to warm/probe/bake gates.
Operating a combo architecture can add tooling and observability overhead.
Runtime migration remains incremental, not instant.

Risks and unknowns

Self-host reliability claims from OSS projects must be validated under our exact restart/finalization failure mode.
Cancellation/concurrency semantics may require non-trivial adapter translation in alternatives.
Running two runtimes in parallel increases operational surface area.

Next actions (immediate)

Implement Track A must-have gates and `joelclaw deploy verify-runtime`.
Start Track B pilot prep for the selected in-cluster shadow workload.
Run shadow pilot and publish scorecard evidence.
Draft follow-up recommendation ADR with confidence level and migration slices.