Workflow Runtime Deployment Strategy + Runtime Alternatives Research Plan
Status
Proposed
Context
We have recurrent loop stalls around runtime start/restart windows. Current failure signatures:
- `CHAIN_BROKEN` and `ORPHANED_CLAIM` in `joelclaw loop diagnose all`
- `agent-loop-test-writer` runs stuck in `RUNNING`
- Inngest trace finalization error: `Unable to reach SDK URL`
- Recovery currently requires active repair (`loop diagnose --fix`, stale sweep)
This is orchestration/finalization fragility during startup or deployment transitions, not story logic failure.
Background summary
joelclaw’s Inngest layer is the durable execution backbone (agent loops, gateway fan-out, cron checks, memory/discovery/content pipelines). Recent incidents showed restart/cutover fragility concentrated around finalization callback reachability and stale run state, not business-logic correctness. This ADR therefore prioritizes an 80/20 reliability uplift first (tight deploy/runtime gates on current Inngest path), while running a parallel OSS alternatives analysis for a possible combo-runtime future.
Related ADR lineage
- ADR-0156: Graceful worker restart (accepted baseline for safer worker transitions)
- ADR-0195: Mandatory memory participation contract
- ADR-0196: Cancel-on-new-message behavior
- ADR-0197: Sub-agent execution via `step.invoke`
- ADR-0198: Adaptive heartbeat behavior
- ADR-0199: Close the loop / reflect-brain operational contract
- ADR-0200: Force-Enforce-Verify pattern (policy backbone for mandatory gates)
Decision
Approved execution is two explicit parallel tracks (A+B):
- Track A (80/20 now, Inngest-only): deploy hardening gates with minimal added complexity, focused on preventing startup/finalization gaps.
- Track B (in parallel): evaluate OSS runtime alternatives with a scored matrix and cluster-fit analysis to decide a combo architecture, not a forced one-engine migration.
Track A ships now. Track B runs concurrently and informs a follow-up recommendation ADR.
Current joelclaw workload profile (ground truth)
Scan of packages/system-bus/src/inngest/functions (134 files):
- `step.run`: 609
- `step.sendEvent`: 97
- `step.sleep`: 11
- `step.invoke`: 3
- `cancelOn`: 10
- `concurrency`: 81
- `throttle`: 9
- `debounce`: 2
- `cron`: 39
- `step.waitForEvent`: 0
- `step.waitForSignal`: 0
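A count of this kind could be reproduced with a small scanner. The following is a minimal sketch, assuming the counts are plain substring matches over source text; `countPrimitives` and the subset of primitives listed are illustrative and are not the tool that produced the numbers above.

```typescript
// Hypothetical helper: counts how often each primitive name appears in the
// given source texts. Plain substring matching is an assumption, so a name
// such as "step.sleep" would also match "step.sleepUntil".
const PRIMITIVES = [
  "step.run",
  "step.sendEvent",
  "step.sleep",
  "step.invoke",
  "step.waitForEvent",
] as const;

type Primitive = (typeof PRIMITIVES)[number];

function countPrimitives(sources: string[]): Record<Primitive, number> {
  const counts = {} as Record<Primitive, number>;
  for (const p of PRIMITIVES) counts[p] = 0;

  for (const text of sources) {
    for (const p of PRIMITIVES) {
      // split() yields one more piece than there are occurrences.
      counts[p] += text.split(p).length - 1;
    }
  }
  return counts;
}

// Usage with in-memory samples; a real scan would read the function files from disk.
const sampleCounts = countPrimitives([
  'await step.run("a", fn); await step.run("b", fn);',
  'await step.sendEvent("notify", evt);',
]);
console.log(sampleCounts);
```

Configuration keys such as `concurrency` or `cron` are left out of the sketch because substring matching on them would be noisy; a syntax-aware scan would be more reliable for those.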
What this implies
Our runtime needs are mostly:
- Durable step execution + retries
- Event-driven fan-out / chaining
- Strong concurrency controls + cancellation semantics
- Cron/scheduled workflows
- Long-running task finalization reliability
- Good run introspection for operations
We are not currently dependent on waitForEvent/waitForSignal semantics in core loops.
Track A (80/20 now): Inngest deploy hardening gates (minimal complexity)
A1. Immediate must-have gate set
| Gate | Pass condition | Failure action |
|---|---|---|
| Warm | New worker registers and stays healthy for warm window | Block cutover |
| Callback probe | Synthetic finalization callback succeeds end-to-end | Block release |
| Drain | Old worker enters no-new-work mode and in-flight work drops below threshold | Hold shutdown |
| Bake | Post-cutover bake window shows no stale-run/chain-break regression | Mark deploy unhealthy |
| Rollback-on-breach | Any guardrail breach auto-reactivates prior worker | Automatic rollback |
| Single CLI verify surface | joelclaw deploy verify-runtime returns all gate statuses + final verdict | Deploy fails if command fails |
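The per-gate statuses and final verdict described in the table could be modeled as below. This is a sketch under stated assumptions: the JSON shape, the `GateStatus` values, and `buildVerdict` are hypothetical and not the actual output contract of `joelclaw deploy verify-runtime`.

```typescript
// Gate names follow the A1 table; everything else here is an assumed shape.
type GateName = "warm" | "callback_probe" | "drain" | "bake";
type GateStatus = "pass" | "fail" | "skipped";

interface GateResult {
  gate: GateName;
  status: GateStatus;
  detail?: string;
}

interface RuntimeVerdict {
  ok: boolean;
  gates: GateResult[];
  failed: GateName[];
}

const REQUIRED_GATES: GateName[] = ["warm", "callback_probe", "drain", "bake"];

// A deploy is healthy only if every required gate reported "pass".
// A missing or skipped gate counts as a failure (fail-closed).
function buildVerdict(results: GateResult[]): RuntimeVerdict {
  const byGate = new Map<GateName, GateResult>();
  for (const r of results) byGate.set(r.gate, r);

  const gates = REQUIRED_GATES.map(
    (gate): GateResult =>
      byGate.get(gate) ?? { gate, status: "skipped", detail: "no result reported" }
  );
  const failed = gates.filter((g) => g.status !== "pass").map((g) => g.gate);
  return { ok: failed.length === 0, gates, failed };
}

// Usage: a failed callback probe makes the whole verdict fail.
const verdict = buildVerdict([
  { gate: "warm", status: "pass" },
  { gate: "callback_probe", status: "fail", detail: "probe timed out" },
  { gate: "drain", status: "pass" },
  { gate: "bake", status: "pass" },
]);
console.log(JSON.stringify(verdict, null, 2));
```

Treating an unreported gate as a failure keeps the verdict conservative: a gate that never ran cannot silently pass a deploy.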
A2. Immediate implementation set (scope lock)
- Add `joelclaw deploy verify-runtime` as the single verification command (JSON output with per-gate status).
- Wire deploy scripts to require, in order: warm → callback probe → drain → cutover → bake.
- Add rollback hook that re-enables prior worker on any gate breach.
- Gate health on existing loop integrity checks (`loop diagnose` + stale-run thresholds).
- Emit OTEL stage events (`deploy.runtime.warm`, `deploy.runtime.callback_probe`, `deploy.runtime.drain`, `deploy.runtime.bake`, `deploy.runtime.rollback`).
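The required stage order and the rollback hook can be sketched as a small sequential runner. This is an illustration only: `DeployHooks`, `runDeploy`, and the hook signatures are hypothetical, and the `emit` hook stands in for whatever OTEL exporter the deploy scripts use.

```typescript
// Stage order is taken from the list above; all names below are hypothetical.
type Stage = "warm" | "callback_probe" | "drain" | "cutover" | "bake";

const STAGE_ORDER: Stage[] = ["warm", "callback_probe", "drain", "cutover", "bake"];

interface DeployHooks {
  // Resolves to true when the stage's gate passes.
  runStage: (stage: Stage) => Promise<boolean>;
  // Re-enables the prior worker; assumed idempotent, so calling it before
  // cutover has happened is harmless.
  rollback: (failedStage: Stage) => Promise<void>;
  // Emits a stage event, for example to an OTEL exporter.
  emit: (eventName: string, attrs: Record<string, string>) => void;
}

interface DeployOutcome {
  ok: boolean;
  completed: Stage[];
  failedStage?: Stage;
}

async function runDeploy(hooks: DeployHooks): Promise<DeployOutcome> {
  const completed: Stage[] = [];
  for (const stage of STAGE_ORDER) {
    let passed = false;
    try {
      passed = await hooks.runStage(stage);
    } catch {
      passed = false; // a thrown error is treated as a gate breach
    }
    // The source list names no event for cutover, so only the gates emit here.
    if (stage !== "cutover") {
      hooks.emit(`deploy.runtime.${stage}`, { status: passed ? "pass" : "fail" });
    }
    if (!passed) {
      hooks.emit("deploy.runtime.rollback", { failedStage: stage });
      await hooks.rollback(stage);
      return { ok: false, completed, failedStage: stage };
    }
    completed.push(stage);
  }
  return { ok: true, completed };
}
```

Running stages strictly in sequence and stopping at the first breach means a failed drain never reaches cutover, and a failed bake always triggers the rollback hook.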
A3. Deferred hardening (not in immediate set)
- Progressive traffic cutover percentages (10/50/100)
- Artifact promotion attestations and hash gates
- Expanded chaos automation beyond callback/finalization class
Track B (in parallel): OSS runtime alternatives matrix + combo architecture
B1. Candidate set
- Temporal (`temporalio/sdk-typescript`)
- Trigger.dev (`triggerdotdev/trigger.dev`)
- Hatchet (`hatchet-dev/hatchet`)
- Restate (`restatedev/restate`)
B2. Scored matrix (joelclaw fit + translation risk)
Scoring rubric: the two fit columns run from 1 (weak) to 5 (strong). Translation risk is scored inversely: 1 = low rewrite risk, 5 = high rewrite risk.
| Runtime | Workload fit (1-5) | Cluster-fit (1-5) | Translation risk (1-5, high=hard) | Notes for combo architecture |
|---|---|---|---|---|
| Hatchet | 4 | 4 | 3 | Strong first pilot candidate for queue/event-heavy shadow workloads |
| Temporal | 5 | 3 | 5 | Highest durability ceiling; best kept for later targeted critical flows |
| Trigger.dev | 4 | 3 | 4 | Good TS ergonomics; validate self-host SLO behavior before broad use |
| Restate | 3 | 3 | 5 | Interesting exactly-once model; higher conceptual translation cost |
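One way to read the matrix as a single ranking is to flip translation risk onto the same "higher is better" scale and sum the columns. The equal weighting and the `6 - risk` flip are assumptions made for illustration; the ADR itself does not define a composite formula.

```typescript
interface RuntimeScore {
  runtime: string;
  workloadFit: number;     // 1-5, higher is better
  clusterFit: number;      // 1-5, higher is better
  translationRisk: number; // 1-5, higher is worse
}

// Values copied from the matrix above.
const MATRIX: RuntimeScore[] = [
  { runtime: "Hatchet", workloadFit: 4, clusterFit: 4, translationRisk: 3 },
  { runtime: "Temporal", workloadFit: 5, clusterFit: 3, translationRisk: 5 },
  { runtime: "Trigger.dev", workloadFit: 4, clusterFit: 3, translationRisk: 4 },
  { runtime: "Restate", workloadFit: 3, clusterFit: 3, translationRisk: 5 },
];

// Assumed formula: equal weights, with risk inverted via (6 - risk) so that
// a risk of 1 contributes 5 points and a risk of 5 contributes 1 point.
function compositeScore(s: RuntimeScore): number {
  return s.workloadFit + s.clusterFit + (6 - s.translationRisk);
}

function rankRuntimes(matrix: RuntimeScore[]): { runtime: string; score: number }[] {
  return matrix
    .map((s) => ({ runtime: s.runtime, score: compositeScore(s) }))
    .sort((a, b) => b.score - a.score);
}

console.log(rankRuntimes(MATRIX));
```

Under these assumed weights Hatchet scores 11, Temporal and Trigger.dev tie at 9, and Restate scores 7, which is consistent with the note naming Hatchet as the first pilot candidate. Different weights would change the ordering, so the composite is a reading aid rather than a decision rule.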
B3. Combo recommendation framing (explicit)
- Do not force a one-engine migration.
- Keep Inngest as the default engine for current production event fan-out and cron-heavy workflows.
- Evaluate a second engine per-workload where it materially improves reliability or operability.
- Use adapter boundaries and shadow mode first; migrate only proven workload slices.
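The adapter-boundary and shadow-mode idea can be sketched as a wrapper that always dispatches to the primary engine and mirrors the same event to a candidate engine. `RuntimeAdapter`, `WorkflowEvent`, and `withShadow` are hypothetical names; none of them corresponds to an existing joelclaw or Inngest API.

```typescript
interface WorkflowEvent {
  id: string;
  name: string;
  data: unknown;
}

// Hypothetical adapter boundary: each engine sits behind the same interface.
interface RuntimeAdapter {
  name: string;
  dispatch: (event: WorkflowEvent) => Promise<void>;
}

// Sends every event to the primary engine first, then mirrors it to the shadow
// engine. Shadow failures are reported but never fail the primary path.
function withShadow(
  primary: RuntimeAdapter,
  shadow: RuntimeAdapter,
  onShadowError: (err: unknown) => void
): RuntimeAdapter {
  return {
    name: `${primary.name}+shadow(${shadow.name})`,
    dispatch: async (event: WorkflowEvent): Promise<void> => {
      await primary.dispatch(event); // primary errors propagate to the caller
      try {
        await shadow.dispatch(event);
      } catch (err) {
        onShadowError(err); // recorded for later comparison, not rethrown
      }
    },
  };
}
```

The sketch awaits the shadow dispatch for simplicity, which adds its latency to the caller; a production mirror would more likely hand the event to a separate stream so the primary path is unaffected.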
B4. Runtime notes (current evidence)
- Temporal: strongest maturity and durability model, heavier operational and translation burden.
- Trigger.dev: TS-native DX and durable task model, but cluster reliability must be proven under our restart/finalization failure class.
- Hatchet: pragmatic self-host posture with flow control that maps well to current workload shape.
- Restate: compelling exactly-once semantics, but requires the largest programming-model translation.
B5. Required research packages (in parallel with Track A)
- Feature compatibility mapping for top-25 critical joelclaw workflows.
- Restart/finalization failure-mode bakeoff per candidate.
- Operational cost model (runtime footprint, deploy complexity, observability parity).
- Migration blast-radius analysis with rollback path for each slice.
B6. Exit criteria for follow-up recommendation ADR
- Scored matrix updated with measured evidence from pilot runs
- At least one finalization-chaos test per candidate class
- Incremental migration/combination plan with rollback by workload slice
Pilot recommendation (low-risk in-cluster shadow workload)
Recommended pilot
Shadow the githubWorkflowRunCompleted → webhook subscription dispatch path in-cluster.
Why this workload
- Already cluster-scoped and operationally isolated
- Low customer blast radius compared to core loop/story execution
- Event-driven shape closely matches the orchestration behaviors we need to validate
Pilot constraints
- Mirror source events to a shadow stream; no user-facing side effects.
- Candidate runtime writes only shadow outputs (`runtime.shadow.*`) for parity comparison.
- Run 14-day bake with parity, latency, stale-run, and recovery metrics against Inngest baseline.
Pilot success criteria
- ≥99% event/output parity with baseline
- No unbounded RUNNING ghosts in pilot runtime
- Clean rollback and disable path validated during bake
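The parity criterion could be checked with a comparison along these lines. This is a sketch, assuming outputs can be serialized to comparable strings and keyed by event id; `RunOutput` and `computeParity` are illustrative names, and only the 99% threshold comes from the criteria above.

```typescript
interface RunOutput {
  eventId: string;
  output: string; // serialized output, assumed comparable by string equality
}

interface ParityReport {
  total: number;
  matched: number;
  parity: number;
  meetsThreshold: boolean;
}

// Compares shadow outputs with baseline outputs keyed by event id.
// An event missing from the shadow side counts as a mismatch.
function computeParity(
  baseline: RunOutput[],
  shadow: RunOutput[],
  threshold = 0.99
): ParityReport {
  const shadowById = new Map<string, string>();
  for (const s of shadow) shadowById.set(s.eventId, s.output);

  let matched = 0;
  for (const b of baseline) {
    if (shadowById.get(b.eventId) === b.output) matched += 1;
  }

  const total = baseline.length;
  // With no baseline events there is nothing to compare, so report 0 (fail-closed).
  const parity = total === 0 ? 0 : matched / total;
  return { total, matched, parity, meetsThreshold: parity >= threshold };
}
```

Measuring against the baseline event set, rather than the shadow set, means events the candidate runtime dropped lower the score instead of disappearing from it.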
References (content-bearing)
- Michael T. Nygard, Release It!
- Chapter 14: Start-up and Shutdown
- Chapter 17: Releases Shouldn’t Hurt
- Site Reliability Engineering: How Google Runs Production
- Chapter 24: Distributed Periodic Scheduling with Cron
- Sam Newman, Building Microservices (2nd Edition)
- Chapter 7: Build Pipelines and Continuous Delivery
- Chapter 8: Progressive Delivery
- Temporal TypeScript SDK repository: <https://github.com/temporalio/sdk-typescript>
- Trigger.dev repository: <https://github.com/triggerdotdev/trigger.dev>
- Hatchet repository: <https://github.com/hatchet-dev/hatchet>
- Restate repository: <https://github.com/restatedev/restate>
Consequences
Positive
- Immediate reliability gains come from a narrow, enforceable Track A gate set.
- Runtime exploration continues in parallel without destabilizing current production.
- Decision quality improves via scored fit/risk comparison and shadow pilot evidence.
Negative
- Deploy latency increases due to warm/probe/bake gates.
- Operating a combo architecture can add tooling and observability overhead.
- Runtime migration remains incremental, not instant.
Risks and unknowns
- Self-host reliability claims from OSS projects must be validated under our exact restart/finalization failure mode.
- Cancellation/concurrency semantics may require non-trivial adapter translation in alternatives.
- Running two runtimes in parallel increases operational surface area.
Next actions (immediate)
- Implement Track A must-have gates and `joelclaw deploy verify-runtime`.
- Start Track B pilot prep for the selected in-cluster shadow workload.
- Run shadow pilot and publish scorecard evidence.
- Draft follow-up recommendation ADR with confidence level and migration slices.