Restate Durable Execution Engine
Status
Accepted
Context and Problem Statement
joelclaw currently runs 110+ durable functions on self-hosted Inngest (k8s StatefulSet). It works, but reliability has been inconsistent, and operational confidence is lower than these increasingly critical orchestration workloads require.
On 2026-03-04, a spike in packages/restate-spike/ validated Restate as a viable durable execution engine for joelclaw. Three representative patterns were proven:
- Durable step chains with ctx.run() mapping 1:1 to Inngest step.run() ✅
- Fan-out/fan-in via ctx.serviceClient() with cleaner orchestration than event fan-out ✅
- Human-in-the-loop signaling via ctx.promise() + external .resolve() with better ergonomics than step.waitForEvent() ✅
Restate runs as a single Rust binary and does not require Postgres. This aligns with ADR-0205’s AWS-mirrors-local principle and maps cleanly to Step Functions-class workflow execution.
Decision
Adopt Restate as the durable execution engine for new workloads, while running a dual-stack transition with Inngest during migration.
Phase 1: Dual-Run (immediate)
- Deploy Restate server to the k8s cluster as a StatefulSet.
- Port the swarm DAG orchestrator (ADR-0060) to Restate as the first production workload.
- Port an approval/human-in-the-loop workflow as the second production workload.
- Run Inngest and Restate side-by-side.
- Compare reliability, observability, DX, and failure recovery.
Phase 2: New Workloads Default to Restate
- Build all new durable workflows as Restate services/workflows.
- Keep existing Inngest functions running during transition.
- Add an adapter layer for event-driven triggers where needed (Inngest remains stronger in event-native patterns).
- Extend the joelclaw CLI to inspect Restate runs alongside Inngest runs.
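The adapter layer for event-driven triggers could be as small as a routing table from event names to Restate ingress paths. The sketch below is hypothetical: the route entries, the userOnboarding and auditLog service names, and the RESTATE_INGRESS base URL are illustrative assumptions. It relies on Restate's HTTP ingress convention of /{service}/{handler}, with a /send suffix for fire-and-forget calls.

```typescript
// Hypothetical event-to-RPC adapter: maps event names to Restate ingress URLs.
// Route entries and the base URL are assumptions for illustration only.
interface Route {
  service: string; // Restate service name
  handler: string; // handler to invoke on that service
}

const RESTATE_INGRESS = "http://localhost:8080"; // assumed ingress address

const routes: Record<string, Route[]> = {
  "app/user.created": [
    { service: "userOnboarding", handler: "start" },
    { service: "auditLog", handler: "record" },
  ],
};

// Returns one fire-and-forget URL per subscriber of the event.
// Unknown events simply have no subscribers.
function targetsFor(eventName: string, base: string = RESTATE_INGRESS): string[] {
  const subscribers = routes[eventName] ?? [];
  return subscribers.map((r) => `${base}/${r.service}/${r.handler}/send`);
}

const userCreatedTargets = targetsFor("app/user.created");
```

An adapter built this way would POST the event payload to each returned URL, which keeps fan-out explicit in one table rather than implicit in event subscriptions.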
Phase 3: Migration (gated on Phase 2 success)
- Migrate high-value Inngest workflows first, based on impact and operational pain.
- Evaluate target steady state:
  - Inngest retained for event fan-out only, or
  - Full migration to Restate (+ Redis/pub-sub adapters where needed).
- Decision gate: determine whether Inngest remains a core runtime.
API Mapping Reference
| Inngest | Restate | Notes |
|---|---|---|
| step.run("name", fn) | ctx.run("name", fn) | 1:1 |
| step.sleep("1h") | ctx.sleep({ hours: 1 }) | 1:1 |
| step.invoke("fn", data) | ctx.serviceClient(svc).method(data) | RPC-style, more explicit |
| step.waitForEvent("event") | ctx.promise("key") + .resolve() | Promise model, more ergonomic |
| step.sendEvent("event", data) | ctx.serviceSendClient(svc).method(data) | Fire-and-forget |
| Event trigger ("app/user.created") | HTTP call through Restate server | Restate is RPC-native, not event-native |
| Inngest dashboard | Restate admin API + CLI | Inngest UI is richer currently |
| inngest.createFunction() | restate.service() / restate.workflow() | Different mental models |
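The 1:1 mapping between step.run() and ctx.run() rests on the same idea in both engines: each named step's result is recorded, and after a crash the handler re-executes from the top while completed steps return their recorded results instead of running again. The toy model below illustrates that idea only. Journal and runStep are hypothetical names, the model is synchronous and in-memory, and it is not how either SDK is implemented.

```typescript
// Toy model of journaled replay; not the Restate or Inngest implementation.
type Journal = Record<string, unknown>;

// Runs `fn` once per step name; on replay, returns the journaled result.
function runStep<T>(journal: Journal, name: string, fn: () => T): T {
  if (name in journal) {
    return journal[name] as T; // replay: skip the side effect
  }
  const result = fn();
  journal[name] = result; // record the result before the next step starts
  return result;
}

let sideEffects = 0; // counts how often a step body really executed

// A two-step handler that can be made to crash between its steps.
function handler(journal: Journal, crashAfterBuild: boolean): string {
  const image = runStep(journal, "build", () => {
    sideEffects += 1;
    return "image:abc";
  });
  if (crashAfterBuild) throw new Error("worker killed");
  return runStep(journal, "push", () => {
    sideEffects += 1;
    return `pushed ${image}`;
  });
}

const journal: Journal = {}; // stands in for the engine's durable log
try {
  handler(journal, true); // first attempt dies after "build"
} catch {
  // a durable engine would retry the invocation at this point
}
const outcome = handler(journal, false); // retry replays "build", then runs "push"
```

After the retry, the build step has executed exactly once even though the handler ran twice, which is the property the kill-and-replay verification exercises.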
Kubernetes Deployment
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: restate
  namespace: joelclaw
spec:
  replicas: 1
  template:
    spec:
      containers:
        - name: restate
          image: restatedev/restate:1.6.2
          ports:
            - containerPort: 8080 # Ingress (send requests)
            - containerPort: 9070 # Admin API
            - containerPort: 9071 # Metrics
          volumeMounts:
            - name: restate-data
              mountPath: /restate-data
  volumeClaimTemplates:
    - metadata:
        name: restate-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
Worker services register with Restate via:
restate deployments register http://<service>:<port>
AWS Equivalent
| Local | AWS |
|---|---|
| Restate StatefulSet | AWS Step Functions / Restate Cloud |
| Restate admin API | Step Functions console |
| k8s Service endpoints | Lambda / ECS endpoints |
Consequences
Positive
- Single Rust binary with low operational surface.
- Direct Step Functions-like mental model for workflow design.
- Strong orchestration ergonomics for DAGs (service calls over event fan-out).
- Promise/signal model is natural for approval and human-in-the-loop workflows.
- Durable replay semantics equivalent to current Inngest guarantees.
- Dual-run approach lowers migration risk.
Negative
- Two durable engines must operate concurrently during transition.
- 110+ Inngest functions create a large migration surface if full cutover is chosen.
- Restate is RPC-native; event-native patterns need an adapter.
- Smaller community/ecosystem than Inngest.
- Operator tooling is currently less mature than Inngest dashboard UX.
Risks
- Restate project/platform stability over time.
- Some event-centric Inngest workloads may map poorly.
- Dual-stack period adds operational complexity.
Implementation Progress (2026-03-06)
- Added runtime manifest: k8s/restate.yaml (StatefulSet + Service, pinned restatedev/restate:1.6.2, PVC, probes).
- Applied manifest to joelclaw namespace and verified pod readiness (restate-0 Running 1/1).
- Added production Restate package surface in packages/restate/ with:
  - deployGate workflow (approval-gated deploy pipeline)
  - dagOrchestrator workflow + dagWorker service (fan-out/fan-in DAG execution by dependency wave)
- Added ops scripts:
  - scripts/restate/register-deployment.sh
  - scripts/restate/start.sh (canonical headless host launcher with env loading, SIGTERM forwarding, stale-port cleanup, and opportunistic deployment re-register)
  - scripts/restate/test-workflow.sh (deployGate smoke)
  - scripts/restate/test-dag-workflow.sh (DAG smoke)
- Added canonical host launchd asset: infra/launchd/com.joel.restate-worker.plist so the long-running Restate worker is no longer an ad-hoc nohup bun run ... shell.
- Added queue-drainer stall watchdog in packages/restate/src/queue-drainer.ts: if Redis backlog remains but progress stops past QUEUE_DRAIN_STALL_AFTER_MS, the worker emits queue.drainer.stalled and exits non-zero so launchd can restart the host runtime and replay the backlog instead of leaving queued traffic wedged behind a still-listening Bun process.
- Added trigger scripts:
  - packages/restate/src/trigger-deploy.ts
  - packages/restate/src/trigger-dag.ts
- Added CLI visibility: joelclaw restate status, joelclaw restate deployments, and joelclaw restate smoke.
- Updated smoke contract:
  - default joelclaw restate smoke validates deployGate (tag + rollout)
  - DAG validation runs via joelclaw restate smoke --script scripts/restate/test-dag-workflow.sh
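The stall condition behind the queue-drainer watchdog can be stated as a pure function. This is a sketch of the decision only, under assumed names: DrainerState, isStalled, and the 60-second threshold are hypothetical, and the real logic in packages/restate/src/queue-drainer.ts may differ.

```typescript
// Hypothetical shape of the state a drainer might track.
interface DrainerState {
  backlog: number; // items still queued in Redis
  lastProgressAt: number; // epoch ms of the last successfully drained item
}

// Stalled means: work is waiting, but nothing has been drained recently.
function isStalled(state: DrainerState, now: number, stallAfterMs: number): boolean {
  if (state.backlog === 0) return false; // an empty queue is idle, not stalled
  return now - state.lastProgressAt > stallAfterMs;
}

const STALL_AFTER_MS = 60000; // assumed value for QUEUE_DRAIN_STALL_AFTER_MS
const now = 1000000;

// Empty queue with no recent progress: idle.
const idle = isStalled({ backlog: 0, lastProgressAt: 0 }, now, STALL_AFTER_MS);
// Backlog present, progress one second ago: healthy.
const busy = isStalled({ backlog: 5, lastProgressAt: now - 1000 }, now, STALL_AFTER_MS);
// Backlog present, no progress for two minutes: stalled.
const wedged = isStalled({ backlog: 5, lastProgressAt: now - 120000 }, now, STALL_AFTER_MS);
```

In a worker, a check like this would run on a timer, and a positive result would be followed by emitting the stall event and exiting non-zero so the supervisor restarts the process.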
Deploy Gate (2026-03-05 → 2026-03-06)
First production workload: packages/restate/src/workflows/deploy-gate.ts. It wraps the system-bus-worker deploy pipeline with a human approval gate. Durable flow: auth → build → push → notify → approval promise/reminders → manifest update → kubectl apply → rollout verify → Inngest sync → outcome notification.
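The approval wait with reminders can be pictured as a loop that sleeps for the current tier's delay, peeks at the approval promise, and sends a reminder only if it is still unresolved. The simulation below uses a virtual clock in place of durable sleeps. The tier delays and the remindersBefore name are hypothetical and are not taken from deploy-gate.ts.

```typescript
// Assumed escalation tiers in minutes; the real values are not given in this ADR.
const TIER_DELAYS_MIN = [5, 15, 30, 60];

// Simulates the peek + sleep loop on a virtual clock.
// `approvedAtMin` is when the human approves, or null if they never do.
// Returns the virtual times (in minutes) at which reminders were sent.
function remindersBefore(
  approvedAtMin: number | null,
  delays: number[] = TIER_DELAYS_MIN,
): number[] {
  const sent: number[] = [];
  let clock = 0;
  for (const delay of delays) {
    clock += delay; // stands in for a durable sleep of this tier's delay
    // stands in for peeking at the approval promise after waking up
    const resolved = approvedAtMin !== null && approvedAtMin <= clock;
    if (resolved) break;
    sent.push(clock); // still pending: send this tier's reminder
  }
  return sent;
}

const quick = remindersBefore(3); // approved before the first reminder is due
const slow = remindersBefore(25); // approved after two reminders
const never = remindersBefore(null); // every tier fires
```

Because each sleep and each reminder would be a journaled step in a durable engine, a worker restart in the middle of the loop resumes at the current tier rather than starting the escalation over.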
Notifications go through a channel-agnostic interface (NotificationChannel) with Telegram and Console implementations. Callbacks are routed using the button data format restate:{service}:{workflowId}:{action}.
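Button data in the restate:{service}:{workflowId}:{action} format can be built and parsed with two small helpers. The sketch assumes that none of the three fields contains a colon, and it checks Telegram's 64-byte limit on callback_data. formatCallback, parseCallback, and the deploy-42 workflow ID are hypothetical and not taken from packages/restate/.

```typescript
interface CallbackTarget {
  service: string;
  workflowId: string;
  action: string;
}

const PREFIX = "restate";
const TELEGRAM_CALLBACK_MAX_BYTES = 64; // Telegram Bot API limit on callback_data

// UTF-8 byte length without relying on runtime-specific encoders.
function utf8Length(s: string): number {
  return encodeURIComponent(s).replace(/%[0-9A-F]{2}/g, "x").length;
}

// Builds the button data string, rejecting fields that would break parsing.
function formatCallback(t: CallbackTarget): string {
  const parts = [t.service, t.workflowId, t.action];
  if (parts.some((p) => p.length === 0 || p.indexOf(":") !== -1)) {
    throw new Error("fields must be non-empty and must not contain ':'");
  }
  const data = [PREFIX, ...parts].join(":");
  if (utf8Length(data) > TELEGRAM_CALLBACK_MAX_BYTES) {
    throw new Error("callback data exceeds 64 bytes");
  }
  return data;
}

// Parses button data; returns null for anything not in the expected format.
function parseCallback(data: string): CallbackTarget | null {
  const parts = data.split(":");
  if (parts.length !== 4 || parts[0] !== PREFIX) return null;
  const [, service, workflowId, action] = parts;
  if (!service || !workflowId || !action) return null;
  return { service, workflowId, action };
}

const encoded = formatCallback({
  service: "deployGate",
  workflowId: "deploy-42",
  action: "approve",
});
const decoded = parseCallback(encoded);
```

Returning null for foreign data lets a shared callback handler ignore buttons that belong to other systems instead of failing on them.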
DAG Workload (2026-03-06)
First production fan-out/fan-in DAG workload: packages/restate/src/workflows/dag-orchestrator.ts.
- dagOrchestrator.run validates node graph, computes topological waves, then executes each wave in parallel via ctx.serviceClient(dagWorker).execute(...).
- dagWorker.execute is a durable per-node service handler (ctx.run + optional sleep).
- Verified on cluster with successful invocation IDs:
  - inv_123AVDN6LYLq4eJ09TRKj4fjFBXcftUwWy
  - inv_1bML0iAbN2hF2bov0QPLWrZ2ShghBA1LLB
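Grouping a DAG into dependency waves is a variant of Kahn's algorithm: every node whose dependencies are all complete joins the current wave, and the process repeats until no nodes remain. The sketch below is a minimal version under an assumed node shape (id plus dependsOn). It is not the code in dag-orchestrator.ts, and computeWaves and the example node names are hypothetical.

```typescript
// Assumed node shape; the real graph format is not shown in this ADR.
interface DagNode {
  id: string;
  dependsOn: string[];
}

// Returns node IDs grouped into waves; nodes within a wave are independent.
function computeWaves(nodes: DagNode[]): string[][] {
  const ids = nodes.map((n) => n.id);
  // Validate the graph: every dependency must refer to a known node.
  for (const n of nodes) {
    for (const dep of n.dependsOn) {
      if (ids.indexOf(dep) === -1) {
        throw new Error(`node ${n.id} depends on unknown node ${dep}`);
      }
    }
  }
  const done: string[] = [];
  let remaining = nodes.slice();
  const waves: string[][] = [];
  while (remaining.length > 0) {
    // A node is ready once all of its dependencies ran in earlier waves.
    const ready = remaining.filter((n) =>
      n.dependsOn.every((d) => done.indexOf(d) !== -1),
    );
    if (ready.length === 0) throw new Error("cycle detected in DAG");
    const wave = ready.map((n) => n.id);
    waves.push(wave);
    for (const id of wave) done.push(id);
    remaining = remaining.filter((n) => wave.indexOf(n.id) === -1);
  }
  return waves;
}

const waves = computeWaves([
  { id: "fetch", dependsOn: [] },
  { id: "lint", dependsOn: ["fetch"] },
  { id: "test", dependsOn: ["fetch"] },
  { id: "report", dependsOn: ["lint", "test"] },
]);
// waves is [["fetch"], ["lint", "test"], ["report"]]
```

An orchestrator would then await all calls in one wave before starting the next, so a failed or replayed node only delays the waves that depend on it.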
Open items remain for full ADR-0060 swarm YAML orchestration and Phase 1 metrics comparison.
Verification
- Restate server deployed to k8s cluster.
- Swarm orchestrator running on Restate in production.
- Approval workflow running on Restate (deploy gate — first production workload).
- DAG fan-out/fan-in workload running on Restate in production (dagOrchestrator + dagWorker).
- joelclaw CLI can inspect Restate runtime/deployments (restate status, restate deployments).
- End-to-end deployGate smoke passes against cluster runtime (joelclaw restate smoke).
- End-to-end DAG smoke passes against cluster runtime (joelclaw restate smoke --script scripts/restate/test-dag-workflow.sh).
- Failure recovery tested: kill worker mid-execution and verify replay. Proven 2026-03-05: PID 4820→5691 step chain, PID 9231→11509 promise/signal.
- Channel interface (NotificationChannel) with Telegram and Console implementations.
- Escalating reminder loop (peek + sleep) proven with 4-tier escalation.
- Duplicate callback (409) safety for multi-message workflows.
- Deploy gate end-to-end test with successful build+push+approve+rollout.
- Phase 1 metrics documented (reliability, latency, DX assessment).
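One plausible reading of the duplicate-callback item is that an approval is resolved at most once and later attempts are answered with a conflict. The in-memory sketch below mirrors that behaviour with HTTP-style status codes. ApprovalStore and the deploy-42 ID are hypothetical stand-ins; in the real system the durable promise is held by Restate, and this ADR does not say where the 409 originates.

```typescript
type Decision = "approve" | "reject";

interface CallbackResult {
  status: 200 | 409;
  decision: Decision; // the decision in effect after this call
}

// In-memory stand-in for a resolve-once approval promise.
class ApprovalStore {
  private decisions: Record<string, Decision> = {};

  resolve(workflowId: string, decision: Decision): CallbackResult {
    const existing = this.decisions[workflowId];
    if (existing !== undefined) {
      // Already resolved: report a conflict and keep the first decision.
      return { status: 409, decision: existing };
    }
    this.decisions[workflowId] = decision;
    return { status: 200, decision };
  }
}

const store = new ApprovalStore();
const first = store.resolve("deploy-42", "approve"); // first button press wins
const duplicate = store.resolve("deploy-42", "reject"); // a later press is rejected
```

When a workflow posts several messages that each carry buttons, this first-writer-wins rule keeps a stale button from overriding a decision that has already been acted on.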