ADR-0207 (accepted)

Restate Durable Execution Engine

Status

Accepted

Context and Problem Statement

joelclaw currently runs 110+ durable functions on self-hosted Inngest (k8s StatefulSet). It works, but reliability has felt flaky, and operational confidence is lower than increasingly critical orchestration workloads require.

On 2026-03-04, a spike in packages/restate-spike/ validated Restate as a viable durable execution engine for joelclaw. Three representative patterns were proven:

  1. Durable step chains with ctx.run() mapping 1:1 to Inngest step.run()
  2. Fan-out/fan-in via ctx.serviceClient() with cleaner orchestration than event fan-out
  3. Human-in-the-loop signaling via ctx.promise() + external .resolve() with better ergonomics than step.waitForEvent()
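
The step-chain pattern depends on journaled steps: once a step completes, its result is recorded, and a retry after a crash reuses the recorded result instead of repeating the side effect. The following is a toy model of that idea, not the Restate SDK. Journal, ToyContext, and deployChain are hypothetical names, and the sketch is synchronous for brevity, whereas the real ctx.run() is asynchronous and its journal is kept by the Restate server.

```typescript
// Toy model of journaled replay. NOT the Restate SDK: Journal, ToyContext and
// deployChain are hypothetical names used only for illustration.
type Journal = Map<string, unknown>;

class ToyContext {
  private readonly journal: Journal;

  constructor(journal: Journal) {
    this.journal = journal;
  }

  // Run a named step at most once per journal: if a result is already
  // recorded, return it without invoking fn again.
  run<T>(name: string, fn: () => T): T {
    if (this.journal.has(name)) {
      return this.journal.get(name) as T;
    }
    const result = fn();
    this.journal.set(name, result);
    return result;
  }
}

// Records real side-effect executions so the replay is observable.
const executions: string[] = [];

function deployChain(ctx: ToyContext, crashAfterBuild: boolean): string {
  const image = ctx.run("build", () => {
    executions.push("build");
    return "image:abc123";
  });
  if (crashAfterBuild) {
    throw new Error("simulated worker crash");
  }
  return ctx.run("push", () => {
    executions.push("push");
    return `pushed ${image}`;
  });
}

// The journal outlives the crashed attempt, as it would in a durable engine.
const journal: Journal = new Map();
try {
  deployChain(new ToyContext(journal), true);
} catch {
  // The first attempt dies after "build" has been journaled.
}
const outcome = deployChain(new ToyContext(journal), false);
```

On the second attempt the "build" step is served from the journal, so its side effect runs only once across both attempts.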

Restate runs as a single Rust binary and does not require Postgres. This aligns with ADR-0205’s AWS-mirrors-local principle and maps cleanly to Step Functions-class workflow execution.

Decision

Adopt Restate as the durable execution engine for new workloads, while running a dual-stack transition with Inngest during migration.

Phase 1: Dual-Run (immediate)

  • Deploy Restate server to the k8s cluster as a StatefulSet.
  • Port the swarm DAG orchestrator (ADR-0060) to Restate as first production workload.
  • Port an approval/human-in-the-loop workflow as second production workload.
  • Run Inngest and Restate side-by-side.
  • Compare reliability, observability, DX, and failure recovery.

Phase 2: New Workloads Default to Restate

  • Build all new durable workflows as Restate services/workflows.
  • Keep existing Inngest functions running during transition.
  • Add an adapter layer for event-driven triggers where needed (Inngest remains stronger in event-native patterns).
  • Extend joelclaw CLI to inspect Restate runs alongside Inngest runs.

Phase 3: Migration (gated on Phase 2 success)

  • Migrate high-value Inngest workflows first, based on impact and operational pain.
  • Evaluate target steady state:
    • Inngest retained for event fan-out only, or
    • Full migration to Restate (+ Redis/pub-sub adapters where needed).
  • Decision gate: determine whether Inngest remains a core runtime.

API Mapping Reference

| Inngest | Restate | Notes |
|---|---|---|
| step.run("name", fn) | ctx.run("name", fn) | 1:1 |
| step.sleep("1h") | ctx.sleep({ hours: 1 }) | 1:1 |
| step.invoke("fn", data) | ctx.serviceClient(svc).method(data) | RPC-style, more explicit |
| step.waitForEvent("event") | ctx.promise("key") + .resolve() | Promise model, more ergonomic |
| step.sendEvent("event", data) | ctx.serviceSendClient(svc).method(data) | Fire-and-forget |
| Event trigger ("app/user.created") | HTTP call through Restate server | Restate is RPC-native, not event-native |
| Inngest dashboard | Restate admin API + CLI | Inngest UI is richer currently |
| inngest.createFunction() | restate.service() / restate.workflow() | Different mental models |
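
The "Event trigger" row is where an adapter is needed: something has to turn an event name into an HTTP call against the Restate ingress. A minimal sketch of such a translation follows. The route table, the service names, and the restate:8080 host are hypothetical. The path shapes (service/handler, workflow/id/run, and a trailing /send for fire-and-forget) follow Restate's HTTP ingress conventions as I understand them and should be checked against the deployed version.

```typescript
// Sketch of an event-to-RPC adapter. The route table and service names are
// hypothetical; only the ingress path shapes follow Restate's HTTP ingress.
interface Route {
  service: string;  // Restate service or workflow name
  handler: string;  // handler to invoke ("run" for a workflow)
  keyFrom?: string; // payload field used as the workflow ID, if any
}

// Hypothetical mapping from Inngest-style event names to Restate targets.
const routes: Record<string, Route> = {
  "app/user.created": { service: "onboarding", handler: "run", keyFrom: "userId" },
  "app/report.requested": { service: "reports", handler: "generate" },
};

interface IngressRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function toIngressRequest(
  ingressBase: string,
  eventName: string,
  data: Record<string, unknown>,
): IngressRequest {
  const route = routes[eventName];
  if (!route) {
    throw new Error(`no Restate route for event ${eventName}`);
  }
  const segments = [route.service];
  if (route.keyFrom) {
    const key = data[route.keyFrom];
    if (typeof key !== "string" || key.length === 0) {
      throw new Error(`event ${eventName} is missing key field ${route.keyFrom}`);
    }
    segments.push(encodeURIComponent(key));
  }
  // "/send" asks the ingress to enqueue the call and return immediately,
  // which is the closest match to emitting an event.
  segments.push(route.handler, "send");
  return {
    url: `${ingressBase.replace(/\/$/, "")}/${segments.join("/")}`,
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(data),
  };
}

const example = toIngressRequest("http://restate:8080", "app/user.created", { userId: "u-42" });
```

Keeping the translation a pure function makes the routing table testable without a running Restate server; the caller passes the result to an HTTP client.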

Kubernetes Deployment

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: restate
  namespace: joelclaw
spec:
  serviceName: restate # assumed name of the governing Service; required for a StatefulSet
  replicas: 1
  selector:
    matchLabels:
      app: restate # assumed label; must match the pod template labels below
  template:
    metadata:
      labels:
        app: restate
    spec:
      containers:
        - name: restate
          image: restatedev/restate:1.6.2
          ports:
            - containerPort: 8080 # Ingress (send requests)
            - containerPort: 9070 # Admin API
            - containerPort: 9071 # Metrics
          volumeMounts:
            - name: restate-data
              mountPath: /restate-data
  volumeClaimTemplates:
    - metadata:
        name: restate-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi

Worker services register with Restate via:

restate deployments register http://<service>:<port>
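
For scripted registration, the same step can be expressed as an HTTP request against the admin API. The sketch below only builds the request. It assumes the admin API accepts POST /deployments with a JSON body containing a uri field, which matches the Restate documentation as I recall it but should be verified against the pinned server version. The admin and worker URLs are placeholders, not values from this repository.

```typescript
// Builds the admin API request that registers a deployment. The admin URL and
// service URI below are placeholders, not values from this repository.
interface RegisterRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildRegisterRequest(adminBase: string, serviceUri: string): RegisterRequest {
  // Reject anything that is not an http(s) endpoint before calling the server.
  if (!/^https?:\/\//.test(serviceUri)) {
    throw new Error(`service URI must be http(s): ${serviceUri}`);
  }
  return {
    url: `${adminBase.replace(/\/$/, "")}/deployments`,
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ uri: serviceUri }),
  };
}

const registerReq = buildRegisterRequest("http://localhost:9070/", "http://worker:9080");
```

The returned object has the shape expected by fetch's second argument, so a caller could send it with fetch(registerReq.url, registerReq).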

AWS Equivalent

| Local | AWS |
|---|---|
| Restate StatefulSet | AWS Step Functions / Restate Cloud |
| Restate admin API | Step Functions console |
| k8s Service endpoints | Lambda / ECS endpoints |

Consequences

Positive

  • Single Rust binary with low operational surface.
  • Direct Step Functions-like mental model for workflow design.
  • Strong orchestration ergonomics for DAGs (service calls over event fan-out).
  • Promise/signal model is natural for approval and human-in-the-loop workflows.
  • Durable replay semantics equivalent to current Inngest guarantees.
  • Dual-run approach lowers migration risk.

Negative

  • Two durable engines must operate concurrently during transition.
  • 110+ Inngest functions create a large migration surface if full cutover is chosen.
  • Restate is RPC-native; event-native patterns need an adapter.
  • Smaller community/ecosystem than Inngest.
  • Operator tooling is currently less mature than Inngest dashboard UX.

Risks

  • Restate project/platform stability over time.
  • Some event-centric Inngest workloads may map poorly.
  • Dual-stack period adds operational complexity.

Implementation Progress (2026-03-06)

  • Added runtime manifest: k8s/restate.yaml (StatefulSet + Service, pinned restatedev/restate:1.6.2, PVC, probes).
  • Applied manifest to joelclaw namespace and verified pod readiness (restate-0 Running 1/1).
  • Added production Restate package surface in packages/restate/ with:
    • deployGate workflow (approval-gated deploy pipeline)
    • dagOrchestrator workflow + dagWorker service (fan-out/fan-in DAG execution by dependency wave)
  • Added ops scripts:
    • scripts/restate/register-deployment.sh
    • scripts/restate/start.sh (canonical headless host launcher with env loading, SIGTERM forwarding, stale-port cleanup, and opportunistic deployment re-register)
    • scripts/restate/test-workflow.sh (deployGate smoke)
    • scripts/restate/test-dag-workflow.sh (DAG smoke)
  • Added a canonical host launchd asset, infra/launchd/com.joel.restate-worker.plist, so the long-running Restate worker is no longer an ad-hoc nohup bun run ... shell.
  • Added a queue-drainer stall watchdog in packages/restate/src/queue-drainer.ts. If Redis backlog remains but no progress is made for longer than QUEUE_DRAIN_STALL_AFTER_MS, the worker emits queue.drainer.stalled and exits non-zero. launchd then restarts the host runtime and replays the backlog, instead of leaving queued traffic wedged behind a still-listening Bun process.
  • Added trigger scripts:
    • packages/restate/src/trigger-deploy.ts
    • packages/restate/src/trigger-dag.ts
  • Added CLI visibility: joelclaw restate status, joelclaw restate deployments, and joelclaw restate smoke.
  • Updated smoke contract:
    • default joelclaw restate smoke validates deployGate (tag + rollout)
    • DAG validation runs via joelclaw restate smoke --script scripts/restate/test-dag-workflow.sh
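
The stall watchdog described in the list above reduces to one predicate: backlog is present and nothing has drained for longer than the threshold. A minimal sketch of that check follows. DrainState, WatchdogVerdict, and checkStall are hypothetical names; only QUEUE_DRAIN_STALL_AFTER_MS and the queue.drainer.stalled event name come from the notes above, and the 60-second value is a stand-in.

```typescript
// Sketch of a stall check for a queue drainer. The state shape and function
// name are hypothetical, not the contents of queue-drainer.ts.
interface DrainState {
  backlog: number;        // items still waiting in the queue
  lastProgressAt: number; // epoch ms of the last successfully drained item
}

type WatchdogVerdict =
  | { stalled: false }
  | { stalled: true; event: "queue.drainer.stalled"; idleMs: number };

function checkStall(state: DrainState, now: number, stallAfterMs: number): WatchdogVerdict {
  const idleMs = now - state.lastProgressAt;
  // An empty queue is idle, not stalled: only backlog without progress counts.
  if (state.backlog > 0 && idleMs > stallAfterMs) {
    return { stalled: true, event: "queue.drainer.stalled", idleMs };
  }
  return { stalled: false };
}

const stallAfterMs = 60_000; // stand-in for QUEUE_DRAIN_STALL_AFTER_MS
const healthy = checkStall({ backlog: 0, lastProgressAt: 0 }, 120_000, stallAfterMs);
const wedged = checkStall({ backlog: 7, lastProgressAt: 0 }, 120_000, stallAfterMs);
```

On a stalled verdict the worker would emit the event and exit non-zero, leaving the restart to launchd. Passing now as a parameter keeps the check deterministic in tests.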

Deploy Gate (2026-03-05 → 2026-03-06)

First production workload: packages/restate/src/workflows/deploy-gate.ts. It wraps the system-bus-worker deploy pipeline with a human approval gate. Durable flow: auth → build → push → notify → approval promise/reminders → manifest update → kubectl apply → rollout verify → Inngest sync → outcome notification.

Notifications go through a channel-agnostic interface (NotificationChannel) with Telegram and Console implementations. Button callbacks are routed via the restate:{service}:{workflowId}:{action} button data format.
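
The callback data format can be illustrated with a small encoder and parser. This is a sketch under assumptions: the function names, the ASCII-only field restriction, and the example workflow ID are hypothetical. The 64-byte cap reflects Telegram's limit on callback_data; other channels may differ.

```typescript
// Encode and parse button callback data in the
// restate:{service}:{workflowId}:{action} format. Names are hypothetical.
interface CallbackRoute {
  service: string;
  workflowId: string;
  action: string;
}

const PREFIX = "restate";
const MAX_CALLBACK_BYTES = 64; // Telegram's callback_data limit
// Assumption: fields are ASCII-only, so string length equals byte length.
const FIELD = /^[A-Za-z0-9_.-]+$/;

function encodeCallback(route: CallbackRoute): string {
  const parts = [route.service, route.workflowId, route.action];
  if (!parts.every((part) => FIELD.test(part))) {
    throw new Error("callback fields must match [A-Za-z0-9_.-]+");
  }
  const data = [PREFIX, ...parts].join(":");
  if (data.length > MAX_CALLBACK_BYTES) {
    throw new Error(`callback data exceeds ${MAX_CALLBACK_BYTES} bytes`);
  }
  return data;
}

function parseCallback(data: string): CallbackRoute | null {
  const parts = data.split(":");
  if (parts.length !== 4 || parts[0] !== PREFIX) {
    return null; // not a Restate callback; let other handlers try
  }
  const [, service, workflowId, action] = parts;
  if (!service || !workflowId || !action) {
    return null;
  }
  return { service, workflowId, action };
}

const encoded = encodeCallback({
  service: "deployGate",
  workflowId: "deploy-001", // hypothetical workflow ID
  action: "approve",
});
const decoded = parseCallback(encoded);
```

Returning null for foreign payloads, rather than throwing, lets a single button handler share the channel with callbacks that do not belong to Restate.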

DAG Workload (2026-03-06)

First production fan-out/fan-in DAG workload: packages/restate/src/workflows/dag-orchestrator.ts.

  • dagOrchestrator.run validates node graph, computes topological waves, then executes each wave in parallel via ctx.serviceClient(dagWorker).execute(...).
  • dagWorker.execute is a durable per-node service handler (ctx.run + optional sleep).
  • Verified on cluster with successful invocation IDs:
    • inv_123AVDN6LYLq4eJ09TRKj4fjFBXcftUwWy
    • inv_1bML0iAbN2hF2bov0QPLWrZ2ShghBA1LLB
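
The "topological waves" step can be sketched as Kahn's algorithm applied level by level: each wave contains the nodes whose dependencies all ran in earlier waves. The DagNode shape and computeWaves are assumptions for illustration, not the schema or code used by dagOrchestrator.

```typescript
// Group DAG nodes into dependency waves (Kahn's algorithm, level by level).
// The DagNode shape is an assumption, not the schema used by dagOrchestrator.
interface DagNode {
  id: string;
  dependsOn: string[];
}

function computeWaves(nodes: DagNode[]): string[][] {
  // Map each node to the set of dependencies that have not run yet.
  const remaining = new Map<string, Set<string>>();
  for (const node of nodes) {
    if (remaining.has(node.id)) {
      throw new Error(`duplicate node id: ${node.id}`);
    }
    remaining.set(node.id, new Set(node.dependsOn));
  }
  for (const node of nodes) {
    for (const dep of node.dependsOn) {
      if (!remaining.has(dep)) {
        throw new Error(`node ${node.id} depends on unknown node ${dep}`);
      }
    }
  }

  const waves: string[][] = [];
  while (remaining.size > 0) {
    // A node is ready once every dependency has run in an earlier wave.
    const ready = Array.from(remaining.entries())
      .filter(([, deps]) => deps.size === 0)
      .map(([id]) => id)
      .sort();
    if (ready.length === 0) {
      throw new Error("cycle detected in DAG");
    }
    for (const id of ready) {
      remaining.delete(id);
    }
    remaining.forEach((deps) => {
      for (const id of ready) {
        deps.delete(id);
      }
    });
    waves.push(ready);
  }
  return waves;
}

const waves = computeWaves([
  { id: "deploy", dependsOn: ["build", "test"] },
  { id: "build", dependsOn: ["fetch"] },
  { id: "test", dependsOn: ["fetch"] },
  { id: "fetch", dependsOn: [] },
]);
// waves is [["fetch"], ["build", "test"], ["deploy"]]
```

An orchestrator would dispatch every node in a wave in parallel and wait for all of them before starting the next wave. Sorting the IDs within a wave keeps the result deterministic, which matters when the computation is replayed.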

Open items remain for full ADR-0060 swarm YAML orchestration and Phase 1 metrics comparison.

Verification

  • Restate server deployed to k8s cluster.
  • Swarm orchestrator running on Restate in production (open: full ADR-0060 swarm YAML orchestration is still pending).
  • Approval workflow running on Restate (deploy gate — first production workload).
  • DAG fan-out/fan-in workload running on Restate in production (dagOrchestrator + dagWorker).
  • joelclaw CLI can inspect Restate runtime/deployments (restate status, restate deployments).
  • End-to-end deployGate smoke passes against cluster runtime (joelclaw restate smoke).
  • End-to-end DAG smoke passes against cluster runtime (joelclaw restate smoke --script scripts/restate/test-dag-workflow.sh).
  • Failure recovery tested: kill worker mid-execution and verify replay. Proven 2026-03-05: PID 4820→5691 step chain, PID 9231→11509 promise/signal.
  • Channel interface (NotificationChannel) with Telegram and Console implementations.
  • Escalating reminder loop (peek + sleep) proven with 4-tier escalation.
  • Duplicate callback (409) safety for multi-message workflows.
  • Deploy gate end-to-end test with successful build+push+approve+rollout.
  • Phase 1 metrics documented (reliability, latency, DX assessment); open: the Phase 1 metrics comparison is still pending.
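
The escalating reminder loop (peek + sleep) listed above can be sketched against a small interface. GateContext is a hypothetical stand-in for the workflow context, the four tier delays and labels are made-up values rather than the deployGate settings, and returning "timed-out" after the last tier is an assumption about the desired behavior. In a real workflow, peekDecision would presumably be a non-blocking peek at the durable approval promise and sleep a durable timer.

```typescript
// Sketch of an escalating reminder loop around an approval gate.
// GateContext, Tier and the tier values are hypothetical.
interface GateContext {
  peekDecision(): Promise<string | undefined>; // non-blocking look at the approval
  sleep(ms: number): Promise<void>;            // a durable timer in a real engine
  notify(message: string): Promise<void>;
}

interface Tier {
  afterMs: number;
  label: string;
}

const TIERS: Tier[] = [
  { afterMs: 5 * 60_000, label: "reminder" },
  { afterMs: 15 * 60_000, label: "second reminder" },
  { afterMs: 60 * 60_000, label: "urgent" },
  { afterMs: 4 * 60 * 60_000, label: "final notice" },
];

async function remindUntilDecided(ctx: GateContext, tiers: Tier[]): Promise<string> {
  for (const tier of tiers) {
    await ctx.sleep(tier.afterMs);
    // Peek instead of awaiting the decision, so the loop can keep escalating.
    const decision = await ctx.peekDecision();
    if (decision !== undefined) {
      return decision;
    }
    await ctx.notify(`${tier.label}: deploy is waiting for approval`);
  }
  return "timed-out"; // assumption: give up after the last tier
}

// In-memory fake with a simulated clock: approval arrives at the 20-minute mark.
let clock = 0;
const sent: string[] = [];
const fakeCtx: GateContext = {
  peekDecision: async () => (clock >= 20 * 60_000 ? "approved" : undefined),
  sleep: async (ms) => {
    clock += ms;
  },
  notify: async (message) => {
    sent.push(message);
  },
};

const pending = remindUntilDecided(fakeCtx, TIERS);
```

With the fake clock, the first tier elapses without a decision and one reminder is sent; the second sleep crosses the 20-minute mark, so the loop returns the approval before sending a second reminder.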