ADR-0217

Agent-First Event-Driven Workflows

Status

Accepted

Accepted on 2026-03-07 after the architecture direction was proven live enough to stop pretending this was still speculative. Phase 1 (deterministic core + pilot cutovers + soak/operator proof), Phase 2 (bounded Haiku queue triage with live enforce proof on discovery + GitHub), and Phase 3 (Sonnet observation/control with earned pause/resume proof) are shipped for their scoped goals; Phases 4–6 remain open. Do not mark this ADR shipped overall until the agent-first workload interface is real and Inngest is actually decommissioned.

Intent

Build an event-driven system where LLM intelligence is the workflow engine, not just a consumer of one.

Traditional workflow engines (Inngest, Temporal, BullMQ) are developer-first: you configure static rules, they execute. Agent-first means the system reasons about what to do, when to do it, and whether it’s working — the way a competent human operator would. The human supervises; the system operates.

Removing Inngest is a side effect, not the goal. The goal is a system that handles event flow with judgment — contextual priority, adaptive throttling, self-diagnosis — so Joel stops thinking about queue plumbing entirely.

Desired Outcomes

The system operates like a competent operator. Specifically:

  1. Events are triaged, not just routed. The system decides priority based on content, context, time of day, and system state — not a static number someone hardcoded six months ago.

  2. Queue discipline adapts to reality. When a downstream API is rate-limited, the system backs off. When the cluster is degraded, non-critical work defers. When it’s 3am, background sync waits. No human adjusts knobs.

  3. Anomalies are diagnosed, not just retried. A handler failing 8 times in 2 hours gets root-cause analysis, not exponential backoff into silence. The system tells Joel what’s wrong and what it tried.

  4. The operator interface is natural language. “Queue depth is growing because Restate worker has high latency — I’ve paused tier-3 work and will resume when latency drops below 200ms.” Via Telegram, not a dashboard.

  5. Inngest is fully decommissioned. All 110+ functions and 144 event types migrated. One less pod, one less failure mode, one less external dependency for core behavior.

How We Know It Worked

  • Joel goes a month without manually adjusting priority, restarting a stuck queue, or debugging why an event didn’t fire
  • System self-heals from transient failures (rate limits, pod restarts, NAS blips) without human intervention
  • When something genuinely novel breaks, Joel gets a Telegram message with diagnosis and proposed fix — not a generic error alert
  • Inngest pod is deleted from k8s, all events flow through the new stack, zero regression

The Insight

joelclaw has been converging on this architecture across 18 ADRs without naming it:

ADR    What it does           Queue behavior it exhibits
0018   Redis event bridge     Message bus
0038   Gateway daemon         Always-on intelligent agent
0062   Heartbeat triage       Periodic system observation
0068   Proposal auto-triage   LLM-powered admission control (Haiku)
0078   Model tier registry    Cost-aware model selection
0090   O11y triage loop       Self-healing, anomaly routing
0207   Restate execution      Durable, journaled workflows
0216   Dkron scheduler        Cron triggers

40 files in system-bus already call infer(). The model tiers already exist in models.ts. The gateway already triages, routes, and self-heals. This ADR names what already exists and wires it into a unified architecture.

Decision

Architecture: Three Layers

                    ┌──────────────────────────────────────┐
                    │  Intelligence Layer (LLM model tiers) │
                    │                                       │
                    │  Haiku:  triage every event (~100ms)  │
                    │  Sonnet: observe system state (~30s)  │
                    │  Opus:   diagnose anomalies (on-demand)│
                    │                                       │
                    │  ★ Not required for correctness.      │
                    │    Offline → queue drains FIFO.       │
                    └──────────────┬───────────────────────┘
                                   │ scores, pauses, batches
                    ┌──────────────▼───────────────────────┐
                    │  Deterministic Core (~150 lines)      │
                    │                                       │
                    │  Redis sorted sets (priority queue)   │
                    │  Static registry (event → handlers)   │
                    │  Lua scripts (concurrency, rate limit)│
                    │  OTEL (every transition observable)   │
                    │                                       │
                    │  ★ Always correct. Can't break.       │
                    │    Works without the intelligence layer│
                    └──────────────┬───────────────────────┘
                                   │ HTTP /send
                    ┌──────────────▼───────────────────────┐
                    │  Restate (durable execution)          │
                    │                                       │
                    │  Journaled steps, exactly-once        │
                    │  DAG workflows, approval gates        │
                    │  Handlers use infer() as needed       │
                    └──────────────────────────────────────┘
 
    Dkron (cron) ─┐
    Webhooks      ├──→ emit() ──→ Intelligence ──→ Core ──→ Restate
    CLI           │                    ↑
    Restate emit ─┘        (Haiku triage or passthrough)

What “Agent-First” Means Concretely

Developer-first (Inngest):

// Static config. Set once, forget, hope it's still right.
createFunction(
  {
    id: "process-video",
    concurrency: { limit: 2 }, // hardcoded
    throttle: { limit: 10, period: "1m" }, // hardcoded
    priority: { run: "event.data.priority" }, // static field
  },
  { event: "video/requested" },
  handler,
);

Agent-first (joelclaw):

Event: video/requested arrives
 
Haiku (100ms): "Video download. NAS is healthy, no other video jobs queued,
  it's daytime. Priority: normal. Route to: video-ingest handler."
 
→ ZADD jc:q:normal {score} {event} → ZPOPMIN → POST to Restate
 
Later, 5 videos arrive in 30 seconds:
 
Sonnet (periodic): "5 video downloads queued in 30s. NAS write throughput
  is 660 MiB/s but transcription is CPU-bound. Setting concurrency to 2,
  queueing the rest. Notifying Joel: '5 videos queued, processing 2 at a time,
  ETA ~15 minutes for all.'"
 
Later, Mux API starts returning 429s:
 
Sonnet: "Mux rate limited. Pausing video-ingest queue. Will retry in 3 minutes.
  Other queues unaffected."
 
Later, video-ingest has failed 8 consecutive times:
 
Opus: "Video ingest failing: Mux webhook URL returns 502. Checked: Vercel
  deploy is healthy, but the webhook route has a new middleware that's crashing
  on multipart payloads. Likely cause: commit abc123 from yesterday's loop.
  Proposed fix: revert the middleware change. Creating Todoist task."

The intelligence isn’t a chatbot bolted onto a queue. It IS the queue discipline. Every decision a human operator would make — “should I process this now?”, “is the system healthy enough?”, “why does this keep failing?” — is made by the appropriate model tier.

Model Tiers

Tier           Model   When          Cost/day  Does what
Triage         Haiku   Every event   ~$0.02    Classify, score priority, dedup, confirm routing
Routing        Sonnet  Every 30-60s  ~$0.10    Observe system state, adjust queues, batch, throttle
Orchestration  Opus    On anomaly    ~$0.03    Root cause analysis, self-heal, escalate
Total                                ~$0.15

At 500 events/day, the LLM is the competitive advantage: it makes decisions no static system can match, and the cost is a rounding error. At 500k events/day this architecture would be absurd. We’re not building for 500k.

The Deterministic Core

~150 lines. Redis primitives. The correctness guarantee.

Need            Redis primitive
Priority queue  ZADD / ZPOPMIN sorted sets
Concurrency     INCR/DECR + Lua guard
Rate limit      Token bucket Lua (~40 lines)
Debounce        SET key NX EX ttl
Fan-out         Registry lookup + loop
Batch           ZADD + threshold/timer drain
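
The token-bucket rate limit is the only nontrivial Lua piece. A compressed sketch of the idea (ioredis eval; the key layout and field names are illustrative, and the shipped ~40-line script presumably covers more edge cases):

import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

// KEYS[1] = bucket key; ARGV = capacity, refill rate (tokens/sec), now (ms).
// Runs atomically inside Redis: refill by elapsed time, then take or refuse.
const TOKEN_BUCKET = `
local state = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local capacity = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local tokens = tonumber(state[1]) or capacity
local ts = tonumber(state[2]) or now
tokens = math.min(capacity, tokens + ((now - ts) / 1000) * rate)
local taken = 0
if tokens >= 1 then
  tokens = tokens - 1
  taken = 1
end
redis.call('HSET', KEYS[1], 'tokens', tokens, 'ts', now)
return taken
`;

async function tryTake(bucket: string, capacity = 10, perSec = 1): Promise<boolean> {
  const taken = await redis.eval(TOKEN_BUCKET, 1, bucket, capacity, perSec, Date.now());
  return taken === 1;
}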

Static event→handler registry in queue/registry.json (git-tracked). Without models, the core drains FIFO by static priority. Models offline = dumb queue. Models online = smart queue. Both correct.
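
For illustration, a hypothetical TypeScript view of one registry entry; the real queue/registry.json schema may differ, and the only rule confirmed later in this ADR is that inngest targets must name concrete events, not function ids:

// Illustrative shape only, not the shipped schema.
type HandlerTarget =
  | { type: "restate"; service: string; handler: string }
  | { type: "inngest"; event: string }; // concrete event name, never a function id

interface RegistryEntry {
  event: string;             // e.g. "github/workflow_run.completed"
  priority: number;          // static score used when the model layer is offline
  handlers: HandlerTarget[]; // fan-out = registry lookup + loop over this array
}

const example: RegistryEntry = {
  event: "github/workflow_run.completed",
  priority: 50,
  handlers: [{ type: "inngest", event: "github/workflow_run.completed" }],
};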

CLI: The Operator Interface

joelclaw emit <event> [-d data]      # enqueue
joelclaw queue depth                  # all queues
joelclaw queue list <name>            # pending events
joelclaw queue pause <name>           # pause
joelclaw queue drain <name>           # force drain
joelclaw queue priority <event> -s N  # reprioritize
joelclaw diagnose <queue|event>       # trigger Opus analysis

Alternatives Considered

BullMQ

Battle-tested Redis job queue with priority, concurrency, rate limiting. Rejected: Most of its value (retries, progress, stalled detection) is redundant with Restate. It’s the right answer for a system without durable execution or LLM intelligence. joelclaw has both. Adding BullMQ would mean paying for a library to get sorted sets and rate limiting — things Redis does natively in a few Lua scripts — while gaining zero intelligent queue management.

NATS JetStream

Lightweight messaging with persistence. Rejected: New infrastructure (another pod). No built-in priority, rate limiting, or debounce. We’d build all flow control on top anyway.

Raw Redis Streams

Append-only logs with consumer groups. Rejected: No priority ordering (FIFO only). Building priority on Streams requires multiple streams per level — sorted sets are the correct primitive.

Bespoke Redis without model layer

Same deterministic core, static config only. Not rejected — this is the fallback. If models prove unreliable for queue management, the core still works with hardcoded priority and concurrency. The model layer is additive. It can be disabled without breaking correctness.

Implementation Phases

Lifecycle note (2026-03-08): ADR status is accepted because the queue architecture and observation/control pilot are now proven live enough to stop treating them as speculative. Phases 1–3 are shipped scoped work. Phases 4–6 remain open work.

Detailed execution PRD: ADR-0217 Phase 1 Queue Execution Plan

Phase 2 execution PRD: ADR-0217 Phase 2 Triage Execution Plan

Phase 3 execution PRD: ADR-0217 Phase 3 Sonnet Observation PRD

Phase 4 execution PRD: ADR-0217 Phase 4 Agent-First Workload Ergonomics PRD

Runtime dependency PRD: Sandboxed Story Execution for Restate-Driven Work

Next batch PRD: ADR-0217 Next Batch — Restate-Native Sandboxed Workloads

The queue execution PRDs are the concrete build sheets: deterministic core first, bounded Haiku triage layer next. The next-batch PRD is the bridge from the proven queue/control surfaces into one real Restate-native sandboxed workload lane.

Execution substrate boundary (2026-03-06, updated 2026-03-07)

  • This ADR governs event ingress, routing, prioritization, and queue discipline.
  • It does not choose shared-workspace execution as the long-term file-mutation surface.
  • Autonomous story execution that changes code must follow ADR-0205 and ADR-0206, with the sandbox runtime PRD as the concrete build sheet.
  • ADR-0172 reservations remain coordination support only; they do not replace filesystem isolation.
  • The execution substrate is now in a clearer middle state: the local host-worker sandbox runner is proven and live, but the k8s Job runner is still next.
  • That means ADR-0217 queue work should proceed on the deterministic queue/drainer path now, while keeping the k8s runtime swap scoped to ADR-0205/0206 follow-on work.

Phase 1: Deterministic core

Ship packages/queue/. Prove events flow webhook → Redis → Restate → completion → OTEL.

  • Registry, emit(), drainer loop, Lua scripts, CLI commands, OTEL
  • Migrate 5 high-traffic events off Inngest
  • Current proof (2026-03-07): gateway-stopped queue drain and kill/restart replay are both now verified live for discovery/noted.
  • Current pilot cutovers (2026-03-07): all five Phase-1 pilot families now have reversible queue cutovers: discovery ingress behind QUEUE_PILOTS=discovery, discovery follow-up emission behind QUEUE_PILOTS=discovery-captured, launchd-backed content updates behind QUEUE_PILOTS=content, aggregate subscription requests behind QUEUE_PILOTS=subscriptions, and GitHub webhook ingress behind QUEUE_PILOTS=github for github/workflow_run.completed.
  • Current GitHub ingress proof (2026-03-07): a signed POST to /webhooks/github returned { queued: 1, direct: 0 }, the shared queue drained back to zero, Restate emitted queue.dispatch.started|completed, and downstream webhook-subscription OTEL proved the forwarded github/workflow_run.completed event hit Inngest.
  • Current discovery-captured proof (2026-03-07): a real discovery/noted event for https://github.com/mksglu/claude-context-mode drove discovery-capture to emit discovery.capture.forwarded with mode: queue, the queue drained back to zero, Restate emitted queue.dispatch.started|completed, and downstream X Discovery Hook, Typesense: Queue Vault Re-index, and system/content-sync runs proved the forwarded discovery/captured event hit Inngest consumers.
  • Current content-updated proof (2026-03-07): com.joel.content-sync-watcher now resolves from repo-canonical infra/launchd/com.joel.content-sync-watcher.plist and scripts/content-sync-watcher.sh; touching Vault/docs/decisions/0217-event-routing-queue-discipline.md caused the watcher to emit content/updated via joelclaw queue emit, Restate dispatched stream 1772905340608-0, and downstream system/content-sync plus Typesense: Queue Vault Re-index runs confirmed the event reached Inngest consumers.
  • Operator-path correction earned by proof: the content pilot surfaced a real CLI bug where joelclaw queue inspect <stream-id> crashed after ack because the error-envelope helper assumed nextActions was defined. The operator contract is now explicit: missing/expired messages return QUEUE_MESSAGE_MISSING with queue-state next actions instead of exploding.
  • Story 5 operator surface (2026-03-07): joelclaw queue stats is now the first soak/cutover CLI for the new queue path. It summarizes recent queue.dispatch.started|completed|failed OTEL from the Restate drainer into current queue depth, terminal success/failure counts, waitTimeMs latency percentiles, dispatch-duration percentiles, promotion counts, top event families, and recent failures so Joel can answer “is the queue healthy enough to widen cutover?” without spelunking Redis or raw Typesense queries.
  • Live queue-stats proof (2026-03-07): joelclaw queue stats --hours 1 on the installed CLI returned currentDepth.total=0, dispatches.started=9, dispatches.completed=9, dispatches.failed=0, queueLatencyMs.p95=1721, and queueLatencyMs.withinTarget=true on real pilot-family traffic. That is the first Story 5 cutover-gate readout earned from the operator surface itself.
  • Live burst proof (2026-03-07): a 12-event GitHub workflow_run.completed burst was sent through the real webhook ingress with valid signatures. Every request came back queued (queued: 1, direct: 0), joelclaw queue depth observed backlog 11 → … → 1 → 0 over roughly 22 seconds, and post-burst joelclaw queue stats --hours 1 showed dispatches.started=22, dispatches.completed=22, dispatches.failed=0, and github/workflow_run.completed delta +12. So the burst-drains-to-zero gate is now proved. The same window also showed queueLatencyMs.p95=20205 / withinTarget=false, which means the queue survives the burst but had not yet earned the latency gate for that burst-conditioned hour window.
  • Throughput retune (2026-03-07): the right fix was the drainer, not the dashboard. Root cause: after a successful dispatch the drainer still waited for the next QUEUE_DRAIN_INTERVAL_MS heartbeat before leasing the next ready message, so concurrency 1 effectively meant one send every ~2 seconds even when Restate accepted work in ~100-200ms. The drainer now self-pulses immediately when backlog remains and a slot frees; the interval is only the idle poll / retry heartbeat. After restarting the local Restate worker, the same 12-event GitHub burst drained 11 → 9 → 6 → 4 → 2 → 0 in about 4556ms, and joelclaw queue stats --hours 1 --limit 24 isolated the fresh sample with dispatches.started=12, dispatches.completed=12, dispatches.failed=0, queueLatencyMs.p95=3718, and withinTarget=true.
  • Soak hygiene correction (2026-03-07): the ugly 8-10 minute content/updated waits that re-poisoned the soak window were traced to an ad-hoc host Restate lifecycle, not a new throughput regression. The live worker that came back at 19:09:53Z was a manual nohup bun run packages/restate/src/index.ts process (not a canonical launchd service), so its opaque SIGTERM/restart dirtied the same queue-stats window. The fix was to move the long-running host Restate worker onto repo-managed launchd assets: infra/launchd/com.joel.restate-worker.plist plus scripts/restate/start.sh. The wrapper loads shared env, prevents headless CHANNEL=console, forwards SIGTERM to Bun, cleans stale :9080 listeners, and opportunistically re-registers the deployment. Story 5 soak windows that span queue.drainer.started during rollout are therefore dirty by definition; post-rollout evidence should start from the supervised launchd runtime.
  • Anchored soak operator fix (2026-03-07): joelclaw queue stats now accepts --since <iso|ms> so operators can pin the sample lower bound to a known-clean moment instead of mixing fresh pilot traffic with a dirty pre-fix hour window. This was immediately proved on the installed CLI after fixing one real parser bug in the option handling.
  • Fresh supervised burst proof (2026-03-07): using the new anchored window (joelclaw queue stats --since 1772916656015 --limit 64), a supervised 24-event GitHub workflow_run.completed burst produced dispatches.started=24, dispatches.completed=24, dispatches.failed=0, queueLatencyMs.p95=2829, dispatchDurationMs.p95=229, withinTarget=true, and live queue depth returned to 0. That re-earns the burst-conditioned latency gate on the clean post-launchd runtime instead of the earlier dirty hour window.
  • Discovery blocker root cause + fix (2026-03-07): the suspicious QUEUE_PILOTS=discovery joelclaw discover <url> behavior turned out not to be a lying CLI path; the queue drainer had actually stalled inside an otherwise-running launchd Restate worker. Raw Redis showed queued content/updated + discovery/noted entries piling up with no fresh queue.dispatch.* telemetry, and a launchctl kickstart -k gui/$(id -u)/com.joel.restate-worker immediately replayed the backlog back to zero. The fix was a drainer watchdog: when backlog remains but no progress occurs past QUEUE_DRAIN_STALL_AFTER_MS, the worker now emits queue.drainer.stalled and exits non-zero so launchd can recover and replay instead of letting pilot traffic wedge silently.
  • Story 5 closeout proof (2026-03-07): after restarting the watchdog-enabled worker at 2026-03-07T21:23:47.592Z, a clean anchored window proved the previously suspect discovery ingress plus the GitHub burst path on the supervised runtime: QUEUE_PILOTS=discovery joelclaw discover https://example.com/?adr217-final-discovery=... produced a queued discovery/noted, downstream discovery-capture completed, X Discovery Hook completed, and the same anchored joelclaw queue stats --since 1772918627592 --limit 128 sample later showed dispatches.started=15, dispatches.completed=15, dispatches.failed=0, queueLatencyMs.p95=1584, dispatchDurationMs.p95=201, withinTarget=true, currentDepth.total=0, with event families led by github/workflow_run.completed and including both discovery/noted and discovery/captured. The content watcher also continued to drive system/content-sync in that same clean window, while the subscriptions pilot remained backed by its earlier dedicated live queue proof. That closes the no-silent-drops gate honestly enough to call Story 5 done.
  • Registry rule clarified by bug fix: queue drainer handler.target values for type: "inngest" must be concrete Inngest event names, not function ids. Using subscription/check-feeds or github/workflow-run-completed there is wrong because the drainer posts events.
  • Done when: done on 2026-03-07 for the Phase-1 scope — five pilot families now have reversible cutovers, live drain/replay proofs, a CLI-first soak surface, and a supervised drainer recovery path instead of silent wedge behavior.

Phase 2: Haiku triage

Execution PRD: ADR-0217 Phase 2 Triage Execution Plan

Wire Haiku into the ingest path. Every event triaged before it hits the sorted set.

  • Triage prompt: system context + event payload → priority score + handler confirmation (a hypothetical result shape is sketched after this list)
  • Dedup, bounded route confirmation/mismatch signal, OTEL with reasoning
  • Operator surface reports suggested vs applied decisions, fallback reasons, and latency
  • Fallback: Haiku unavailable → static priority from registry
  • Live Story 5 closeout proof (2026-03-07): with QUEUE_TRIAGE_MODE=shadow, QUEUE_TRIAGE_FAMILIES=discovery,github, and QUEUE_TRIAGE_ENFORCE_FAMILIES=discovery,github enabled in ~/.config/system-bus.env, the anchored window joelclaw queue stats --since 1772926820195 showed dispatches.started=9, dispatches.completed=9, dispatches.failed=0, queueLatencyMs.p95=1768, triage attempts=7, fallbacks=1, fallbackByReason=[timeout:1], and appliedChanges=3 on the GitHub burst path. The deliberate 1-second timeout drill still returned a queued discovery/noted with fallbackReason=timeout, and joelclaw webhook stream whs_mmgys95ob3rremex --timeout 10 --replay 10 replayed all three signed GitHub burst events downstream.
  • Honest caveat (2026-03-07): global joelclaw queue depth initially remained 22, but joelclaw queue list --limit 30 showed the residue was unrelated older content/updated + pre-soak discovery backlog. The fresh Story 5 discovery/noted stream ids were already missing from joelclaw queue inspect, so the enforced-family sample drained even though the overall queue was not pristine.
  • Backlog cleanup closeout (2026-03-07): the supervised com.joel.restate-worker restart did not reap those 22 stale never-claimed entries, so a bounded manual cleanup used @joelclaw/queue ack() to remove only orphaned content/updated + discovery/noted stream ids older than 1 hour after confirming zero pending leases. joelclaw queue depth then returned to 0 and joelclaw queue list went empty, so the Story 5 caveat is now operationally closed.
  • Done when: done on 2026-03-07 for the Phase 2 scope — events now get contextual priority scoring with bounded fallback/apply behavior visible in OTEL + CLI, and the two earned pilot families have live enforced mode without blocking enqueue
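
As a concrete anchor for the triage bullets above, a hypothetical result shape; the field names are illustrative, while the suggested-vs-applied split, bounded fallback reasons, and OTEL-visible reasoning are the behaviors the proofs exercise:

// Hypothetical envelope; only the behaviors it encodes are confirmed above.
interface TriageResult {
  priority: number;        // score written to the sorted set
  routeConfirmed: boolean; // bounded confirmation/mismatch vs the registry
  duplicateOf?: string;    // dedup hit, if any
  reasoning: string;       // emitted with OTEL
  applied: boolean;        // false in shadow mode (suggestion only)
  fallbackReason?: "timeout" | "unavailable"; // static registry priority used
}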

Phase 3: Sonnet observation

Execution PRD: ADR-0217 Phase 3 Sonnet Observation PRD

Wire Sonnet into the host-worker queue observation layer as a periodic observer with gateway-aware reporting.

  • Reads queue depth, Restate load, OTEL error rates, time/sleep state, and active deterministic pauses
  • Actions: reprioritize, pause, resume, batch, shed, escalate (a hypothetical contract sketch follows this phase’s bullets)
  • Telegram reporting in natural language
  • Current Story 1 proof (2026-03-07): monorepo commit c4a9dc8172091d6171f83f46218e076ec8d05b3b added canonical queue-observation contracts to @joelclaw/queue, packages/system-bus/src/lib/queue-observe.ts now owns the bounded Sonnet prompt/schema/fallback helpers plus queue.observe.started|completed|failed|fallback and queue.control.applied|expired|rejected, and focused tests/docs now pin the observation layer as contract-only work before any deterministic control plane exists
  • Current Story 2 proof (2026-03-07): monorepo commit c3a31458ddfeac61c0f4b5a4ea8230bb1176472d added the installed CLI dry-run operator surface joelclaw queue observe; an anchored live proof window (since=1772930468718) returned decision.suggestedActions=[noop], appliedCount=0, history.attempts=1, history.completed=1, history.fallbacks=0, and control.available=false, while the latest raw queue-observe OTEL event was queue.observe.completed
  • Current Story 3 proof (2026-03-08): monorepo commit 7d50d389d0d6e0ff0ec6003611037ee2462a5d2c shipped deterministic queue-control state plus manual CLI apply/inspect surfaces. The first live proof attempts exposed a real operator-truth bug: queue commands were honoring an ambient shell REDIS_URL and writing pause state to the wrong Redis while the worker/drainer still used localhost from ~/.config/system-bus.env. After wiring redisUrl into the canonical CLI config and rebuilding the installed binary, the anchored pause/resume window (since=1772932667032) proved content/updated stream 1772932667379-0 stayed queued for 5 seconds while paused and then drained after manual resume, and the TTL window (since=1772932701625) proved queue.control.expired from both joelclaw queue control status and raw queue-control OTEL.
  • Current Story 4 code truth (2026-03-08): monorepo commits d637c12dc7149e30d0547eb55886371bd1ad332e, 353b81d1aa6f04d52c6c596b70d796525d0ac3a2, 9741bc1f8cd72735f16d39482af81a5f8505b046, and f59ac37d now cover the full bounded observer path: host-worker queue/observer, QUEUE_OBSERVER_* flags, active-pause-aware snapshots, tolerant summary trimming, the hardened prompt/parser contract, and the settled-observer-pause health fix that deterministically emits resume_family instead of reading an intentional hold as permanent downstream failure.
  • Current Story 4 live truth (2026-03-08): dry-run is earned on host, bounded content backlog probes now complete with real queue/observer outputs, and manual queue/observer.requested probes no longer wait behind the cron pass because they run through the separate read-only queue/observer-requested singleton path. Empty snapshots — even when an old manual pause is still active — now short-circuit to deterministic noop instead of burning a full Sonnet call, and idle empty snapshots with no recent failures now report downstreamState=healthy instead of a noisy inherited degraded label. The first supervised enforce canary anchored at since=1772981290859 temporarily booted Restate out, seeded 30 queued content/updated events, and let the cron observer see a real down/backlogged snapshot. On snapshot cca656f7-a9ce-4ca2-9f6d-0ed332f56a4d, the observer auto-applied pause_family, emitted queue.control.applied, queued the Telegram operator report, and the remaining backlog drained back to depth 0 after a supervised manual resume. The patched follow-up canary anchored at since=1772985057594 then proved the automatic resume leg too: it paused on snapshot 1cb24e7b-f0cd-4e0c-ae5d-27cb4934b49a, auto-resumed on snapshot 151aa03a-fced-41f0-9a54-2f3d1a70856d / run 01KK72HD0EMT3T34K8QP3SMEW9, and drained the held content item back to queue depth 0 without manual cleanup. The steady-state worker was then rolled back to QUEUE_OBSERVER_MODE=dry-run, so enforce remains a deliberate drill rather than the default runtime posture.
  • Current Story 5 soak truth (2026-03-08): yes, the soak has been running for a while. joelclaw queue observe --hours 12 found 947 raw queue.observe.* OTEL documents (CLI sample truncated to the latest 200 docs), with the sampled history showing attempts=101, completed=99, failed=0, fallbacks=0, and successRate=1. joelclaw queue stats --hours 12 showed dispatches.started=66, dispatches.completed=66, dispatches.failed=0, and currentDepth.total=0. The explicit off-mode sanity window is now also earned: with QUEUE_OBSERVER_MODE=off anchored at since=1772984026035, three queued content/updated canary events drained cleanly back to depth 0, queue stats showed dispatches.started=3 / completed=3 / failed=0 with withinTarget=true, queue control stayed empty, and the latest cron queue/observer run (01KK71AYHGWNMCR3J5Z20WN614) returned { status: "disabled", mode: "off" }. The automatic resume leg is now earned too: the supervised follow-up canary anchored at since=1772985057594 produced a full automatic pause_family → resume_family cycle on content/updated with queue depth returning to 0.
  • Human sanity pass: Joel reviewed the earned evidence and replied ok ready to proceed, which closes the last Story 5 gate and makes the Phase 4 handoff explicit.
  • Current blocker: none for Story 5. The queue observer pilot has now earned the runtime proofs plus operator sign-off.
  • Done when: automatic pause/resume, off-mode sanity, and Joel’s operator sanity pass all land on the same truthful timeline. All three have now landed, so Story 5 is done.
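
A hypothetical sketch of the bounded control contract the proofs above exercise; the action names noop / pause_family / resume_family and the queue.control.* OTEL events are real, while the field names are illustrative:

type ObserverAction =
  | { kind: "noop" }
  | { kind: "pause_family"; family: string; ttlMs: number; reason: string }
  | { kind: "resume_family"; family: string; reason: string };

interface ObserverDecision {
  snapshotId: string;                 // e.g. the snapshot ids quoted above
  suggestedActions: ObserverAction[];
  applied: boolean;                   // dry-run suggests only; enforce auto-applies
}

// Every control transition is observable as OTEL:
// queue.control.applied | queue.control.expired | queue.control.rejected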

Phase 4: Agent-first workload ergonomics

Make coding and repo work legible from the operator surface instead of forcing agents to reason about the runtime substrate first.

  • Ship one obvious front door for workload planning and dispatch (plan, run, status, explain, cancel)
  • Capture Joel steering as a stable workload-planning contract
  • Make serial, parallel, and chained execution first-class coding/repo workload shapes
  • Keep runtime selection behind the boundary: inline, durable, sandboxed, looped, or blocked should be planner decisions, not caller archaeology
  • Narrow substrate skills like restate-workflows back to substrate work and give coding agents a proper front-door skill
  • Current Story 1 truth (2026-03-08): docs/workloads.md now defines the first canonical workload vocabulary plus request / plan / handoff schema, and skills/agent-workloads/ is the front-door skill for using that contract.
  • Current Story 2 truth (2026-03-08): monorepo commit 507ce5703faa90060b98bbc33ba8683881f81e97 ships planner-only joelclaw workload plan. It emits the canonical request + plan envelope, infers kind / shape / mode / backend when the caller leaves them open, and keeps run|status|explain|cancel explicitly unshipped. Dogfooding immediately earned follow-up heuristic fixes: mixed implementation+docs intents no longer collapse into repo.docs, sandbox comparisons no longer force sandbox mode unless isolation is actually requested, extend ... verify ... then update README now stays implementation-shaped, and nouns like published skills no longer imply deploy-allowed.
  • Current Story 3 truth (2026-03-08): monorepo commits ffdaa6e285dc35e2e14249d2c5f8f5c50a9a23c1 and 45a246f0 now ship the first real planning + dispatch ergonomics slice inside joelclaw workload: presets, --paths-from status|head|recent:<n>, --write-plan JSON artifacts, prompt-aware acceptance preservation from Acceptance:, chained milestone decomposition from Goal:, optional reflect and update plan stages, less trigger-happy durable routing for supervised repo work using proof=canary|soak, and stage-specific workload dispatch guidance that can say when dispatch is overkill. The guidance contract now includes an explicit executionLoop (plan -> approve -> execute/watch -> summarize) so bounded local slices steer toward reserve -> execute -> verify -> commit -> ask whether to push instead of drifting into dispatch/queue theatre. The harness-fix dogfood prompt still stays inline / host, keeps the provided acceptance criteria, and emits seven explicit stages instead of generic sludge; a rerun against active gremlin cleanup work preserves six concrete cleanup milestones plus a reusable plan artifact seeded from recent git history, and dispatch over the bounded repo-honesty slice now truthfully reports dispatch-is-overkill-keep-it-inline.
  • Current Story 4 truth (2026-03-08): skills/agent-workloads/SKILL.md, docs/workloads.md, docs/cli.md, and docs/skills.md now agree on the front-door posture: agents start with workload shape, read the returned guidance, ask approved?, and then follow the recommended execution loop. restate-workflows stays substrate-adjacent instead of pretending runtime docs are the front door for ordinary repo work.
  • Current Story 5 truth (2026-03-08): real dogfood is now earned for bounded serial/chained repo work: the harness-fix prompt and the gremlin repo-honesty cleanup both produce path-scoped inline plans, and the latter now carries truthful dispatch guidance instead of encouraging queue theatre. Parallel remains documented and example-backed, but still needs one real repo proof before Story 5 can be called done.
  • Immediate next-step decision: keep dogfooding real serial / parallel / chained repo work and tighten handoff follow-through; a standalone workload status surface is still theatre until there is more real scheduled execution to observe.
  • Done when: another agent can take a repo/coding task, choose the right workload shape, and hand it off safely without needing to understand Restate internals, queue families, or sandbox backend trivia

Phase 5: Full Inngest migration

Once the workload front door is agent-first and stable, migrate the remaining runtime ownership behind that interface and decommission Inngest as a consequence rather than the product story.

  • Functions → Restate services. Crons → Dkron. Events → deterministic core.
  • Done when: kubectl delete statefulset inngest-0 -n joelclaw and nothing breaks

Phase 6: Opus orchestration

Deep reasoning for anomaly diagnosis and complex queue decisions.

  • Sonnet escalates to Opus on sustained failures or novel patterns
  • Opus proposes fixes, registry changes, new handlers
  • Done when: Opus diagnoses a real production issue without human prompting

Consequences

What changes

  • joelclaw owns event routing end-to-end. No external dependency for core workflow behavior.
  • Queue decisions are intelligent and adaptive, not static config that rots.
  • The system communicates its queue reasoning in natural language.
  • One fewer k8s pod (Inngest), one fewer failure mode.

What’s harder

  • We own every bug in the queue core (mitigated: ~150 lines of Redis primitives).
  • Three-layer mental model vs Inngest’s monolith (mitigated: each layer is simpler than Inngest alone).
  • Model latency in triage path (mitigated: 100ms at 500 events/day is nothing).

Risks

  • Redis drops queued events → Restate journal is durability layer. Queue loss = re-emit, not data loss.
  • Model hallucinates routing → Can adjust priority/pause but cannot modify registry or bypass the deterministic core.
  • Gateway outage → Deterministic drainer runs independently. Dkron bypasses queue for critical crons.