Agent-First Event-Driven Workflows
Status
Accepted
Accepted on 2026-03-07 after the architecture direction was proven live enough to stop pretending this was still speculative. Phase 1 (deterministic core + pilot cutovers + soak/operator proof), Phase 2 (bounded Haiku queue triage with live enforce proof on discovery + GitHub), and Phase 3 (Sonnet observation/control with earned pause/resume proof) are shipped for their scoped goals; Phases 4–6 remain open. Do not mark this ADR shipped overall until the agent-first workload interface is real and Inngest is actually decommissioned.
Intent
Build an event-driven system where LLM intelligence is the workflow engine, not just a consumer of one.
Traditional workflow engines (Inngest, Temporal, BullMQ) are developer-first: you configure static rules, they execute. Agent-first means the system reasons about what to do, when to do it, and whether it’s working — the way a competent human operator would. The human supervises; the system operates.
Removing Inngest is a side effect, not the goal. The goal is a system that handles event flow with judgment — contextual priority, adaptive throttling, self-diagnosis — so Joel stops thinking about queue plumbing entirely.
Desired Outcomes
The system operates like a competent operator. Specifically:
- Events are triaged, not just routed. The system decides priority based on content, context, time of day, and system state — not a static number someone hardcoded six months ago.
- Queue discipline adapts to reality. When a downstream API is rate-limited, the system backs off. When the cluster is degraded, non-critical work defers. When it’s 3am, background sync waits. No human adjusts knobs.
- Anomalies are diagnosed, not just retried. A handler failing 8 times in 2 hours gets root-cause analysis, not exponential backoff into silence. The system tells Joel what’s wrong and what it tried.
- The operator interface is natural language. “Queue depth is growing because Restate worker has high latency — I’ve paused tier-3 work and will resume when latency drops below 200ms.” Via Telegram, not a dashboard.
- Inngest is fully decommissioned. All 110+ functions and 144 event types migrated. One less pod, one less failure mode, one less external dependency for core behavior.
How We Know It Worked
- Joel goes a month without manually adjusting priority, restarting a stuck queue, or debugging why an event didn’t fire
- System self-heals from transient failures (rate limits, pod restarts, NAS blips) without human intervention
- When something genuinely novel breaks, Joel gets a Telegram message with diagnosis and proposed fix — not a generic error alert
- Inngest pod is deleted from k8s, all events flow through the new stack, zero regression
The Insight
joelclaw has been converging on this architecture across 18 ADRs without naming it:
| ADR | What it does | Queue behavior it exhibits |
|---|---|---|
| 0018 | Redis event bridge | Message bus |
| 0038 | Gateway daemon | Always-on intelligent agent |
| 0062 | Heartbeat triage | Periodic system observation |
| 0068 | Proposal auto-triage | LLM-powered admission control (Haiku) |
| 0078 | Model tier registry | Cost-aware model selection |
| 0090 | O11y triage loop | Self-healing, anomaly routing |
| 0207 | Restate execution | Durable, journaled workflows |
| 0216 | Dkron scheduler | Cron triggers |
40 files in system-bus already call infer(). The model tiers already exist in models.ts. The gateway already triages, routes, and self-heals. This ADR names what already exists and wires it into a unified architecture.
Decision
Architecture: Three Layers
┌──────────────────────────────────────┐
│ Intelligence Layer (LLM model tiers) │
│ │
│ Haiku: triage every event (~100ms) │
│ Sonnet: observe system state (~30s) │
│ Opus: diagnose anomalies (on-demand)│
│ │
│ ★ Not required for correctness. │
│ Offline → queue drains FIFO. │
└──────────────┬───────────────────────┘
│ scores, pauses, batches
┌──────────────▼───────────────────────┐
│ Deterministic Core (~150 lines) │
│ │
│ Redis sorted sets (priority queue) │
│ Static registry (event → handlers) │
│ Lua scripts (concurrency, rate limit)│
│ OTEL (every transition observable) │
│ │
│ ★ Always correct. Can't break. │
│ Works without the intelligence layer│
└──────────────┬───────────────────────┘
│ HTTP /send
┌──────────────▼───────────────────────┐
│ Restate (durable execution) │
│ │
│ Journaled steps, exactly-once │
│ DAG workflows, approval gates │
│ Handlers use infer() as needed │
└──────────────────────────────────────┘
Dkron (cron) ─┐
Webhooks ├──→ emit() ──→ Intelligence ──→ Core ──→ Restate
CLI │ ↑
Restate emit ─┘            (Haiku triage or passthrough)
What "Agent-First" Means Concretely
Developer-first (Inngest):
// Static config. Set once, forget, hope it's still right.
createFunction(
{
id: "process-video",
concurrency: { limit: 2 }, // hardcoded
throttle: { limit: 10, period: "1m" }, // hardcoded
priority: { run: "event.data.priority" }, // static field
},
{ event: "video/requested" },
handler,
);
Agent-first (joelclaw):
Event: video/requested arrives
Haiku (100ms): "Video download. NAS is healthy, no other video jobs queued,
it's daytime. Priority: normal. Route to: video-ingest handler."
→ ZADD jc:q:normal {score} {event} → ZPOPMIN → POST to Restate
Later, 5 videos arrive in 30 seconds:
Sonnet (periodic): "5 video downloads queued in 30s. NAS write throughput
is 660 MiB/s but transcription is CPU-bound. Setting concurrency to 2,
queueing the rest. Notifying Joel: '5 videos queued, processing 2 at a time,
ETA ~15 minutes for all.'"
Later, Mux API starts returning 429s:
Sonnet: "Mux rate limited. Pausing video-ingest queue. Will retry in 3 minutes.
Other queues unaffected."
Later, video-ingest has failed 8 consecutive times:
Opus: "Video ingest failing: Mux webhook URL returns 502. Checked: Vercel
deploy is healthy, but the webhook route has a new middleware that's crashing
on multipart payloads. Likely cause: commit abc123 from yesterday's loop.
Proposed fix: revert the middleware change. Creating Todoist task."
The intelligence isn’t a chatbot bolted onto a queue. It IS the queue discipline. Every decision a human operator would make — “should I process this now?”, “is the system healthy enough?”, “why does this keep failing?” — is made by the appropriate model tier.
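The ZADD/ZPOPMIN flow above hinges on one detail: how a triage decision becomes a sorted-set score. A minimal sketch, assuming a banded encoding (the band names and width here are illustrative, not the shipped scheme) — lower scores pop first, so urgent work jumps the line while same-priority events stay FIFO by arrival time:

```typescript
// Hypothetical sketch: fold (priority band, arrival time) into a single
// sorted-set score so ZPOPMIN yields highest-priority work first and
// same-priority events drain FIFO by arrival time.
type Priority = "urgent" | "normal" | "background";

const BAND: Record<Priority, number> = {
  urgent: 0,      // lowest score band → popped first
  normal: 1,
  background: 2,
};

// One band spans far more than any plausible millisecond timestamp,
// so bands never overlap and the sum stays within safe float precision.
const BAND_WIDTH = 2 ** 43;

function queueScore(priority: Priority, arrivedAtMs: number): number {
  return BAND[priority] * BAND_WIDTH + arrivedAtMs;
}

// The drainer would use this score with the Redis primitives named above:
//   redis.zadd("jc:q:main", queueScore(p, Date.now()), eventId)
//   redis.zpopmin("jc:q:main")
```

When Haiku is offline, the same function works with the registry's static priority — which is exactly the "dumb queue, still correct" fallback posture.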
Model Tiers
| Tier | Model | When | Cost/day | Does what |
|---|---|---|---|---|
| Triage | Haiku | Every event | ~$0.02 | Classify, score priority, dedup, confirm routing |
| Routing | Sonnet | Every 30-60s | ~$0.10 | Observe system state, adjust queues, batch, throttle |
| Orchestration | Opus | On anomaly | ~$0.03 | Root cause analysis, self-heal, escalate |
| Total | | | ~$0.15 | |
At 500 events/day, the LLM is the competitive advantage: it makes decisions no static system can match, and the cost is a rounding error. At 500k events/day this architecture would be absurd. We’re not building for 500k.
The Deterministic Core
~150 lines. Redis primitives. The correctness guarantee.
| Need | Redis primitive |
|---|---|
| Priority queue | ZADD / ZPOPMIN sorted sets |
| Concurrency | INCR/DECR + Lua guard |
| Rate limit | Token bucket Lua (~40 lines) |
| Debounce | SET key NX EX ttl |
| Fan-out | Registry lookup + loop |
| Batch | ZADD + threshold/timer drain |
Static event→handler registry in queue/registry.json (git-tracked). Without models, the core drains FIFO by static priority. Models offline = dumb queue. Models online = smart queue. Both correct.
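As one concrete example of how small these primitives are, here is the token-bucket arithmetic a rate-limit Lua script would run. This is a pure TypeScript sketch for readability — the real version must live in Lua so the read-modify-write is atomic inside Redis, and the names here are illustrative:

```typescript
// Illustrative token bucket: the state would live in Redis keyed per
// downstream API; the Lua script performs this same arithmetic atomically.
interface Bucket {
  tokens: number;     // tokens currently available
  updatedMs: number;  // last refill timestamp (ms)
}

function tryTake(
  bucket: Bucket,
  nowMs: number,
  capacity: number,      // max burst size
  refillPerSec: number,  // steady-state rate
): boolean {
  // Refill based on elapsed time, capped at capacity.
  const elapsedSec = (nowMs - bucket.updatedMs) / 1000;
  bucket.tokens = Math.min(capacity, bucket.tokens + elapsedSec * refillPerSec);
  bucket.updatedMs = nowMs;
  if (bucket.tokens < 1) return false; // rate limited → event stays queued
  bucket.tokens -= 1;
  return true;
}
```

A denied take leaves the event in the sorted set for a later drain pass, so rate limiting never drops work — it only delays it.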
CLI: The Operator Interface
joelclaw emit <event> [-d data] # enqueue
joelclaw queue depth # all queues
joelclaw queue list <name> # pending events
joelclaw queue pause <name> # pause
joelclaw queue drain <name> # force drain
joelclaw queue priority <event> -s N # reprioritize
joelclaw diagnose <queue|event> # trigger Opus analysis
Alternatives Considered
BullMQ
Battle-tested Redis job queue with priority, concurrency, rate limiting. Rejected: Most of its value (retries, progress, stalled detection) is redundant with Restate. It’s the right answer for a system without durable execution or LLM intelligence. joelclaw has both. Adding BullMQ would mean paying for a library to get sorted sets and rate limiting — things Redis does natively in a few Lua scripts — while gaining zero intelligent queue management.
NATS JetStream
Lightweight messaging with persistence. Rejected: New infrastructure (another pod). No built-in priority, rate limiting, or debounce. We’d build all flow control on top anyway.
Raw Redis Streams
Append-only logs with consumer groups. Rejected: No priority ordering (FIFO only). Building priority on Streams requires multiple streams per level — sorted sets are the correct primitive.
Bespoke Redis without model layer
Same deterministic core, static config only. Not rejected — this is the fallback. If models prove unreliable for queue management, the core still works with hardcoded priority and concurrency. The model layer is additive. It can be disabled without breaking correctness.
Implementation Phases
Lifecycle note (2026-03-08): ADR status is accepted because the queue architecture and observation/control pilot are now proven live enough to stop treating them as speculative. Phases 1–3 are shipped scoped work. Phases 4–6 remain open work.
Detailed execution PRD: ADR-0217 Phase 1 Queue Execution Plan
Phase 2 execution PRD: ADR-0217 Phase 2 Triage Execution Plan
Phase 3 execution PRD: ADR-0217 Phase 3 Sonnet Observation PRD
Phase 4 execution PRD: ADR-0217 Phase 4 Agent-First Workload Ergonomics PRD
Runtime dependency PRD: Sandboxed Story Execution for Restate-Driven Work
Next batch PRD: ADR-0217 Next Batch — Restate-Native Sandboxed Workloads
Those queue execution PRDs are the concrete build sheets for the deterministic core first and the bounded Haiku triage layer next. The new next-batch PRD is the bridge from proven queue/control surfaces into one real Restate-native sandboxed workload lane.
Execution substrate boundary (2026-03-06, updated 2026-03-07)
- This ADR governs event ingress, routing, prioritization, and queue discipline.
- It does not choose shared-workspace execution as the long-term file-mutation surface.
- Autonomous story execution that changes code must follow ADR-0205 and ADR-0206, with the sandbox runtime PRD as the concrete build sheet.
- ADR-0172 reservations remain coordination support only; they do not replace filesystem isolation.
- The execution substrate is now in a clearer middle state: the local host-worker sandbox runner is proven and live, but the k8s Job runner is still next.
- That means ADR-0217 queue work should proceed on the deterministic queue/drainer path now, while keeping the k8s runtime swap scoped to ADR-0205/0206 follow-on work.
Phase 1: Deterministic core
Ship packages/queue/. Prove events flow webhook → Redis → Restate → completion → OTEL.
- Registry, `emit()`, drainer loop, Lua scripts, CLI commands, OTEL
- Migrate 5 high-traffic events off Inngest
- Current proof (2026-03-07): gateway-stopped queue drain and kill/restart replay are both now verified live for `discovery/noted`.
- Current pilot cutovers (2026-03-07): all five Phase-1 pilot families now have reversible queue cutovers: discovery ingress behind `QUEUE_PILOTS=discovery`, discovery follow-up emission behind `QUEUE_PILOTS=discovery-captured`, launchd-backed content updates behind `QUEUE_PILOTS=content`, aggregate subscription requests behind `QUEUE_PILOTS=subscriptions`, and GitHub webhook ingress behind `QUEUE_PILOTS=github` for `github/workflow_run.completed`.
- Current GitHub ingress proof (2026-03-07): a signed POST to `/webhooks/github` returned `{ queued: 1, direct: 0 }`, the shared queue drained back to zero, Restate emitted `queue.dispatch.started|completed`, and downstream webhook-subscription OTEL proved the forwarded `github/workflow_run.completed` event hit Inngest.
- Current discovery-captured proof (2026-03-07): a real `discovery/noted` event for `https://github.com/mksglu/claude-context-mode` drove `discovery-capture` to emit `discovery.capture.forwarded` with `mode: queue`, the queue drained back to zero, Restate emitted `queue.dispatch.started|completed`, and downstream `X Discovery Hook`, `Typesense: Queue Vault Re-index`, and `system/content-sync` runs proved the forwarded `discovery/captured` event hit Inngest consumers.
- Current content-updated proof (2026-03-07): `com.joel.content-sync-watcher` now resolves from repo-canonical `infra/launchd/com.joel.content-sync-watcher.plist` and `scripts/content-sync-watcher.sh`; touching `Vault/docs/decisions/0217-event-routing-queue-discipline.md` caused the watcher to emit `content/updated` via `joelclaw queue emit`, Restate dispatched stream `1772905340608-0`, and downstream `system/content-sync` plus `Typesense: Queue Vault Re-index` runs confirmed the event reached Inngest consumers.
- Operator-path correction earned by proof: the content pilot surfaced a real CLI bug where `joelclaw queue inspect <stream-id>` crashed after ack because the error-envelope helper assumed `nextActions` was defined. The operator contract is now explicit: missing/expired messages return `QUEUE_MESSAGE_MISSING` with queue-state next actions instead of exploding.
- Story 5 operator surface (2026-03-07): `joelclaw queue stats` is now the first soak/cutover CLI for the new queue path. It summarizes recent `queue.dispatch.started|completed|failed` OTEL from the Restate drainer into current queue depth, terminal success/failure counts, `waitTimeMs` latency percentiles, dispatch-duration percentiles, promotion counts, top event families, and recent failures so Joel can answer “is the queue healthy enough to widen cutover?” without spelunking Redis or raw Typesense queries.
- Live queue-stats proof (2026-03-07): `joelclaw queue stats --hours 1` on the installed CLI returned `currentDepth.total=0`, `dispatches.started=9`, `dispatches.completed=9`, `dispatches.failed=0`, `queueLatencyMs.p95=1721`, and `queueLatencyMs.withinTarget=true` on real pilot-family traffic. That is the first Story 5 cutover-gate readout earned from the operator surface itself.
- Live burst proof (2026-03-07): a 12-event GitHub `workflow_run.completed` burst was sent through the real webhook ingress with valid signatures. Every request came back queued (`queued: 1, direct: 0`), `joelclaw queue depth` observed backlog `11 → … → 1 → 0` over roughly 22 seconds, and post-burst `joelclaw queue stats --hours 1` showed `dispatches.started=22`, `dispatches.completed=22`, `dispatches.failed=0`, and a `github/workflow_run.completed` delta of `+12`. So the burst-drains-to-zero gate is now proved. The same window also showed `queueLatencyMs.p95=20205` / `withinTarget=false`, which means the queue survives the burst but had not yet earned the latency gate for that burst-conditioned hour window.
- Throughput retune (2026-03-07): the right fix was the drainer, not the dashboard. Root cause: after a successful dispatch the drainer still waited for the next `QUEUE_DRAIN_INTERVAL_MS` heartbeat before leasing the next ready message, so concurrency `1` effectively meant one send every ~2 seconds even when Restate accepted work in ~100-200ms. The drainer now self-pulses immediately when backlog remains and a slot frees; the interval is only the idle poll / retry heartbeat. After restarting the local Restate worker, the same 12-event GitHub burst drained `11 → 9 → 6 → 4 → 2 → 0` in about `4556ms`, and `joelclaw queue stats --hours 1 --limit 24` isolated the fresh sample with `dispatches.started=12`, `dispatches.completed=12`, `dispatches.failed=0`, `queueLatencyMs.p95=3718`, and `withinTarget=true`.
- Soak hygiene correction (2026-03-07): the ugly 8-10 minute `content/updated` waits that re-poisoned the soak window were traced to an ad-hoc host Restate lifecycle, not a new throughput regression. The live worker that came back at `19:09:53Z` was a manual `nohup bun run packages/restate/src/index.ts` process (not a canonical launchd service), so its opaque SIGTERM/restart dirtied the same queue-stats window. The fix was to move the long-running host Restate worker onto repo-managed launchd assets: `infra/launchd/com.joel.restate-worker.plist` plus `scripts/restate/start.sh`. The wrapper loads shared env, prevents headless `CHANNEL=console`, forwards SIGTERM to Bun, cleans stale `:9080` listeners, and opportunistically re-registers the deployment. Story 5 soak windows that span `queue.drainer.started` during rollout are therefore dirty by definition; post-rollout evidence should start from the supervised launchd runtime.
- Anchored soak operator fix (2026-03-07): `joelclaw queue stats` now accepts `--since <iso|ms>` so operators can pin the sample lower bound to a known-clean moment instead of mixing fresh pilot traffic with a dirty pre-fix hour window. This was immediately proved on the installed CLI after fixing one real parser bug in the option handling.
- Fresh supervised burst proof (2026-03-07): using the new anchored window (`joelclaw queue stats --since 1772916656015 --limit 64`), a supervised 24-event GitHub `workflow_run.completed` burst produced `dispatches.started=24`, `dispatches.completed=24`, `dispatches.failed=0`, `queueLatencyMs.p95=2829`, `dispatchDurationMs.p95=229`, `withinTarget=true`, and live queue depth returned to `0`. That re-earns the burst-conditioned latency gate on the clean post-launchd runtime instead of the earlier dirty hour window.
- Discovery blocker root cause + fix (2026-03-07): the suspicious `QUEUE_PILOTS=discovery joelclaw discover <url>` behavior turned out not to be a lying CLI path; the queue drainer had actually stalled inside an otherwise-running launchd Restate worker. Raw Redis showed queued `content/updated` + `discovery/noted` entries piling up with no fresh `queue.dispatch.*` telemetry, and a `launchctl kickstart -k gui/$(id -u)/com.joel.restate-worker` immediately replayed the backlog back to zero. The fix was a drainer watchdog: when backlog remains but no progress occurs past `QUEUE_DRAIN_STALL_AFTER_MS`, the worker now emits `queue.drainer.stalled` and exits non-zero so launchd can recover and replay instead of letting pilot traffic wedge silently.
- Story 5 closeout proof (2026-03-07): after restarting the watchdog-enabled worker at `2026-03-07T21:23:47.592Z`, a clean anchored window proved the previously suspect discovery ingress plus the GitHub burst path on the supervised runtime: `QUEUE_PILOTS=discovery joelclaw discover https://example.com/?adr217-final-discovery=...` produced a queued `discovery/noted`, downstream `discovery-capture` completed, `X Discovery Hook` completed, and the same anchored `joelclaw queue stats --since 1772918627592 --limit 128` sample later showed `dispatches.started=15`, `dispatches.completed=15`, `dispatches.failed=0`, `queueLatencyMs.p95=1584`, `dispatchDurationMs.p95=201`, `withinTarget=true`, `currentDepth.total=0`, with event families led by `github/workflow_run.completed` and including both `discovery/noted` and `discovery/captured`. The content watcher also continued to drive `system/content-sync` in that same clean window, while the subscriptions pilot remained backed by its earlier dedicated live queue proof. That closes the no-silent-drops gate honestly enough to call Story 5 done.
- Registry rule clarified by bug fix: queue drainer `handler.target` values for `type: "inngest"` must be concrete Inngest event names, not function ids. Using `subscription/check-feeds` or `github/workflow-run-completed` there is wrong because the drainer posts events.
Phase 2: Haiku triage
Execution PRD: ADR-0217 Phase 2 Triage Execution Plan
Wire Haiku into the ingest path. Every event triaged before it hits the sorted set.
- Triage prompt: system context + event payload → priority score + handler confirmation
- Dedup, bounded route confirmation/mismatch signal, OTEL with reasoning
- Operator surface reports suggested vs applied decisions, fallback reasons, and latency
- Fallback: Haiku unavailable → static priority from registry
- Live Story 5 closeout proof (2026-03-07): with `QUEUE_TRIAGE_MODE=shadow`, `QUEUE_TRIAGE_FAMILIES=discovery,github`, and `QUEUE_TRIAGE_ENFORCE_FAMILIES=discovery,github` enabled in `~/.config/system-bus.env`, the anchored window `joelclaw queue stats --since 1772926820195` showed `dispatches.started=9`, `dispatches.completed=9`, `dispatches.failed=0`, `queueLatencyMs.p95=1768`, triage `attempts=7`, `fallbacks=1`, `fallbackByReason=[timeout:1]`, and `appliedChanges=3` on the GitHub burst path. The deliberate 1-second timeout drill still returned a queued `discovery/noted` with `fallbackReason=timeout`, and `joelclaw webhook stream whs_mmgys95ob3rremex --timeout 10 --replay 10` replayed all three signed GitHub burst events downstream.
- Honest caveat (2026-03-07): global `joelclaw queue depth` initially remained `22`, but `joelclaw queue list --limit 30` showed the residue was unrelated older `content/updated` + pre-soak discovery backlog. The fresh Story 5 `discovery/noted` stream ids were already missing from `joelclaw queue inspect`, so the enforced-family sample drained even though the overall queue was not pristine.
- Backlog cleanup closeout (2026-03-07): the supervised `com.joel.restate-worker` restart did not reap those 22 stale never-claimed entries, so a bounded manual cleanup used `@joelclaw/queue ack()` to remove only orphaned `content/updated` + `discovery/noted` stream ids older than 1 hour after confirming zero pending leases. `joelclaw queue depth` then returned to `0` and `joelclaw queue list` went empty, so the Story 5 caveat is now operationally closed.
- Done when: done on 2026-03-07 for the Phase 2 scope — events now get contextual priority scoring with bounded fallback/apply behavior visible in OTEL + CLI, and the two earned pilot families have live enforced mode without blocking enqueue.
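The fallback rule in this phase — triage is advisory, never load-bearing — can be sketched as a bounded wrapper: the triage call gets a hard deadline, and any failure or timeout degrades to the registry's static priority instead of blocking enqueue. A hedged sketch; `triageWithFallback` and its field names are illustrative, not the shipped `infer()` surface:

```typescript
// Illustrative fallback wrapper: Haiku triage is advisory, never load-bearing.
type Priority = "urgent" | "normal" | "background";

interface TriageResult {
  priority: Priority;
  source: "haiku" | "fallback";
  fallbackReason?: string;
}

async function triageWithFallback(
  callHaiku: () => Promise<Priority>, // e.g. a thin wrapper over infer()
  staticPriority: Priority,           // static priority from the registry
  timeoutMs: number,
): Promise<TriageResult> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("timeout")), timeoutMs);
  });
  try {
    const priority = await Promise.race([callHaiku(), timeout]);
    return { priority, source: "haiku" };
  } catch (err) {
    // Model offline, slow, or erroring → dumb queue, still correct.
    return {
      priority: staticPriority,
      source: "fallback",
      fallbackReason: err instanceof Error ? err.message : "unknown",
    };
  } finally {
    clearTimeout(timer); // never leave a dangling rejection behind
  }
}
```

The `source` and `fallbackReason` fields are what make the OTEL/CLI reporting of suggested-vs-applied decisions possible: every enqueue records whether intelligence actually participated.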
Phase 3: Sonnet observation
Execution PRD: ADR-0217 Phase 3 Sonnet Observation PRD
Wire Sonnet into the host-worker queue observation layer as a periodic observer with gateway-aware reporting.
- Reads queue depth, Restate load, OTEL error rates, time/sleep state, and active deterministic pauses
- Actions: reprioritize, pause, resume, batch, shed, escalate
- Telegram reporting in natural language
- Current Story 1 proof (2026-03-07): monorepo commit `c4a9dc8172091d6171f83f46218e076ec8d05b3b` added canonical queue-observation contracts to `@joelclaw/queue`, `packages/system-bus/src/lib/queue-observe.ts` now owns the bounded Sonnet prompt/schema/fallback helpers plus `queue.observe.started|completed|failed|fallback` and `queue.control.applied|expired|rejected`, and focused tests/docs now pin the observation layer as contract-only work before any deterministic control plane exists.
- Current Story 2 proof (2026-03-07): monorepo commit `c3a31458ddfeac61c0f4b5a4ea8230bb1176472d` added the installed CLI dry-run operator surface `joelclaw queue observe`; an anchored live proof window (`since=1772930468718`) returned `decision.suggestedActions=[noop]`, `appliedCount=0`, `history.attempts=1`, `history.completed=1`, `history.fallbacks=0`, and `control.available=false`, while the latest raw `queue-observe` OTEL event was `queue.observe.completed`.
- Current Story 3 proof (2026-03-08): monorepo commit `7d50d389d0d6e0ff0ec6003611037ee2462a5d2c` shipped deterministic queue-control state plus manual CLI apply/inspect surfaces. The first live proof attempts exposed a real operator-truth bug: queue commands were honoring an ambient shell `REDIS_URL` and writing pause state to the wrong Redis while the worker/drainer still used localhost from `~/.config/system-bus.env`. After wiring `redisUrl` into the canonical CLI config and rebuilding the installed binary, the anchored pause/resume window (`since=1772932667032`) proved `content/updated` stream `1772932667379-0` stayed queued for 5 seconds while paused and then drained after manual resume, and the TTL window (`since=1772932701625`) proved `queue.control.expired` from both `joelclaw queue control status` and raw `queue-control` OTEL.
- Current Story 4 code truth (2026-03-08): monorepo commits `d637c12dc7149e30d0547eb55886371bd1ad332e`, `353b81d1aa6f04d52c6c596b70d796525d0ac3a2`, `9741bc1f8cd72735f16d39482af81a5f8505b046`, and `f59ac37d` now cover the full bounded observer path: host-worker `queue/observer`, `QUEUE_OBSERVER_*` flags, active-pause-aware snapshots, tolerant summary trimming, the hardened prompt/parser contract, and the settled-observer-pause health fix that deterministically emits `resume_family` instead of reading an intentional hold as permanent downstream failure.
- Current Story 4 live truth (2026-03-08): dry-run is earned on host, bounded content backlog probes now complete with real `queue/observer` outputs, and manual `queue/observer.requested` probes no longer wait behind the cron pass because they run through the separate read-only `queue/observer-requested` singleton path. Empty snapshots — even when an old manual pause is still active — now short-circuit to deterministic `noop` instead of burning a full Sonnet call, and idle empty snapshots with no recent failures now report `downstreamState=healthy` instead of a noisy inherited degraded label. The first supervised enforce canary anchored at `since=1772981290859` temporarily booted Restate out, seeded 30 queued `content/updated` events, and let the cron observer see a real down/backlogged snapshot. On snapshot `cca656f7-a9ce-4ca2-9f6d-0ed332f56a4d`, the observer auto-applied `pause_family`, emitted `queue.control.applied`, queued the Telegram operator report, and the remaining backlog drained back to depth `0` after a supervised manual resume. The patched follow-up canary anchored at `since=1772985057594` then proved the automatic resume leg too: it paused on snapshot `1cb24e7b-f0cd-4e0c-ae5d-27cb4934b49a`, auto-resumed on snapshot `151aa03a-fced-41f0-9a54-2f3d1a70856d` / run `01KK72HD0EMT3T34K8QP3SMEW9`, and drained the held content item back to queue depth `0` without manual cleanup. The steady-state worker was then rolled back to `QUEUE_OBSERVER_MODE=dry-run`, so enforce remains a deliberate drill rather than the default runtime posture.
- Current Story 5 soak truth (2026-03-08): yes, the soak has been running for a while. `joelclaw queue observe --hours 12` found `947` raw `queue.observe.*` OTEL documents (CLI sample truncated to the latest `200` docs), with the sampled history showing `attempts=101`, `completed=99`, `failed=0`, `fallbacks=0`, and `successRate=1`. `joelclaw queue stats --hours 12` showed `dispatches.started=66`, `dispatches.completed=66`, `dispatches.failed=0`, and `currentDepth.total=0`. The explicit off-mode sanity window is now also earned: with `QUEUE_OBSERVER_MODE=off` anchored at `since=1772984026035`, three queued `content/updated` canary events drained cleanly back to depth `0`, `queue stats` showed `dispatches.started=3` / `completed=3` / `failed=0` with `withinTarget=true`, `queue control` stayed empty, and the latest cron `queue/observer` run (`01KK71AYHGWNMCR3J5Z20WN614`) returned `{ status: "disabled", mode: "off" }`. The automatic resume leg is now earned too: the supervised follow-up canary anchored at `since=1772985057594` produced a full automatic `pause_family` → `resume_family` cycle on `content/updated` with queue depth returning to `0`.
- Human sanity pass: Joel reviewed the earned evidence and replied `ok ready to proceed`, which closes the last Story 5 gate and makes the Phase 4 handoff explicit.
- Current blocker: none for Story 5. The queue observer pilot has now earned the runtime proofs plus operator sign-off.
- Done when: Story 5 is done — and now is done — once automatic pause/resume, off-mode sanity, and Joel’s operator sanity pass all land on the same truthful timeline
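The containment posture in this phase — the observer may pause, resume, or reprioritize, but registry edits are not expressible — can be sketched as a deterministic allow-list guard in front of the control plane. Action names echo the `pause_family`/`resume_family` vocabulary above; the exact schema is illustrative, not the shipped contract:

```typescript
// Illustrative guard: every Sonnet suggestion passes through a deterministic
// allow-list before anything touches Redis. Registry edits are simply not
// expressible as control actions.
type Priority = "urgent" | "normal" | "background";

type ControlAction =
  | { kind: "pause_family"; family: string; ttlSec: number }
  | { kind: "resume_family"; family: string }
  | { kind: "reprioritize"; family: string; priority: Priority }
  | { kind: "noop" };

function isControlAction(a: unknown): a is ControlAction {
  if (typeof a !== "object" || a === null) return false;
  const act = a as Record<string, unknown>;
  switch (act.kind) {
    case "noop":
      return true;
    case "resume_family":
      return typeof act.family === "string";
    case "pause_family":
      // Pauses must carry a TTL so a bad model decision self-expires.
      return typeof act.family === "string" && typeof act.ttlSec === "number";
    case "reprioritize":
      return (
        typeof act.family === "string" &&
        ["urgent", "normal", "background"].includes(act.priority as string)
      );
    default:
      return false; // hallucinated kinds ("edit_registry", …) land here
  }
}

function validateActions(raw: unknown[]): { applied: ControlAction[]; rejected: unknown[] } {
  const applied: ControlAction[] = [];
  const rejected: unknown[] = [];
  for (const a of raw) {
    if (isControlAction(a)) applied.push(a);
    else rejected.push(a); // surfaces as queue.control.rejected
  }
  return { applied, rejected };
}
```

Rejected actions are telemetry, not errors: the deterministic core keeps draining either way, which is the same "advisory intelligence" posture the triage layer uses.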
Phase 4: Agent-first workload ergonomics
Make coding and repo work legible from the operator surface instead of forcing agents to reason about the runtime substrate first.
- Ship one obvious front door for workload planning and dispatch (`plan`, `run`, `status`, `explain`, `cancel`)
- Capture Joel steering as a stable workload-planning contract
- Make `serial`, `parallel`, and `chained` execution first-class coding/repo workload shapes
- Keep runtime selection behind the boundary: inline, durable, sandboxed, looped, or blocked should be planner decisions, not caller archaeology
- Narrow substrate skills like `restate-workflows` back to substrate work and give coding agents a proper front-door skill
- Current Story 1 truth (2026-03-08): `docs/workloads.md` now defines the first canonical workload vocabulary plus request / plan / handoff schema, and `skills/agent-workloads/` is the front-door skill for using that contract.
- Current Story 2 truth (2026-03-08): monorepo commit `507ce5703faa90060b98bbc33ba8683881f81e97` ships planner-only `joelclaw workload plan`. It emits the canonical `request` + `plan` envelope, infers `kind/shape/mode/backend` when the caller leaves them open, and keeps `run|status|explain|cancel` explicitly unshipped. Dogfooding immediately earned follow-up heuristic fixes: mixed implementation+docs intents no longer collapse into `repo.docs`, sandbox comparisons no longer force sandbox mode unless isolation is actually requested, `extend ... verify ... then update README` now stays implementation-shaped, and nouns like `published skills` no longer imply `deploy-allowed`.
- Current Story 3 truth (2026-03-08): monorepo commits `ffdaa6e285dc35e2e14249d2c5f8f5c50a9a23c1` and `45a246f0` now ship the first real planning + dispatch ergonomics slice inside `joelclaw workload`: presets, `--paths-from status|head|recent:<n>`, `--write-plan` JSON artifacts, prompt-aware acceptance preservation from `Acceptance:`, chained milestone decomposition from `Goal:`, optional `reflect and update plan` stages, less trigger-happy durable routing for supervised repo work using `proof=canary|soak`, and stage-specific `workload dispatch` guidance that can say when dispatch is overkill. The guidance contract now includes an explicit `executionLoop` (plan -> approve -> execute/watch -> summarize) so bounded local slices steer toward reserve -> execute -> verify -> commit -> ask whether to push instead of drifting into dispatch/queue theatre. The harness-fix dogfood prompt still stays `inline/host`, keeps the provided acceptance criteria, and emits seven explicit stages instead of generic sludge; a rerun against active gremlin cleanup work preserves six concrete cleanup milestones plus a reusable plan artifact seeded from recent git history, and dispatch over the bounded repo-honesty slice now truthfully reports `dispatch-is-overkill-keep-it-inline`.
- Current Story 4 truth (2026-03-08): `skills/agent-workloads/SKILL.md`, `docs/workloads.md`, `docs/cli.md`, and `docs/skills.md` now agree on the front-door posture: agents start with workload shape, read the returned guidance, ask `approved?`, and then follow the recommended execution loop. `restate-workflows` stays substrate-adjacent instead of pretending runtime docs are the front door for ordinary repo work.
- Current Story 5 truth (2026-03-08): real dogfood is now earned for bounded serial/chained repo work: the harness-fix prompt and the gremlin repo-honesty cleanup both produce path-scoped inline plans, and the latter now carries truthful dispatch guidance instead of encouraging queue theatre. Parallel remains documented and example-backed, but still needs one real repo proof before Story 5 can be called done.
- Immediate next-step decision: keep dogfooding real serial / parallel / chained repo work and tighten handoff follow-through; a standalone `workload status` surface is still theatre until there is more real scheduled execution to observe.
- Done when: another agent can take a repo/coding task, choose the right workload shape, and hand it off safely without needing to understand Restate internals, queue families, or sandbox backend trivia.
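For orientation only, the request/plan boundary described in this phase might look roughly like the following types. Field names beyond `kind`, `shape`, `mode`, and `backend` are guesses for illustration; `docs/workloads.md` is the canonical contract:

```typescript
// Illustrative shape of the planner boundary: callers describe intent,
// the planner chooses runtime details behind the boundary.
interface WorkloadRequest {
  intent: string;                            // natural-language task description
  shape?: "serial" | "parallel" | "chained"; // caller may leave this open
  paths?: string[];                          // repo scope, if known
}

interface WorkloadPlan {
  kind: string;                              // e.g. "repo.implementation" (hypothetical value)
  shape: "serial" | "parallel" | "chained";
  mode: "inline" | "durable" | "sandboxed";  // planner decision, not caller input
  backend: "host" | "k8s";
  stages: string[];                          // ordered milestones
}

// A planner fills in whatever the request leaves open; defaults shown
// here are illustrative, not the shipped inference heuristics.
function planDefaults(req: WorkloadRequest): Pick<WorkloadPlan, "shape" | "mode"> {
  return { shape: req.shape ?? "serial", mode: "inline" };
}
```

The point of the split is the "caller archaeology" bullet above: nothing in `WorkloadRequest` names Restate, queue families, or sandbox backends.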
Phase 5: Full Inngest migration
Once the workload front door is agent-first and stable, migrate the remaining runtime ownership behind that interface and decommission Inngest as a consequence rather than the product story.
- Functions → Restate services. Crons → Dkron. Events → deterministic core.
- Done when: `kubectl delete statefulset inngest-0 -n joelclaw` and nothing breaks
Phase 6: Opus orchestration
Deep reasoning for anomaly diagnosis and complex queue decisions.
- Sonnet escalates to Opus on sustained failures or novel patterns
- Opus proposes fixes, registry changes, new handlers
- Done when: Opus diagnoses a real production issue without human prompting
Consequences
What changes
- joelclaw owns event routing end-to-end. No external dependency for core workflow behavior.
- Queue decisions are intelligent and adaptive, not static config that rots.
- The system communicates its queue reasoning in natural language.
- One fewer k8s pod (Inngest), one fewer failure mode.
What’s harder
- We own every bug in the queue core (mitigated: ~150 lines of Redis primitives).
- Three-layer mental model vs Inngest’s monolith (mitigated: each layer is simpler than Inngest alone).
- Model latency in triage path (mitigated: 100ms at 500 events/day is nothing).
Risks
- Redis drops queued events → Restate journal is durability layer. Queue loss = re-emit, not data loss.
- Model hallucinates routing → Can adjust priority/pause but cannot modify registry or bypass the deterministic core.
- Gateway outage → Deterministic drainer runs independently. Dkron bypasses queue for critical crons.