Dogfooding Story 4: the queue observer earns dry-run, not enforce

Mar 8, 2026 · updated Mar 8, 2026


The job for ADR-0217 Story 4 was simple on paper and annoying in practice: let Sonnet observe live queue state on a cadence, keep the control plane deterministic, and prove one bounded automatic mutation on a low-blast-radius family.

The low-blast-radius family was content/updated. The point was not to ship magical queue intelligence. The point was to see whether the system could earn one supervised automatic control path without turning correctness into vibes.

Human mode

What shipped is real.

A host-side queue/observer function now runs on cron and by manual trigger. It builds a bounded snapshot from the live queue, drainer telemetry, triage telemetry, and active control state. It calls Sonnet through the shared infer() path. It only gets to talk in a very small action language.
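To make "a very small action language" concrete, here is a minimal sketch of what a closed action vocabulary could look like. Only noop and resume_family are named in this write-up; pause_family, the field names, and the parseAction helper are assumptions for illustration, not the shipped schema.

```typescript
// Hypothetical action vocabulary. Only "noop" and "resume_family" appear in
// the article; "pause_family" and every field name here are assumptions.
type ObserverAction =
  | { kind: "noop"; reason: string }
  | { kind: "pause_family"; family: string; reason: string }
  | { kind: "resume_family"; family: string; reason: string };

// Parse untrusted model output into the closed action set.
// Anything outside the vocabulary is rejected (null), never interpreted.
function parseAction(raw: unknown): ObserverAction | null {
  if (typeof raw !== "object" || raw === null) return null;
  const r = raw as Record<string, unknown>;
  const reason = typeof r.reason === "string" ? r.reason : "";

  if (r.kind === "noop") return { kind: "noop", reason };

  // Family-scoped actions must name a non-empty family.
  const family = r.family;
  if (typeof family !== "string" || family.length === 0) return null;

  if (r.kind === "pause_family") return { kind: "pause_family", family, reason };
  if (r.kind === "resume_family") return { kind: "resume_family", family, reason };
  return null;
}
```

The design choice this illustrates: the model can only pick from a fixed menu, and the deterministic side decides what each menu item means.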

That part is earned.

The observer runs. The snapshot is truthful. The CLI can explain what it saw. Active pauses show up correctly. Overlong model summaries no longer blow up the whole parse path. Manual control still works without the model in the loop.

The part that did not earn it yet is autonomous mutation.

During supervised enforce testing, the observer looked at a fresh manual pause on content/updated, a small queued backlog, and a healthy downstream. Sonnet returned the correct boring answer: noop. Leave the fresh pause alone. Let it expire naturally.

That is a good answer, but it is not proof that the system has earned automatic pause/resume behavior in production. So the host worker went back to QUEUE_OBSERVER_MODE=dry-run.

That is the whole point of dogfooding: the loop does not get a gold star for existing. It has to earn the mode.

What changed

1. Story 4 runtime path is live

The observer now exists as a host-worker Inngest function with:

cron trigger
manual queue/observer.requested trigger
bounded family scope
bounded auto-apply scope
deterministic reporting and control emission

The live flags are:

QUEUE_OBSERVER_MODE=off|dry-run|enforce
QUEUE_OBSERVER_FAMILIES=discovery,content,subscriptions,github
QUEUE_OBSERVER_AUTO_FAMILIES=content
QUEUE_OBSERVER_INTERVAL_SECONDS=60
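A rough sketch of how flags like these might be read and gated. The flag names and values come from above; the fallback to off, the 60-second default, the rule that auto families must be a subset of observed families, and the function names are all assumptions, not the worker's actual code.

```typescript
type ObserverMode = "off" | "dry-run" | "enforce";

interface ObserverConfig {
  mode: ObserverMode;
  families: string[];
  autoFamilies: string[];
  intervalSeconds: number;
}

// Split a comma-separated flag into trimmed, non-empty entries.
function splitList(value: string | undefined): string[] {
  return (value ?? "")
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Assumption: unknown or missing mode falls back to "off" (fail closed).
function parseObserverConfig(env: Record<string, string | undefined>): ObserverConfig {
  const rawMode = env.QUEUE_OBSERVER_MODE;
  const mode: ObserverMode = rawMode === "dry-run" || rawMode === "enforce" ? rawMode : "off";
  const families = splitList(env.QUEUE_OBSERVER_FAMILIES);
  // Assumption: a family can only be auto-applied if it is also observed.
  const autoFamilies = splitList(env.QUEUE_OBSERVER_AUTO_FAMILIES).filter(
    (f) => families.indexOf(f) !== -1
  );
  const interval = Number(env.QUEUE_OBSERVER_INTERVAL_SECONDS);
  const intervalSeconds = Number.isFinite(interval) && interval > 0 ? interval : 60;
  return { mode, families, autoFamilies, intervalSeconds };
}

// Automatic mutation is allowed only in enforce mode and only for auto families.
function mayAutoApply(config: ObserverConfig, family: string): boolean {
  return config.mode === "enforce" && config.autoFamilies.indexOf(family) !== -1;
}
```

Under this sketch, dry-run still observes and reports on all four families, but mayAutoApply returns false for every one of them.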

2. Hygiene got fixed before the canary

The first pass surfaced a real bug in check/system-health: nested step.sendEvent() calls inside step.run() were triggering NESTING_STEPS warnings and leaving operators with a misleading picture of stale runs.

That got fixed first.
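The general shape of that kind of fix, sketched under assumptions: do the work inside step.run(), return plain data, then call step.sendEvent() as a sibling step at the top level of the handler. The StepTools interface below is a local stand-in that mirrors only the two Inngest step methods used here; the step IDs, event name, and HealthReport shape are hypothetical and not taken from the real check/system-health function.

```typescript
// Local stand-in for the two Inngest step tools this sketch uses.
interface StepTools {
  run<T>(id: string, fn: () => Promise<T>): Promise<T>;
  sendEvent(
    id: string,
    event: { name: string; data: Record<string, unknown> }
  ): Promise<unknown>;
}

// Hypothetical report shape for illustration.
interface HealthReport {
  staleRunIds: string[];
}

async function checkSystemHealth(
  step: StepTools,
  probe: () => Promise<HealthReport>
): Promise<HealthReport> {
  // Inside step.run: only gather data and return it. No step.* calls in here.
  const report = await step.run("probe-health", () => probe());

  // Outside step.run: emit the event as its own top-level step.
  if (report.staleRunIds.length > 0) {
    await step.sendEvent("emit-stale-runs", {
      name: "system/health.stale-runs.detected", // hypothetical event name
      data: { staleRunIds: report.staleRunIds },
    });
  }
  return report;
}
```

Keeping step.run() bodies free of other step calls also makes each step's result easier to read in the run history.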

check/o11y-triage also showed failed history around the same time, but that turned out to be old runs finalizing after worker restart turbulence, not a fresh regression from Story 4.

3. The parser needed hardening under live traffic

The first real observer run failed for a dumb reason: Sonnet produced a summary longer than the schema allowed.

That is not a reason to fall over.

So the observation parser now trims overlong summary/message/reason fields instead of treating excess length as a fatal error.
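A minimal sketch of that kind of trimming, applied before any strict validation. The field names come from the sentence above; the character limits and function names are made up for illustration, since the real schema limits are not stated here.

```typescript
// Hypothetical limits; the actual schema limits are not given in the article.
const FIELD_LIMITS = { summary: 280, message: 500, reason: 200 };

// Coerce to string and clamp to `max` characters, marking truncation with "…".
function clampText(value: unknown, max: number): string {
  const text = typeof value === "string" ? value.trim() : "";
  if (text.length <= max) return text;
  // Reserve one character for the ellipsis so the result never exceeds `max`.
  return text.slice(0, max - 1).replace(/\s+$/, "") + "…";
}

interface ObservationText {
  summary: string;
  message: string;
  reason: string;
}

// Normalize free-text fields so length alone can never fail the parse.
function normalizeObservationText(raw: Record<string, unknown>): ObservationText {
  return {
    summary: clampText(raw.summary, FIELD_LIMITS.summary),
    message: clampText(raw.message, FIELD_LIMITS.message),
    reason: clampText(raw.reason, FIELD_LIMITS.reason),
  };
}
```

Structural problems (wrong action, missing family) can still fail hard; only free-text length is treated as noise.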

4. The operator surface got more honest

The queue snapshot and CLI output now surface active deterministic pauses directly. That matters because resume_family is only a safe suggestion when a family is actually paused. Without that truth in the snapshot, the observer is guessing.
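One way to picture that check is a deterministic guard that looks at the snapshot before a resume_family suggestion is taken seriously. Everything here is a sketch: the snapshot shape, the source field, and the freshness window for manual pauses are assumptions, not the real control-plane types or rules.

```typescript
// Hypothetical snapshot shapes for illustration only.
interface ActivePause {
  family: string;
  source: "manual" | "observer";
  pausedAtMs: number;
}

interface QueueSnapshot {
  activePauses: ActivePause[];
}

type ResumeVerdict = { allowed: true } | { allowed: false; why: string };

// Decide whether a resume_family suggestion is even eligible.
// Assumption: manual pauses younger than `minManualPauseAgeMs` are left alone.
function checkResume(
  snapshot: QueueSnapshot,
  family: string,
  nowMs: number,
  minManualPauseAgeMs: number = 10 * 60 * 1000
): ResumeVerdict {
  const pause = snapshot.activePauses.filter((p) => p.family === family)[0];
  if (!pause) {
    return { allowed: false, why: "family is not paused" };
  }
  if (pause.source === "manual" && nowMs - pause.pausedAtMs < minManualPauseAgeMs) {
    return { allowed: false, why: "fresh manual pause" };
  }
  return { allowed: true };
}
```

The guard is deliberately boring: it only answers whether a resume is eligible, and leaves the decision to act to the mode and auto-apply scope.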

What the canary proved

The dry-run canary proved:

the host worker can run the observer on real state
the queue snapshot is grounded in live queue + OTEL + control data
the observer history is inspectable from the CLI
the parser survives the sloppy edges of model output better than it did on the first pass
the low-blast-radius family choice was right

The enforce canary proved something useful too, just not the triumphant thing.

It proved that a fresh manual pause with young backlog should not be overridden just because a model has opinions. The model declined to act. The safe move was rollback to dry-run.

That is a pass for discipline, not a pass for automation.

Agent mode

Scope that is actually earned

queue/observer runtime path: yes
dry-run operator review: yes
bounded content/updated auto-control in production: not yet

Safe current state

QUEUE_OBSERVER_MODE=dry-run
QUEUE_OBSERVER_FAMILIES=discovery,content,subscriptions,github
QUEUE_OBSERVER_AUTO_FAMILIES=content
QUEUE_OBSERVER_INTERVAL_SECONDS=60

Useful checks

joelclaw queue observe
joelclaw queue control status
joelclaw queue stats
joelclaw runs --count 10 --hours 1
joelclaw otel search "queue.observe OR queue.control" --hours 1

The gating lesson

Do not promote Story 4 just because queue/observer exists.

Promote it when the system shows a real automatic pause/resume cycle on content/updated, with:

no queue loss
truthful CLI evidence
truthful OTEL evidence
understandable operator reporting
a clean rollback path back to dry-run

Until then, dry-run is the earned mode.

Next move

The next useful drill is not a bigger blast radius. It is a cleaner one.

Create a deliberate content/updated degraded-state drill that actually justifies automatic action, then verify pause, report, and resume from one anchored window. If the evidence is clean, Story 4 can advance. If not, it stays where it is.