Dogfooding Story 4: the queue observer earns dry-run, not enforce
The job for ADR-0217 Story 4 was simple on paper and annoying in practice: let Sonnet observe live queue state on a cadence, keep the control plane deterministic, and prove one bounded automatic mutation on a low-blast-radius family.
The low-blast-radius family was content/updated. The point was not to ship magical queue intelligence. The point was to see whether the system could earn one supervised automatic control path without turning correctness into vibes.
Human mode
What shipped is real.
A host-side queue/observer function now runs on cron and by manual trigger. It builds a bounded snapshot from the live queue, drainer telemetry, triage telemetry, and active control state. It calls Sonnet through the shared infer() path. It only gets to talk in a very small action language.
That part is earned.
The observer runs. The snapshot is truthful. The CLI can explain what it saw. Active pauses show up correctly. Overlong model summaries no longer blow up the whole parse path. Manual control still works without the model in the loop.
The part that did not earn it yet is autonomous mutation.
During supervised enforce testing, the observer looked at a fresh manual pause on content/updated, a small queued backlog, and a healthy downstream. Sonnet returned the correct boring answer: noop. Leave the fresh pause alone. Let it expire naturally.
That is a good answer, but it is not proof that the system has earned automatic pause/resume behavior in production. So the host worker went back to QUEUE_OBSERVER_MODE=dry-run.
That is the whole point of dogfooding: the loop does not get a gold star for existing. It has to earn the mode.
What changed
1. Story 4 runtime path is live
The observer now exists as a host-worker Inngest function with:
- cron trigger
- manual queue/observer.requested trigger
- bounded family scope
- bounded auto-apply scope
- deterministic reporting and control emission
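A minimal sketch of what a small action language with bounded family scope could look like. Only noop and resume_family are named in this post; pause_family, the ObserverAction type, and parseAction are assumptions for illustration, not the shipped code.

```typescript
// Hypothetical action vocabulary. `noop` and `resume_family` appear in the post;
// `pause_family` is an assumed counterpart.
type ObserverAction =
  | { kind: "noop" }
  | { kind: "pause_family"; family: string }
  | { kind: "resume_family"; family: string };

// Validate untrusted model output against the vocabulary and the family scope.
// Anything unrecognized or out of scope collapses to `noop`.
function parseAction(raw: unknown, allowedFamilies: readonly string[]): ObserverAction {
  const noop: ObserverAction = { kind: "noop" };
  if (typeof raw !== "object" || raw === null) return noop;

  const obj = raw as Record<string, unknown>;
  const family = obj.family;
  if (typeof family !== "string" || !allowedFamilies.includes(family)) return noop;

  if (obj.kind === "pause_family") return { kind: "pause_family", family };
  if (obj.kind === "resume_family") return { kind: "resume_family", family };
  return noop;
}
```

The design choice is that the model never widens the vocabulary. A confused or creative response degrades to noop instead of becoming a new kind of mutation.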
The live flags are:
QUEUE_OBSERVER_MODE=off|dry-run|enforce
QUEUE_OBSERVER_FAMILIES=discovery,content,subscriptions,github
QUEUE_OBSERVER_AUTO_FAMILIES=content
QUEUE_OBSERVER_INTERVAL_SECONDS=60
2. Hygiene got fixed before the canary
The first pass surfaced a real bug in check/system-health: nested step.sendEvent() calls inside step.run() were triggering NESTING_STEPS warnings and leaving bad operator truth around stale runs.
That got fixed first.
check/o11y-triage also showed failed history around the same time, but that turned out to be old runs finalizing after worker restart turbulence — not a fresh regression from Story 4.
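The nested-step fix can be sketched roughly like this: gather data inside step.run(), return it as plain values, and call step.sendEvent() from the top level of the handler. StepLike is a local stand-in that only mirrors the two calls discussed here, and the step IDs and event name are hypothetical. This is not the actual check/system-health code.

```typescript
// Minimal stand-in for the step tooling an Inngest handler receives.
// It mirrors only step.run and step.sendEvent; it is not the SDK's own type.
interface StepLike {
  run<T>(id: string, fn: () => Promise<T>): Promise<T>;
  sendEvent(id: string, events: { name: string; data: unknown }[]): Promise<unknown>;
}

// Hypothetical health check. The probe runs inside step.run and returns data.
// Events are emitted afterwards, at the top level, so no step nests in another.
async function checkSystemHealth(
  step: StepLike,
  probe: () => Promise<string[]>,
): Promise<number> {
  const staleRuns = await step.run("collect-stale-runs", probe);

  if (staleRuns.length > 0) {
    await step.sendEvent(
      "emit-stale-run-events",
      // "system/health.stale-run" is an assumed event name for illustration.
      staleRuns.map((runId) => ({ name: "system/health.stale-run", data: { runId } })),
    );
  }
  return staleRuns.length;
}
```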
3. The parser needed hardening under live traffic
The first real observer run failed for a dumb reason: Sonnet produced a summary longer than the schema allowed.
That is not a reason to fall over.
So the observation parser now trims overlong summary/message/reason fields instead of treating length noise as a fatal path.
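A minimal sketch of that trimming step, assuming it runs before schema validation. The limits in FIELD_LIMITS are placeholders, since the post does not state the real maximum lengths, and trimObservation is a hypothetical name.

```typescript
// Assumed field limits. The real schema's maximum lengths are not given here.
const FIELD_LIMITS = { summary: 280, message: 500, reason: 280 } as const;

type BoundedField = keyof typeof FIELD_LIMITS;

// Clamp a string to `max` characters, marking the cut with "..." so an operator
// can tell that trimming happened.
function clamp(value: string, max: number): string {
  return value.length <= max ? value : value.slice(0, max - 3) + "...";
}

// Return a copy of the raw observation with overlong bounded fields trimmed.
// Non-string values are left alone so schema validation can still reject them.
function trimObservation(raw: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = { ...raw };
  for (const field of Object.keys(FIELD_LIMITS) as BoundedField[]) {
    const value = out[field];
    if (typeof value === "string") {
      out[field] = clamp(value, FIELD_LIMITS[field]);
    }
  }
  return out;
}
```

Trimming only touches length. A missing field or a wrong type still fails validation, which keeps real parse errors loud.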
4. The operator surface got more honest
The queue snapshot and CLI output now surface active deterministic pauses directly. That matters because resume_family is only a safe suggestion when a family is actually paused. Without that truth in the snapshot, the observer is guessing.
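One way to picture that guard, as a sketch: a resume_family suggestion is checked against the pauses in the snapshot and downgraded to noop when the family is not paused. The QueueSnapshot shape and the pausedFamilies field are assumptions for illustration.

```typescript
// Hypothetical snapshot slice: families with an active deterministic pause.
interface QueueSnapshot {
  pausedFamilies: readonly string[];
}

type Suggestion = { kind: "noop" } | { kind: "resume_family"; family: string };

// Keep a resume suggestion only if the snapshot shows the family is paused.
// Otherwise there is nothing to resume, so the safe result is noop.
function checkResume(snapshot: QueueSnapshot, suggestion: Suggestion): Suggestion {
  if (suggestion.kind !== "resume_family") return suggestion;
  return snapshot.pausedFamilies.includes(suggestion.family)
    ? suggestion
    : { kind: "noop" };
}
```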
What the canary proved
The dry-run canary proved:
- the host worker can run the observer on real state
- the queue snapshot is grounded in live queue + OTEL + control data
- the observer history is inspectable from the CLI
- the parser survives the sloppy edges of model output better than it did on the first pass
- the low-blast-radius family choice was right
The enforce canary proved something useful too, just not the triumphant thing.
It proved that a fresh manual pause with young backlog should not be overridden just because a model has opinions. The model declined to act. The safe move was rollback to dry-run.
That is a pass for discipline, not a pass for automation.
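The discipline above can also be held deterministically, outside the model. A rough sketch, assuming a gate that sits between the model's proposal and any control emission: manualPauseGraceSeconds, the source field, and the gate function are hypothetical, and only the mode and auto-family flags come from this post.

```typescript
type Mode = "off" | "dry-run" | "enforce";

interface ActivePause {
  family: string;
  source: "manual" | "observer"; // assumed provenance field
  ageSeconds: number;
}

interface GateConfig {
  mode: Mode; // QUEUE_OBSERVER_MODE
  autoFamilies: readonly string[]; // QUEUE_OBSERVER_AUTO_FAMILIES
  manualPauseGraceSeconds: number; // hypothetical threshold, not a real flag
}

type Proposal = { kind: "pause_family" | "resume_family"; family: string };
type Verdict = { apply: boolean; why: string };

// Decide whether a proposed mutation may be applied or only reported.
function gate(cfg: GateConfig, proposal: Proposal, pauses: readonly ActivePause[]): Verdict {
  if (cfg.mode !== "enforce") {
    return { apply: false, why: `mode is ${cfg.mode}` };
  }
  if (!cfg.autoFamilies.includes(proposal.family)) {
    return { apply: false, why: "family is outside the auto-apply scope" };
  }
  const freshManualPause = pauses.some(
    (p) =>
      p.family === proposal.family &&
      p.source === "manual" &&
      p.ageSeconds < cfg.manualPauseGraceSeconds,
  );
  if (proposal.kind === "resume_family" && freshManualPause) {
    return { apply: false, why: "fresh manual pause takes precedence" };
  }
  return { apply: true, why: "within scope" };
}
```

With a gate like this, a model that wrongly proposed a resume during the canary would still have been reported, not applied.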
Agent mode
Scope that is actually earned
- queue/observer runtime path: yes
- dry-run operator review: yes
- bounded content/updated auto-control in production: not yet
Safe current state
QUEUE_OBSERVER_MODE=dry-run
QUEUE_OBSERVER_FAMILIES=discovery,content,subscriptions,github
QUEUE_OBSERVER_AUTO_FAMILIES=content
QUEUE_OBSERVER_INTERVAL_SECONDS=60
Useful checks
joelclaw queue observe
joelclaw queue control status
joelclaw queue stats
joelclaw runs --count 10 --hours 1
joelclaw otel search "queue.observe OR queue.control" --hours 1
The gating lesson
Do not promote Story 4 just because queue/observer exists.
Promote it when the system shows a real automatic pause/resume cycle on content/updated, with:
- no queue loss
- truthful CLI evidence
- truthful OTEL evidence
- understandable operator reporting
- a clean rollback path back to dry-run
Until then, dry-run is the earned mode.
Next move
The next useful drill is not a bigger blast radius. It is a cleaner one.
Create a deliberate content/updated degraded-state drill that actually justifies automatic action, then verify pause, report, and resume from one anchored window. If the evidence is clean, Story 4 can advance. If not, it stays where it is.