No-Op Inference Circuit Breakers
- Status: proposed
- Date: 2026-03-02
- Deciders: Joel, Panda
- Relates to: ADR-0140, ADR-0146, ADR-0190
Context
Multiple inference paths repeatedly execute even when outcomes are unusable (null output, empty rewrite, parse failure). This creates latency and spend without yield.
Current behavior is fail-open with retries/fallback, but without a memory of repeated no-op outcomes at the action level.
Decision
1) Introduce per-action no-op circuit state
For each (component, action) pair, maintain circuit state in Redis:
- closed: normal operation
- open: skip expensive inference path and use deterministic fallback
- half-open: probe with limited samples to test recovery
State key format:
- inference:circuit:<component>:<action>:state
- inference:circuit:<component>:<action>:stats
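As a minimal sketch of the key format above, the helper below builds both keys for one (component, action) pair. The function name circuitKeys and the example action name are illustrative assumptions, not part of the shipped module.

```typescript
// The three circuit states named in this ADR.
type CircuitState = "closed" | "open" | "half-open";

interface CircuitKeys {
  state: string; // holds the current CircuitState
  stats: string; // holds failure counters and timestamps
}

// Hypothetical helper: builds the two Redis keys for one (component, action) pair.
function circuitKeys(component: string, action: string): CircuitKeys {
  const base = `inference:circuit:${component}:${action}`;
  return { state: `${base}:state`, stats: `${base}:stats` };
}

// "classify" is a made-up action name used only for illustration.
const keys = circuitKeys("task-triage", "classify");
console.log(keys.state); // inference:circuit:task-triage:classify:state
```

Component or action names that themselves contain a colon would make these keys ambiguous, so a real implementation would likely validate or escape them.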
2) Define no-op failure signatures
Count these as no-op failures:
- output is null/empty after normalization
- required JSON parse fails
- zero completion where contract requires semantic output
- explicit rewrite-empty conditions (inference_rewrite_empty)
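The signatures above could be checked by a single classifier that runs on every inference outcome. This is a sketch under assumptions: the InferenceOutcome shape, the classifyNoop name, and the reason strings other than inference_rewrite_empty are hypothetical.

```typescript
type NoopReason = "empty_output" | "json_parse_failed" | "inference_rewrite_empty";

// Hypothetical outcome shape; the real callsites may carry different fields.
interface InferenceOutcome {
  text: string | null | undefined; // raw model output
  requiresJson: boolean;           // true when the action contract needs parseable JSON
  errorCode?: string;              // explicit error code reported by the callsite, if any
}

// Returns the no-op reason, or null when the outcome is usable.
function classifyNoop(outcome: InferenceOutcome): NoopReason | null {
  if (outcome.errorCode === "inference_rewrite_empty") return "inference_rewrite_empty";

  // Normalization here is just trimming whitespace.
  const normalized = (outcome.text ?? "").trim();
  if (normalized.length === 0) return "empty_output";

  if (outcome.requiresJson) {
    try {
      JSON.parse(normalized);
    } catch {
      return "json_parse_failed";
    }
  }
  return null;
}
```

Keeping classification in one function means the circuit counts the same conditions at every callsite, rather than each caller deciding what a no-op is.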
3) Add configurable thresholds
Defaults (env-overridable):
- JOELCLAW_INFER_NOOP_THRESHOLD=3
- JOELCLAW_INFER_NOOP_WINDOW_MS=900000 (15m)
- JOELCLAW_INFER_NOOP_COOLDOWN_MS=1800000 (30m)
- JOELCLAW_INFER_HALF_OPEN_PROBES=1
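One way to load these defaults with env overrides is sketched below. The CircuitConfig shape and the function names are assumptions; the env variable names and default values are the ones listed above. The loader takes an env-like map so it can be tested without touching the real process environment.

```typescript
interface CircuitConfig {
  threshold: number;
  windowMs: number;
  cooldownMs: number;
  halfOpenProbes: number;
}

type Env = Record<string, string | undefined>;

// Anything missing or not parseable as a positive integer falls back to the default.
function readPositiveInt(env: Env, name: string, fallback: number): number {
  const parsed = Number.parseInt(env[name] ?? "", 10);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}

function loadCircuitConfig(env: Env): CircuitConfig {
  return {
    threshold: readPositiveInt(env, "JOELCLAW_INFER_NOOP_THRESHOLD", 3),
    windowMs: readPositiveInt(env, "JOELCLAW_INFER_NOOP_WINDOW_MS", 900000),
    cooldownMs: readPositiveInt(env, "JOELCLAW_INFER_NOOP_COOLDOWN_MS", 1800000),
    halfOpenProbes: readPositiveInt(env, "JOELCLAW_INFER_HALF_OPEN_PROBES", 1),
  };
}

// Example: override only the threshold and keep the other defaults.
const config = loadCircuitConfig({ JOELCLAW_INFER_NOOP_THRESHOLD: "5" });
console.log(config.threshold, config.windowMs); // 5 900000
```

In the worker this would presumably be called once at startup with process.env.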
Behavior:
- if no-op failures reach threshold inside window → open circuit,
- while open, skip expensive call path and return deterministic fallback,
- after cooldown, move to half-open and allow probe calls,
- successful probes close circuit; failed probes reopen immediately.
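The behavior rules above amount to a small state machine. The following is an in-memory sketch of it, not the shipped module: the class and method names are illustrative, and the caller passes in the current time so transitions are easy to test.

```typescript
type CircuitState = "closed" | "open" | "half-open";

interface CircuitConfig {
  threshold: number;
  windowMs: number;
  cooldownMs: number;
  halfOpenProbes: number;
}

const DEFAULTS: CircuitConfig = { threshold: 3, windowMs: 900000, cooldownMs: 1800000, halfOpenProbes: 1 };

class Circuit {
  state: CircuitState = "closed";
  private cfg: CircuitConfig;
  private failures: number[] = []; // timestamps of no-op failures inside the window
  private openedAt = 0;
  private probesLeft = 0;

  constructor(cfg: CircuitConfig = DEFAULTS) {
    this.cfg = cfg;
  }

  // Returns true if the expensive inference call may proceed at time `now` (ms).
  allow(now: number): boolean {
    if (this.state === "open" && now - this.openedAt >= this.cfg.cooldownMs) {
      this.state = "half-open";
      this.probesLeft = this.cfg.halfOpenProbes;
    }
    if (this.state === "closed") return true;
    if (this.state === "half-open" && this.probesLeft > 0) {
      this.probesLeft -= 1;
      return true;
    }
    return false;
  }

  // A usable result closes the circuit and clears the failure history.
  recordSuccess(): void {
    this.state = "closed";
    this.failures = [];
  }

  recordFailure(now: number): void {
    if (this.state === "half-open") {
      this.open(now); // failed probe reopens immediately
      return;
    }
    // Drop failures that have aged out of the window, then count this one.
    this.failures = this.failures.filter((t) => now - t < this.cfg.windowMs);
    this.failures.push(now);
    if (this.failures.length >= this.cfg.threshold) this.open(now);
  }

  private open(now: number): void {
    this.state = "open";
    this.openedAt = now;
    this.failures = [];
  }
}
```

With the defaults, three failures inside 15 minutes open the circuit, calls are skipped for 30 minutes, and then a single probe decides whether it closes or reopens. A probe that is allowed but never reports an outcome would leave this sketch stuck in half-open, so a real implementation probably needs a probe timeout.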
4) Enforce deterministic degraded behavior
When circuit is open, each action must provide one explicit fallback behavior, for example:
- use raw query without rewrite,
- return noop + actionable reason,
- use heuristic classifier,
- queue human-review event if no safe fallback exists.
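One possible way to make "one explicit fallback per action" enforceable is a registry keyed by (component, action), with a human-review result when nothing is registered. This is only a sketch: the registry, the result shapes, and the component and action names are all hypothetical.

```typescript
type FallbackResult =
  | { kind: "raw_query"; query: string }
  | { kind: "noop"; reason: string }
  | { kind: "human_review"; reason: string };

type Fallback = (input: string) => FallbackResult;

// Hypothetical registrations; real component/action names will differ.
const fallbacks = new Map<string, Fallback>([
  // Use the raw query without rewrite.
  ["search:rewrite-query", (input) => ({ kind: "raw_query", query: input })],
  // Return noop with an actionable reason.
  ["triage:classify", () => ({ kind: "noop", reason: "inference circuit open; retry after cooldown" })],
]);

// Returns the registered fallback result, or a human-review result if none exists.
function degradedResult(component: string, action: string, input: string): FallbackResult {
  const fallback = fallbacks.get(`${component}:${action}`);
  if (fallback) return fallback(input);
  return { kind: "human_review", reason: `no safe fallback registered for ${component}:${action}` };
}

console.log(degradedResult("search", "rewrite-query", "redis ttl").kind); // raw_query
```

The same guarantee can also be met by each caller handling the open circuit itself; the registry only makes a missing fallback visible in one place.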
5) Emit high-signal observability events
Required events:
- inference.circuit.opened
- inference.circuit.half_open
- inference.circuit.closed
- inference.circuit.skipped_call
Each event must include: component, action, reason, failureCounts, windowMs, cooldownMs.
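A typed payload plus a small validator is one way to keep the required fields from drifting. In this sketch the event names and field names come from the list above, while the shape of failureCounts (a count per no-op signature) and the missingFields helper are assumptions.

```typescript
type CircuitEventName =
  | "inference.circuit.opened"
  | "inference.circuit.half_open"
  | "inference.circuit.closed"
  | "inference.circuit.skipped_call";

type CircuitEvent = {
  name: CircuitEventName;
  component: string;
  action: string;
  reason: string;
  failureCounts: Record<string, number>; // assumed shape: count per no-op signature
  windowMs: number;
  cooldownMs: number;
};

const REQUIRED_FIELDS = ["component", "action", "reason", "failureCounts", "windowMs", "cooldownMs"] as const;

// Returns the required field names that are absent from a loosely typed payload.
function missingFields(payload: Record<string, unknown>): string[] {
  return REQUIRED_FIELDS.filter((field) => payload[field] === undefined);
}

// Illustrative values only.
const opened: CircuitEvent = {
  name: "inference.circuit.opened",
  component: "task-triage",
  action: "classify",
  reason: "noop threshold reached",
  failureCounts: { inference_rewrite_empty: 3 },
  windowMs: 900000,
  cooldownMs: 1800000,
};

console.log(missingFields(opened)); // []
```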
6) Keep control-plane alerts deduped
Gateway/operator alerting only on state transitions (closed→open, open→closed).
No spam on every skipped call.
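The dedupe rule can be sketched as a tracker that remembers the last state an operator was told about. It rests on an assumption about intent: half-open is treated as still open for alerting purposes, so a failed probe that reopens the circuit does not produce a second alert. The class name and message format are illustrative.

```typescript
type CircuitState = "closed" | "open" | "half-open";

class TransitionAlerter {
  // Last state that was alerted per circuit key; absent means "closed".
  private lastAlerted = new Map<string, "closed" | "open">();

  // Returns the alert text to send, or null when this state change should stay quiet.
  onState(key: string, state: CircuitState): string | null {
    if (state === "half-open") return null; // probing is not operator-facing
    const previous = this.lastAlerted.get(key) ?? "closed";
    if (previous === state) return null;
    this.lastAlerted.set(key, state);
    return `${key}: ${previous} -> ${state}`;
  }
}

const alerter = new TransitionAlerter();
console.log(alerter.onState("task-triage:classify", "open"));      // task-triage:classify: closed -> open
console.log(alerter.onState("task-triage:classify", "half-open")); // null
console.log(alerter.onState("task-triage:classify", "open"));      // null
console.log(alerter.onState("task-triage:classify", "closed"));    // task-triage:classify: open -> closed
```

Skipped calls never reach this tracker at all; they are recorded as events only.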
Non-goals
- replacing inference-router model policy,
- hardcoding provider-specific behavior,
- permanently disabling actions without recovery path.
Consequences
Good
- repeated no-op loops become bounded,
- degraded paths stay available via deterministic fallbacks,
- operators get concise state transitions instead of noise.
Tradeoffs
- extra Redis/state complexity,
- threshold tuning required per action class.
Required Skills (Preflight)
- system-bus: inference callsites and action metadata
- inngest-flow-control: cooldown/open/half-open transitions
- inngest-durable-functions: durable state transition semantics
- langfuse: no-op signal validation and outcome tracing
- system-architecture: ensure gate state aligns with worker runtime model
Implementation Plan (vector clock)
- V1: add reusable circuit module in packages/system-bus/src/lib/ (state + transitions + helpers).
- V2: integrate circuit checks into packages/system-bus/src/lib/inference.ts by (component, action).
- V3: wire deterministic fallback responses and skipped_call metadata contract.
- V4: emit circuit lifecycle OTEL + Langfuse metadata.
- V5: add targeted tests for open/half-open/close and deduped alert behavior.
Verification Checklist
- repeated no-op outcomes for one action open a circuit within configured threshold/window
- open circuits skip expensive inference and return deterministic fallback
- half-open probe behavior closes circuit on success and reopens on failure
- transition events are emitted exactly once per state transition
- no-op churn for affected actions drops after rollout (monitor via joelclaw otel search "inference.circuit" --hours 24)
Implementation Progress
V1 (circuit module) — shipped 2026-03-04
Created packages/system-bus/src/lib/inference-circuit.ts. In-memory per-(component, action) circuit breaker for the long-running system-bus worker.
- Threshold: 3 consecutive no-op failures (configurable via JOELCLAW_INFER_NOOP_THRESHOLD)
- Window: 15min failure accumulation window (JOELCLAW_INFER_NOOP_WINDOW_MS)
- Cooldown: 30min before half-open probe (JOELCLAW_INFER_NOOP_COOLDOWN_MS)
- No-op signatures: empty/null output, JSON parse failure, inference_rewrite_empty, inference_text_output_empty, inference_json_parse_empty
- OTEL transitions: inference.circuit.opened, inference.circuit.half_open, inference.circuit.closed (transitions only, not every skip, to avoid spam)
V2 (infer() integration) — shipped 2026-03-04
Wired into packages/system-bus/src/lib/inference.ts:
- checkCircuit() before each runPiAttempt: open circuit skips pi spawn
- recordSuccess() on successful inference return
- recordFailure() on no-op failure signatures
- circuitState added to OTEL result metadata
- inference.circuit.skipped_call event when circuit blocks an attempt
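The wiring described above could look roughly like the wrapper below. The function names checkCircuit, recordSuccess, and recordFailure come from this ADR, but their signatures here are guesses, and CircuitHandle, CircuitOpenError, and guardedAttempt are hypothetical names. The runAttempt callback stands in for runPiAttempt.

```typescript
// Hypothetical interface; the real module's signatures may differ.
interface CircuitHandle {
  checkCircuit(): "allow" | "skip";
  recordSuccess(): void;
  recordFailure(reason: string): void;
}

class CircuitOpenError extends Error {
  constructor(component: string, action: string) {
    super(`circuit open for ${component}:${action}`);
    this.name = "CircuitOpenError";
  }
}

async function guardedAttempt(
  circuit: CircuitHandle,
  component: string,
  action: string,
  runAttempt: () => Promise<string | null>,
): Promise<string> {
  if (circuit.checkCircuit() === "skip") {
    // This is the point where inference.circuit.skipped_call would be emitted.
    throw new CircuitOpenError(component, action);
  }

  // Errors thrown by the attempt itself propagate without being counted as no-ops.
  const output = await runAttempt();

  if (output === null || output.trim() === "") {
    circuit.recordFailure("empty_output");
    throw new Error(`no-op output for ${component}:${action}`);
  }

  circuit.recordSuccess();
  return output;
}
```

Throwing on a skipped call keeps the caller contract unchanged: callers that already catch inference errors take their existing degraded path without knowing a circuit was involved.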
28 (component, action) callsites across system-bus are now protected automatically. Circuits start closed, so behavior does not change until failures accumulate.
V3 (deterministic fallbacks) — implicit
Each calling function handles its own degraded behavior when inference throws. The circuit prevents the expensive pi spawn; callers already have try/catch with appropriate fallback (e.g., task-triage returns status: degraded).
V4 (OTEL lifecycle) — shipped with V1
Transition events are emitted on state changes only, not on every skip. Deduped by design.