ADR-0191 · accepted

No-Op Inference Circuit Breakers

  • Status: proposed
  • Date: 2026-03-02
  • Deciders: Joel, Panda
  • Relates to: ADR-0140, ADR-0146, ADR-0190

Context

Several inference paths keep executing even when their outcomes are unusable (null output, empty rewrite, parse failure). Each such run adds latency and spend but yields nothing.

Current behavior is fail-open with retries and fallback, but nothing remembers repeated no-op outcomes at the action level, so the same unproductive call keeps being attempted.

Decision

1) Introduce per-action no-op circuit state

For each (component, action) pair, maintain circuit state in Redis:

  • closed — normal operation
  • open — skip expensive inference path and use deterministic fallback
  • half-open — probe with limited samples to test recovery

State key format:

  • inference:circuit:<component>:<action>:state
  • inference:circuit:<component>:<action>:stats
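A minimal sketch of how the two keys could be derived for one (component, action) pair. Only the key layout comes from this ADR; the helper name `circuitKeys`, the `CircuitKeys` type, and the rule that segments may not contain ":" are assumptions made for illustration.

```typescript
// Hypothetical helper: builds the Redis key names for one (component, action) pair.
interface CircuitKeys {
  state: string;
  stats: string;
}

function circuitKeys(component: string, action: string): CircuitKeys {
  // Assumption: reject empty segments and ":" so a name cannot shift key segments.
  for (const part of [component, action]) {
    if (part.length === 0 || part.indexOf(":") !== -1) {
      throw new Error(`invalid circuit key segment: "${part}"`);
    }
  }
  const base = `inference:circuit:${component}:${action}`;
  return { state: `${base}:state`, stats: `${base}:stats` };
}
```

For example, `circuitKeys("task-triage", "classify")` (the action name is hypothetical) yields `inference:circuit:task-triage:classify:state` and `inference:circuit:task-triage:classify:stats`.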

2) Define no-op failure signatures

Count these as no-op failures:

  • output is null/empty after normalization
  • required JSON parse fails
  • zero completion where contract requires semantic output
  • explicit rewrite-empty conditions (inference_rewrite_empty)
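The signatures above can be sketched as a single classifier that returns a reason when an outcome counts as a no-op and `null` otherwise. The `InferenceOutcome` shape, the function name `classifyNoOp`, and the reason strings other than `inference_rewrite_empty` are assumptions; the real module may represent outcomes differently.

```typescript
// Reason labels: only "inference_rewrite_empty" is named in this ADR; the others are illustrative.
type NoOpReason = "empty_output" | "json_parse_failed" | "inference_rewrite_empty";

// Hypothetical outcome shape for one inference attempt.
interface InferenceOutcome {
  text: string | null | undefined;
  requiresJson?: boolean; // the action's contract requires parseable JSON
  isRewrite?: boolean;    // the action is a rewrite (empty result has its own signature)
}

function classifyNoOp(outcome: InferenceOutcome): NoOpReason | null {
  // Normalize first so whitespace-only output counts as empty.
  const normalized = (outcome.text ?? "").trim();
  if (normalized.length === 0) {
    return outcome.isRewrite ? "inference_rewrite_empty" : "empty_output";
  }
  if (outcome.requiresJson) {
    try {
      JSON.parse(normalized);
    } catch (_e) {
      return "json_parse_failed";
    }
  }
  return null; // usable output, not a no-op
}
```

In this sketch, "zero completion where the contract requires semantic output" falls under the empty-output branch; a stricter contract check would need per-action validation.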

3) Add configurable thresholds

Defaults (env-overridable):

  • JOELCLAW_INFER_NOOP_THRESHOLD=3
  • JOELCLAW_INFER_NOOP_WINDOW_MS=900000 (15m)
  • JOELCLAW_INFER_NOOP_COOLDOWN_MS=1800000 (30m)
  • JOELCLAW_INFER_HALF_OPEN_PROBES=1
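As an illustration of env-overridable defaults, the following sketch reads the four variables from an environment-like record and falls back to the defaults listed above when a value is missing or not a positive integer. The names `CircuitConfig`, `readPositiveInt`, and `loadCircuitConfig` are hypothetical, and the "invalid values fall back to defaults" rule is an assumption.

```typescript
interface CircuitConfig {
  threshold: number;
  windowMs: number;
  cooldownMs: number;
  halfOpenProbes: number;
}

// Parses a positive integer, returning the fallback for missing or invalid input.
function readPositiveInt(raw: string | undefined, fallback: number): number {
  if (raw === undefined) return fallback;
  const parsed = Number(raw);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}

// Takes the environment as a parameter so the function is easy to test.
function loadCircuitConfig(env: Record<string, string | undefined>): CircuitConfig {
  return {
    threshold: readPositiveInt(env["JOELCLAW_INFER_NOOP_THRESHOLD"], 3),
    windowMs: readPositiveInt(env["JOELCLAW_INFER_NOOP_WINDOW_MS"], 900000),     // 15m
    cooldownMs: readPositiveInt(env["JOELCLAW_INFER_NOOP_COOLDOWN_MS"], 1800000), // 30m
    halfOpenProbes: readPositiveInt(env["JOELCLAW_INFER_HALF_OPEN_PROBES"], 1),
  };
}
```

In a Node worker this would typically be called once at startup as `loadCircuitConfig(process.env)`.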

Behavior:

  1. if no-op failures reach threshold inside window → open circuit,
  2. while open, skip expensive call path and return deterministic fallback,
  3. after cooldown, move to half-open and allow probe calls,
  4. successful probes close circuit; failed probes reopen immediately.
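The four behavior rules can be sketched as a small in-memory state machine. This is not the shipped module: the class name `NoOpCircuit`, its method names, and the choice to pass timestamps in explicitly (to keep it testable) are assumptions. It counts failures inside the window as the rules describe, so a success while closed does not reset the count; a consecutive-failure variant would clear the list on every success.

```typescript
type BreakerState = "closed" | "open" | "half-open";

interface BreakerConfig {
  threshold: number;
  windowMs: number;
  cooldownMs: number;
  halfOpenProbes: number;
}

class NoOpCircuit {
  private cfg: BreakerConfig;
  private state: BreakerState = "closed";
  private failureTimes: number[] = []; // timestamps (ms) of recent no-op failures
  private openedAt = 0;
  private probesUsed = 0;

  constructor(cfg: BreakerConfig) {
    this.cfg = cfg;
  }

  current(): BreakerState {
    return this.state;
  }

  /** Returns true if the expensive inference call may run at time `now`. */
  allow(now: number): boolean {
    // Rule 3: after the cooldown, an open circuit moves to half-open.
    if (this.state === "open" && now - this.openedAt >= this.cfg.cooldownMs) {
      this.state = "half-open";
      this.probesUsed = 0;
    }
    if (this.state === "closed") return true;
    if (this.state === "open") return false; // Rule 2: skip the call
    // Half-open: admit only a limited number of probe calls.
    if (this.probesUsed >= this.cfg.halfOpenProbes) return false;
    this.probesUsed += 1;
    return true;
  }

  recordFailure(now: number): void {
    if (this.state === "open") return;
    if (this.state === "half-open") {
      this.trip(now); // Rule 4: a failed probe reopens immediately
      return;
    }
    // Rule 1: keep only failures inside the window, then compare to threshold.
    this.failureTimes = this.failureTimes.filter((t) => now - t < this.cfg.windowMs);
    this.failureTimes.push(now);
    if (this.failureTimes.length >= this.cfg.threshold) this.trip(now);
  }

  recordSuccess(): void {
    if (this.state !== "half-open") return;
    this.state = "closed"; // Rule 4: a successful probe closes the circuit
    this.failureTimes = [];
  }

  private trip(now: number): void {
    this.state = "open";
    this.openedAt = now;
    this.failureTimes = [];
  }
}
```

With the defaults, three failures within 15 minutes open the circuit, calls are refused for 30 minutes, and then a single probe decides whether it closes or reopens.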

4) Enforce deterministic degraded behavior

When circuit is open, each action must provide one explicit fallback behavior, for example:

  • use raw query without rewrite,
  • return noop + actionable reason,
  • use heuristic classifier,
  • queue human-review event if no safe fallback exists.
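One way to make "one explicit fallback per action" concrete is a registry keyed by (component, action), with human review as the result when nothing is registered. Everything here is a hypothetical sketch: the `DegradedResult` shape, the registry functions, and the example names `search` and `rewrite-query` do not come from the codebase.

```typescript
// Hypothetical result shape for degraded (circuit-open) responses.
type DegradedResult =
  | { kind: "value"; value: string }          // e.g. raw query, heuristic label
  | { kind: "noop"; reason: string }          // noop plus an actionable reason
  | { kind: "human_review"; reason: string }; // no safe fallback exists

type FallbackFn = (input: string) => DegradedResult;

const fallbackRegistry = new Map<string, FallbackFn>();

function registerFallback(component: string, action: string, fn: FallbackFn): void {
  const key = `${component}:${action}`;
  // Exactly one fallback per action: a second registration is a bug.
  if (fallbackRegistry.has(key)) {
    throw new Error(`fallback already registered for ${key}`);
  }
  fallbackRegistry.set(key, fn);
}

function degrade(component: string, action: string, input: string): DegradedResult {
  const fn = fallbackRegistry.get(`${component}:${action}`);
  if (fn === undefined) {
    // No safe fallback registered: escalate instead of guessing.
    return { kind: "human_review", reason: `no fallback registered for ${component}:${action}` };
  }
  return fn(input);
}

// Example registration (names are illustrative): use the raw query without rewrite.
registerFallback("search", "rewrite-query", (rawQuery) => ({ kind: "value", value: rawQuery }));
```

Because the fallback functions are pure and take only the original input, the degraded response for a given input stays deterministic.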

5) Emit high-signal observability events

Required events:

  • inference.circuit.opened
  • inference.circuit.half_open
  • inference.circuit.closed
  • inference.circuit.skipped_call

Each event must include: component, action, reason, failureCounts, windowMs, cooldownMs.
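A possible typed shape for these events is sketched below. The event names and required field names come from this ADR; the `failureCounts` layout (a count per no-op reason) and the builder function `buildCircuitEvent` are assumptions.

```typescript
type CircuitEventName =
  | "inference.circuit.opened"
  | "inference.circuit.half_open"
  | "inference.circuit.closed"
  | "inference.circuit.skipped_call";

interface CircuitEvent {
  name: CircuitEventName;
  component: string;
  action: string;
  reason: string;
  failureCounts: Record<string, number>; // assumed shape: count per no-op reason
  windowMs: number;
  cooldownMs: number;
}

type CircuitEventFields = Omit<CircuitEvent, "name">;

function buildCircuitEvent(name: CircuitEventName, fields: CircuitEventFields): CircuitEvent {
  // The compiler enforces presence; this guards against empty identifiers at runtime.
  if (fields.component === "" || fields.action === "" || fields.reason === "") {
    throw new Error("component, action, and reason must be non-empty");
  }
  // Copy failureCounts so later mutation of the stats does not alter an emitted event.
  return { name, ...fields, failureCounts: { ...fields.failureCounts } };
}
```

Keeping the payload in one type means every emitter is checked against the same required fields at compile time.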

6) Keep control-plane alerts deduped

Gateway/operator alerts fire only on state transitions (closed→open, open→closed). Skipped calls never raise an alert.
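Deduplication can be sketched by remembering the last alerted state per circuit and staying silent unless it changes. The class `TransitionAlerter` and its message format are hypothetical. The sketch treats half-open as invisible to operators, so a failed probe that reopens an already-open circuit produces no second alert; that interpretation is an assumption.

```typescript
type ObservedState = "closed" | "open" | "half-open";
type AlertedState = "closed" | "open";

class TransitionAlerter {
  // Last state an operator was told about, per circuit key.
  private lastAlerted = new Map<string, AlertedState>();

  /** Returns the alert text to send, or null when nothing should be sent. */
  onState(key: string, next: ObservedState): string | null {
    if (next === "half-open") return null; // probing is not an operator-facing change
    const prev = this.lastAlerted.get(key) ?? "closed"; // circuits start closed
    if (prev === next) return null; // no transition, no alert
    this.lastAlerted.set(key, next);
    return `circuit ${key} ${prev}→${next}`;
  }
}
```

A caller would invoke `onState` on every observed state change or skipped call and forward only the non-null results to the gateway.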

Non-goals

  • replacing inference-router model policy,
  • hardcoding provider-specific behavior,
  • permanently disabling actions without recovery path.

Consequences

Good

  • repeated no-op loops become bounded,
  • degraded paths stay available via deterministic fallbacks,
  • operators get concise state transitions instead of noise.

Tradeoffs

  • extra Redis/state complexity,
  • threshold tuning required per action class.

Required Skills (Preflight)

  • system-bus — inference callsites and action metadata
  • inngest-flow-control — cooldown/open/half-open transitions
  • inngest-durable-functions — durable state transition semantics
  • langfuse — no-op signal validation and outcome tracing
  • system-architecture — ensure gate state aligns with worker runtime model

Implementation Plan (vector clock)

  1. V1: add reusable circuit module in packages/system-bus/src/lib/ (state + transitions + helpers).
  2. V2: integrate circuit checks into packages/system-bus/src/lib/inference.ts by (component, action).
  3. V3: wire deterministic fallback responses and skipped_call metadata contract.
  4. V4: emit circuit lifecycle OTEL + Langfuse metadata.
  5. V5: add targeted tests for open/half-open/close and deduped alert behavior.

Verification Checklist

  • repeated no-op outcomes for one action open a circuit within configured threshold/window
  • open circuits skip expensive inference and return deterministic fallback
  • half-open probe behavior closes circuit on success and reopens on failure
  • transition events are emitted exactly once per state transition
  • no-op churn for affected actions drops after rollout (monitor via joelclaw otel search "inference.circuit" --hours 24)

Implementation Progress

V1 (circuit module) — shipped 2026-03-04

Created packages/system-bus/src/lib/inference-circuit.ts. In-memory per-(component, action) circuit breaker for the long-running system-bus worker.

  • Threshold: 3 consecutive no-op failures (configurable via JOELCLAW_INFER_NOOP_THRESHOLD)
  • Window: 15min failure accumulation window (JOELCLAW_INFER_NOOP_WINDOW_MS)
  • Cooldown: 30min before half-open probe (JOELCLAW_INFER_NOOP_COOLDOWN_MS)
  • No-op signatures: empty/null output, JSON parse failure, inference_rewrite_empty, inference_text_output_empty, inference_json_parse_empty
  • OTEL transitions: inference.circuit.opened, inference.circuit.half_open, inference.circuit.closed (not every skip — no spam)

V2 (infer() integration) — shipped 2026-03-04

Wired into packages/system-bus/src/lib/inference.ts:

  • checkCircuit() before each runPiAttempt — open circuit skips pi spawn
  • recordSuccess() on successful inference return
  • recordFailure() on no-op failure signatures
  • circuitState added to OTEL result metadata
  • inference.circuit.skipped_call event when circuit blocks an attempt
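The wiring above might look roughly like the following. The function names `checkCircuit`, `recordSuccess`, and `recordFailure` appear in this ADR, but their signatures here are guesses, and `guardedInfer`, `CircuitApi`, and the result type are hypothetical. The sketch also returns a result object where the real `infer()` appears to throw.

```typescript
type GateState = "closed" | "open" | "half-open";

// Assumed signatures; the real module's API may differ.
interface CircuitApi {
  checkCircuit(component: string, action: string): { allowed: boolean; state: GateState };
  recordSuccess(component: string, action: string): void;
  recordFailure(component: string, action: string, reason: string): void;
}

type GuardedResult =
  | { ok: true; text: string; circuitState: GateState }
  | { ok: false; skipped: boolean; reason: string; circuitState: GateState };

async function guardedInfer(
  circuit: CircuitApi,
  component: string,
  action: string,
  attempt: () => Promise<string | null>, // stands in for one runPiAttempt call
): Promise<GuardedResult> {
  const gate = circuit.checkCircuit(component, action);
  if (!gate.allowed) {
    // Circuit blocks the attempt: the expensive spawn never happens.
    return { ok: false, skipped: true, reason: "circuit_open", circuitState: gate.state };
  }
  const text = await attempt();
  if (text === null || text.trim().length === 0) {
    circuit.recordFailure(component, action, "empty_output");
    return { ok: false, skipped: false, reason: "empty_output", circuitState: gate.state };
  }
  circuit.recordSuccess(component, action);
  return { ok: true, text, circuitState: gate.state };
}
```

The `circuitState` field on every result mirrors the metadata item above, so traces can show which state a call ran under.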

All 28 (component, action) callsites across system-bus are protected automatically. Circuits start closed, so behavior is unchanged until failures accumulate.

V3 (deterministic fallbacks) — implicit

Each calling function handles its own degraded behavior when inference throws. The circuit prevents the expensive pi spawn; callers already have try/catch with appropriate fallback (e.g., task-triage returns status: degraded).

V4 (OTEL lifecycle) — shipped with V1

Transition events are emitted on state changes only, not on every skip. Deduped by design.