ADR-0091 (shipped)

Gateway Model Fallback

Context

The gateway daemon runs on claude-opus-4-6 (Anthropic). When the Anthropic API is slow or down, the gateway goes unresponsive — messages queue up, heartbeats bounce, and the only recovery is manual restart. On 2026-02-21, the API was sluggish enough that Opus took 4+ minutes with no streaming tokens on a simple test prompt.

The gateway also had a prompt-racing bug (fixed the same day: the drain loop now gates on turn_end), but API slowness is a separate failure mode that needs its own mitigation.

Decision

Add automatic model fallback to the gateway daemon:

  1. Dual trigger: Fallback activates on EITHER:

    • Timeout: No streaming tokens received within 90s of prompt dispatch
    • Consecutive failures: 2+ prompt errors (catches API errors, auth failures, rate limits)
  2. Fallback model: Configurable via Redis (joelclaw:gateway:config keys fallbackProvider + fallbackModel). Default: openai-codex/gpt-5.4. Cross-provider fallback is the standard path; legacy Anthropic fallback configs are remapped on startup when the embedded pi model registry supports the preferred fallback.

  3. Hot-swap via session.setModel(): Pi’s AgentSession.setModel() validates API key, updates session state, re-clamps thinking level. No restart needed.

  4. Recovery: After successful fallback response, set a recovery timer. After 10 minutes on fallback, probe the live primary session model and switch back if it succeeds. Recovery must use the actual active session.model as truth, not only the requested gateway config, so provider/model drift cannot make probe state lie.

  5. Startup reconciliation: when the daemon resumes an existing pi session, it must reconcile session.model back to the Redis-configured primary model before fallback control initializes. Resume continuity is good; silently preserving the last fallback/manual model is not.

  6. Notification: Telegram alert on fallback activation and recovery. Gateway status endpoint reports current model and fallback state.
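The dual trigger in step 1 can be sketched as a small state machine. This is a minimal illustration, not the gateway's actual internals: FallbackTrigger, onTrip, and the method names are all hypothetical.

```typescript
// Hypothetical sketch of the dual-trigger fallback logic.
type TripReason = "timeout" | "consecutive_failures";

interface TriggerConfig {
  fallbackTimeoutMs: number;      // no-first-token timeout per prompt
  fallbackAfterFailures: number;  // consecutive prompt errors before tripping
}

class FallbackTrigger {
  private consecutiveFailures = 0;
  private timer?: ReturnType<typeof setTimeout>;

  constructor(
    private cfg: TriggerConfig,
    private onTrip: (reason: TripReason) => void,
  ) {}

  // Arm the no-token timeout when a prompt is dispatched.
  promptDispatched(): void {
    this.timer = setTimeout(() => this.onTrip("timeout"), this.cfg.fallbackTimeoutMs);
  }

  // The first streaming token disarms the timeout.
  firstTokenReceived(): void {
    if (this.timer !== undefined) clearTimeout(this.timer);
    this.timer = undefined;
  }

  // A successful turn resets the failure counter.
  promptSucceeded(): void {
    this.firstTokenReceived();
    this.consecutiveFailures = 0;
  }

  // An errored turn counts toward the consecutive-failure threshold.
  promptFailed(): void {
    this.firstTokenReceived();
    if (++this.consecutiveFailures >= this.cfg.fallbackAfterFailures) {
      this.consecutiveFailures = 0;
      this.onTrip("consecutive_failures");
    }
  }
}
```

Either path calls the same onTrip handler, which would perform the session.setModel() hot-swap from step 3.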

Consequences

  • Gateway stays responsive even during provider outages
  • Cross-provider fallback requires API keys for both providers in the environment
  • Fallback model may have different capabilities (e.g., codex-spark has no image support, smaller context)
  • Session context is preserved across model swaps — pi handles this internally
  • Recovery probe adds one extra API call every 10 minutes while on fallback
  • If configured fallback matches the active primary model, startup should remap to a distinct compatible fallback instead of pretending a no-op fallback is useful
  • Resumed sessions can preserve stale model state; startup must reconcile back to the configured primary or the daemon can look healthy while running the wrong model

Configuration

Redis key joelclaw:gateway:config gains:

  • fallbackProvider: string (default: “openai-codex”)
  • fallbackModel: string (default: “gpt-5.4”)
  • fallbackTimeoutMs: number (default: 120000, but floored to 240000 when primary is claude-opus-4-6) — raised from 90s on 2026-02-22, then Opus-primary floor added on 2026-03-12
  • fallbackAfterFailures: number (default: 3) — raised from 2 on 2026-02-22
  • recoveryProbeIntervalMs: number (default: 600000)
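A minimal sketch of how these keys might be merged with defaults, including the Opus-primary timeout floor noted above. The key names and default values come from this ADR; resolveConfig and the type name are illustrative, not the gateway's real code.

```typescript
// Config keys and defaults per ADR-0091; merge helper is hypothetical.
interface GatewayFallbackConfig {
  fallbackProvider: string;
  fallbackModel: string;
  fallbackTimeoutMs: number;
  fallbackAfterFailures: number;
  recoveryProbeIntervalMs: number;
}

const DEFAULTS: GatewayFallbackConfig = {
  fallbackProvider: "openai-codex",
  fallbackModel: "gpt-5.4",
  fallbackTimeoutMs: 120_000,
  fallbackAfterFailures: 3,
  recoveryProbeIntervalMs: 600_000,
};

// Merge Redis-sourced overrides over defaults, then apply the
// Opus-primary floor (2026-03-12): never time out under 240s when
// the configured primary is claude-opus-4-6.
function resolveConfig(
  raw: Partial<GatewayFallbackConfig>,
  primaryModel: string,
): GatewayFallbackConfig {
  const cfg = { ...DEFAULTS, ...raw };
  if (primaryModel === "claude-opus-4-6") {
    cfg.fallbackTimeoutMs = Math.max(cfg.fallbackTimeoutMs, 240_000);
  }
  return cfg;
}
```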

The gateway package’s embedded @mariozechner/pi-ai / @mariozechner/pi-coding-agent versions must stay aligned with the model policy. Machine pi --version is not enough if packages/gateway/package.json still pins an older model catalog.

Threshold Tuning (2026-02-22)

First 72h showed 8 activations, which was too aggressive. Raised the timeout from 90s to 120s and the failure threshold from 2 to 3, and added comprehensive o11y:

  • prompt.latency: per-prompt TTFT and total duration, emitted on every turn end
  • prompt.near_miss: warn when a prompt takes >75% of the timeout, an early signal before actual trips
  • Recovery probes: elevated to info-level logging with downtime and probe count
  • Activation context: prompt elapsed time, TTFT, and threshold config included in swap events
  • Opus primary floor (2026-03-12): when the configured primary model is claude-opus-4-6, clamp fallbackTimeoutMs to at least 240000. By then, OTEL and live logs showed repeated near-misses plus real fallback churn against the stale 120s threshold, while the ADR context above already recorded Opus first-token latency beyond two minutes.

This data enables evidence-based threshold tuning going forward.
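The near-miss check is simple arithmetic over the measured prompt duration and the configured timeout. A sketch, with an assumed isNearMiss helper name:

```typescript
// A prompt is a "near miss" when it consumed more than `ratio` of the
// configured timeout without actually tripping it.
function isNearMiss(elapsedMs: number, timeoutMs: number, ratio = 0.75): boolean {
  return elapsedMs > timeoutMs * ratio && elapsedMs < timeoutMs;
}
```

Emitting this as a warning on every turn end gives threshold-tuning evidence before any fallback actually fires.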

Implementation note (2026-03-12, resilience follow-up): fallback control now borrows OpenClaw’s probe-discipline pattern instead of treating the primary as infinitely pokeable. Recovery probes emit structured model_fallback.decision telemetry with explicit reason buckets (timeout, consecutive_failures, rate_limit, provider_overloaded, auth, etc.), and failed probes now set a probe backoff window so the daemon stops hammering the same sick provider every recovery tick. The contract shift is small but important: fallback is still automatic, but recovery probing is now rate-aware instead of chatter-prone.
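The probe-backoff contract described above might look like the following sketch; ProbeBackoff and its method names are illustrative, not OpenClaw's or the gateway's actual API.

```typescript
// Hypothetical probe-backoff: a failed recovery probe opens a backoff
// window during which further probes are skipped; a success clears it.
class ProbeBackoff {
  private blockedUntil = 0; // epoch ms before which probing is suppressed

  constructor(private backoffMs: number) {}

  // Called each recovery tick to decide whether to probe the primary.
  shouldProbe(nowMs: number): boolean {
    return nowMs >= this.blockedUntil;
  }

  // A failed probe stops the daemon from hammering a sick provider.
  probeFailed(nowMs: number): void {
    this.blockedUntil = nowMs + this.backoffMs;
  }

  // A successful probe (primary recovered) resets the window.
  probeSucceeded(): void {
    this.blockedUntil = 0;
  }
}
```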