ADR-0214 (accepted)

Gateway Redis-Degraded Operation Mode

2026-03-06 implementation note

The first meaningful slice is now live in joelclaw:

  • gateway daemon exposes explicit runtime health with mode: normal | redis_degraded
  • Redis channel emits runtime mode transition telemetry
  • joelclaw gateway status prefers daemon health when Redis is down instead of reporting a fake total outage
  • joelclaw gateway diagnose treats redis_degraded as a degraded runtime and skips Redis-only E2E checks in that mode
  • session pressure is surfaced alongside runtime mode so operators can see when the daemon is alive but under stress

This does not mean every degraded capability is feature-complete. It means the decision is accepted and the operator-visible contract is now real.

Context

Today, Redis/Inngest substrate instability can make the gateway look dead even when the daemon process is alive and channel adapters are still running.

Observed operational failure shape:

  • joelclaw gateway status|events|restart fail on Redis connection closure.
  • Telegram can show a transient typing... indicator with no useful completion when the queue/event path is blocked.
  • Current startup path treats Redis as effectively mandatory for core orchestration paths.

This violates the availability posture for gateway as an always-on operator surface. We need deterministic behavior when Redis is unavailable: reduced capability, not total functional ambiguity.

Decision

Add an explicit Redis-degraded mode for gateway runtime.

1) Runtime mode contract

Gateway mode becomes one of:

  • normal — Redis channel and event bridge healthy.
  • redis_degraded — Redis unavailable; gateway remains online with reduced capabilities.

Mode changes must emit OTEL events and be visible in CLI health/status output.
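As an illustration of the mode contract, here is a minimal sketch of a mode holder that reports each change exactly once. Only the two mode values come from this ADR; `RuntimeModeState`, `ModeTransition`, and the `emit` callback are hypothetical names, and the callback stands in for whatever OTEL emitter the daemon actually uses.

```typescript
// Hypothetical sketch: only the two mode values are taken from the ADR.
type RuntimeMode = "normal" | "redis_degraded";

interface ModeTransition {
  from: RuntimeMode;
  to: RuntimeMode;
  reason: string;
  at: string; // ISO-8601 timestamp of the change
}

class RuntimeModeState {
  private mode: RuntimeMode = "normal";
  private readonly emit: (t: ModeTransition) => void;

  // `emit` is a placeholder for the real telemetry sink.
  constructor(emit: (t: ModeTransition) => void) {
    this.emit = emit;
  }

  get current(): RuntimeMode {
    return this.mode;
  }

  // Returns true only when the mode actually changed. Repeated calls with
  // the current mode are ignored, so an ongoing outage emits one event.
  set(to: RuntimeMode, reason: string): boolean {
    if (to === this.mode) return false;
    const transition: ModeTransition = {
      from: this.mode,
      to,
      reason,
      at: new Date().toISOString(),
    };
    this.mode = to;
    this.emit(transition);
    return true;
  }
}
```

Keeping the emit call inside `set` means CLI health output and telemetry read from the same state, which is one way to keep the two from drifting apart.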

2) Keep direct channel interaction alive without Redis

When Redis is unavailable:

  • inbound channel messages (Telegram/iMessage/Slack/Discord) still enqueue locally and process through command queue + pi session.
  • outbound channel replies still send normally.
  • no hard fail on Redis bootstrap.

This preserves core “talk to gateway” behavior during substrate outages.
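The "no hard fail on Redis bootstrap" rule could look roughly like the sketch below. It assumes the daemon can start its local command queue independently of Redis; `startGateway`, `attachRedis`, and `startLocalQueue` are hypothetical names, not the actual daemon API.

```typescript
// Hypothetical sketch of a non-fatal Redis bootstrap.
type RuntimeMode = "normal" | "redis_degraded";

interface StartResult {
  mode: RuntimeMode;
  reason?: string; // set only when startup fell back to degraded mode
}

async function startGateway(
  attachRedis: () => Promise<void>, // assumed to reject when Redis is unreachable
  startLocalQueue: () => void,      // assumed to have no Redis dependency
): Promise<StartResult> {
  // Start the local path first so channel messages can flow either way.
  startLocalQueue();
  try {
    await attachRedis();
    return { mode: "normal" };
  } catch (err) {
    // A Redis failure changes the mode; it does not abort startup.
    const reason = err instanceof Error ? err.message : String(err);
    return { mode: "redis_degraded", reason };
  }
}
```

The ordering is the point of the sketch: the local queue is running before Redis is ever contacted, so a failed attach has nothing to tear down.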

3) Degrade Redis-dependent features explicitly

In redis_degraded mode, mark these capabilities degraded (not silently broken):

  • Redis event bridge ingestion (joelclaw:notify:*)
  • durable replay paths dependent on message store Redis streams
  • Redis-based Telegram poll-owner lease durability (fall back to direct polling + existing 409 backoff)
  • Redis-backed operational commands that require queue inspection/mutation

Each degraded feature must produce a clear status reason.
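One possible shape for the per-capability status is sketched below. The four entries mirror the list above, but the capability ids, the reason strings, and the `capabilityReport` function are illustrative assumptions rather than the real payload.

```typescript
// Hypothetical sketch: ids and reason strings are placeholders.
type RuntimeMode = "normal" | "redis_degraded";

interface CapabilityStatus {
  capability: string;
  state: "ok" | "degraded";
  reason?: string; // present whenever state is "degraded"
}

const REDIS_DEPENDENT: ReadonlyArray<{ capability: string; degradedReason: string }> = [
  { capability: "redis_event_bridge", degradedReason: "joelclaw:notify:* ingestion paused: Redis unavailable" },
  { capability: "durable_replay", degradedReason: "message store Redis streams unreachable" },
  { capability: "telegram_poll_lease", degradedReason: "using direct polling with 409 backoff: lease not durable" },
  { capability: "queue_ops_commands", degradedReason: "queue inspection/mutation requires Redis" },
];

// Every Redis-dependent capability is always listed, so a reader of the
// report sees "degraded" with a reason instead of a missing entry.
function capabilityReport(mode: RuntimeMode): CapabilityStatus[] {
  return REDIS_DEPENDENT.map(({ capability, degradedReason }): CapabilityStatus =>
    mode === "redis_degraded"
      ? { capability, state: "degraded", reason: degradedReason }
      : { capability, state: "ok" },
  );
}
```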

4) Self-healing recovery loop

Gateway must keep retrying Redis attach in background (bounded exponential backoff). On successful reattach:

  • transition redis_degraded -> normal
  • resume Redis-backed integrations
  • emit recovery telemetry

No daemon restart should be required for recovery.
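A minimal sketch of such a recovery loop follows. It reads "bounded" as a cap on the delay between attempts while retrying indefinitely, which is an assumption; the 500 ms base, the 30 s cap, and the names `backoffDelayMs` and `reattachLoop` are likewise illustrative.

```typescript
// Hypothetical sketch: delay doubles per attempt and is capped at capMs.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 30_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

async function reattachLoop(
  attachRedis: () => Promise<void>,         // assumed to reject while Redis is down
  onRecovered: (attempts: number) => void,  // e.g. switch mode back and emit telemetry
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<void> {
  for (let attempt = 0; ; attempt++) {
    try {
      await attachRedis();
      onRecovered(attempt + 1);
      return; // reattached; the daemon keeps running throughout
    } catch {
      // Still down: wait, then try again with a longer (capped) delay.
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

Injecting `sleep` keeps the loop testable without real timers. A production version would probably also add jitter and a cancellation signal for shutdown, neither of which is shown here.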

5) Operator-visible diagnostics

joelclaw gateway status and joelclaw gateway diagnose must surface mode and degraded capabilities as first-class fields, not inferred from log scraping.
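To make "first-class fields" concrete, the sketch below shows one way the CLI could build a status report from daemon health, treating a Redis error on its own as context rather than as an outage. The types, the field names, and `buildStatus` are hypothetical; the real `gateway status` payload may differ.

```typescript
// Hypothetical sketch of a status payload with mode as an explicit field.
type RuntimeMode = "normal" | "redis_degraded";

interface DaemonHealth {
  mode: RuntimeMode;
  degradedCapabilities: string[];
}

type StatusReport =
  | { ok: true; source: "daemon"; mode: RuntimeMode; degradedCapabilities: string[] }
  | { ok: false; source: "none"; reason: string };

// `daemon` is null when the daemon did not answer its health probe.
// `redisError` is whatever the CLI saw when it tried Redis, if anything.
function buildStatus(daemon: DaemonHealth | null, redisError: string | null): StatusReport {
  if (daemon !== null) {
    // The daemon answered, so the gateway is up regardless of Redis.
    return {
      ok: true,
      source: "daemon",
      mode: daemon.mode,
      degradedCapabilities: daemon.degradedCapabilities,
    };
  }
  return {
    ok: false,
    source: "none",
    reason: redisError ? `daemon unreachable; redis: ${redisError}` : "daemon unreachable",
  };
}
```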

Implementation Plan (proposed)

Required skills preflight

  • gateway — daemon lifecycle + operator contracts
  • gateway-diagnose — failure-layer semantics and diagnostic UX
  • system-architecture — substrate vs daemon boundary and event path clarity

Files and surfaces

  1. packages/gateway/src/daemon.ts

    • Introduce runtime mode state (normal / redis_degraded).
    • Make Redis channel start non-fatal; run local-only queue path when unavailable.
    • Add background reconnect loop and mode transitions.
  2. packages/gateway/src/channels/redis.ts

    • Expose explicit attach/detach/retry status hooks for daemon mode management.
    • Ensure failures are reportable as structured reasons.
  3. packages/gateway/src/channels/telegram.ts

    • Keep Redis lease as preferred path.
    • In degraded mode, deliberately use the fallback path (direct polling + existing 409 backoff) and emit explicit degraded-ownership telemetry.
  4. packages/cli/src/commands/gateway.ts

    • Extend gateway status payload with mode + degraded capability list.
    • Ensure status remains meaningful even when Redis is down.
  5. packages/cli/src/commands/gateway.ts (diagnose path)

    • Add Redis-degraded classification distinct from full process failure.
  6. docs/gateway.md + skills/gateway-diagnose/SKILL.md

    • Document degraded-mode behavior, operator expectations, and recovery signals.

Verification criteria

  • Gateway can start and process direct Telegram inbound/outbound while Redis is unavailable.
  • gateway status returns explicit mode: redis_degraded with capability degradation list.
  • OTEL emits mode transition events for both enter and recovery.
  • Redis reconnect without restart transitions runtime back to normal.
  • Existing normal-mode behavior remains unchanged when Redis is healthy.

Consequences

Good

  • Operator can still use gateway during Redis incidents.
  • Failures become explicit and diagnosable instead of ambiguous no-op states.
  • Recovery path becomes automatic and observable.

Tradeoffs

  • More runtime-state complexity in daemon.
  • Some features remain unavailable in degraded mode by design.
  • Requires careful UX in CLI/diagnose to avoid false confidence.

Non-goals

  • Running full Inngest/event-bus durability without Redis.
  • Guaranteeing cross-instance Telegram poll ownership durability while Redis is down.
  • Solving k8s/Colima substrate outages in this ADR (this ADR is gateway behavior under those outages).