ADR-0214 (accepted)

Gateway Redis-Degraded Operation Mode

2026-03-05T00:00:00.000Z

2026-03-06 implementation note

The first meaningful slice is now live in joelclaw:

- The gateway daemon exposes explicit runtime health with mode: normal | redis_degraded.
- The Redis channel emits runtime mode transition telemetry.
- joelclaw gateway status prefers daemon health when Redis is down instead of reporting a fake total outage.
- joelclaw gateway diagnose treats redis_degraded as a degraded runtime and skips Redis-only E2E checks in that mode.
- Session pressure is surfaced alongside runtime mode so operators can see when the daemon is alive but under stress.

This does not mean every degraded capability is feature-complete. It means the decision is accepted and the operator-visible contract is now real.

Context

Today, Redis/Inngest substrate instability can make gateway behavior look dead even when the daemon process is alive and channel adapters are still running.

Observed operational failure shape:

- joelclaw gateway status|events|restart fail on Redis connection closure.
- Telegram can show a transient typing... indicator with no useful completion when the queue/event path is blocked.
- The current startup path treats Redis as effectively mandatory for core orchestration paths.

This violates the availability posture for gateway as an always-on operator surface. We need deterministic behavior when Redis is unavailable: reduced capability, not total functional ambiguity.

Decision

Add an explicit Redis-degraded mode for gateway runtime.

1) Runtime mode contract

Gateway mode becomes one of:

- normal: Redis channel and event bridge healthy.
- redis_degraded: Redis unavailable; gateway remains online with reduced capabilities.

Mode changes must emit OTEL events and be visible in CLI health/status output.
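
A minimal sketch of how the mode contract could be held in the daemon, assuming a single in-process state holder. The names RuntimeModeState, ModeTransition, and the emit callback are hypothetical; only the two mode values come from this ADR, and emit stands in for whatever OTEL emitter the daemon already uses.

```typescript
// Hypothetical types: the ADR fixes only the two mode values.
type RuntimeMode = "normal" | "redis_degraded";

interface ModeTransition {
  from: RuntimeMode;
  to: RuntimeMode;
  reason: string;
  at: string; // ISO-8601 timestamp
}

class RuntimeModeState {
  private mode: RuntimeMode = "normal";
  private readonly emit: (event: ModeTransition) => void;

  // `emit` stands in for the daemon's telemetry emitter (an assumption).
  constructor(emit: (event: ModeTransition) => void) {
    this.emit = emit;
  }

  get current(): RuntimeMode {
    return this.mode;
  }

  // Returns true only when the mode actually changed, so each
  // transition produces exactly one telemetry event.
  transition(to: RuntimeMode, reason: string): boolean {
    if (to === this.mode) return false;
    const event: ModeTransition = {
      from: this.mode,
      to,
      reason,
      at: new Date().toISOString(),
    };
    this.mode = to;
    this.emit(event);
    return true;
  }
}
```

Making transition a no-op when the mode is unchanged keeps repeated Redis errors from flooding telemetry with duplicate events.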

2) Keep direct channel interaction alive without Redis

When Redis is unavailable:

- Inbound channel messages (Telegram/iMessage/Slack/Discord) still enqueue locally and process through the command queue + pi session.
- Outbound channel replies still send normally.
- A Redis bootstrap failure is not a hard fail.

This preserves core “talk to gateway” behavior during substrate outages.
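
The local-only path can be illustrated roughly as follows. This is a sketch under assumptions, not the daemon's actual code: LocalCommandQueue, bootstrapMode, and the handle callback (standing in for the pi session) are hypothetical names, and the Redis attach is synchronous here only to keep the example short.

```typescript
type RuntimeMode = "normal" | "redis_degraded";

interface InboundMessage {
  channel: "telegram" | "imessage" | "slack" | "discord";
  text: string;
}

// Hypothetical in-process queue; the real command queue lives in the daemon.
class LocalCommandQueue {
  private readonly items: InboundMessage[] = [];

  enqueue(message: InboundMessage): void {
    this.items.push(message);
  }

  // Hands each queued message to `handle` (standing in for the pi session)
  // and returns the replies in arrival order.
  drain(handle: (message: InboundMessage) => string): string[] {
    const replies: string[] = [];
    let next = this.items.shift();
    while (next !== undefined) {
      replies.push(handle(next));
      next = this.items.shift();
    }
    return replies;
  }
}

// A failed Redis attach selects the runtime mode instead of aborting startup.
// A real attach would be asynchronous; it is synchronous here for brevity.
function bootstrapMode(attachRedis: () => void): RuntimeMode {
  try {
    attachRedis();
    return "normal";
  } catch {
    return "redis_degraded";
  }
}
```

The point of the shape is that the queue never consults the runtime mode: direct channel traffic takes the same local path whether or not Redis attached.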

3) Degrade Redis-dependent features explicitly

In redis_degraded mode, mark these capabilities degraded (not silently broken):

- Redis event bridge ingestion (joelclaw:notify:*)
- Durable replay paths dependent on message store Redis streams
- Redis-based Telegram poll-owner lease durability (fall back to direct polling + existing 409 backoff)
- Redis-backed operational commands that require queue inspection/mutation

Each degraded feature must produce a clear status reason.
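
One way the per-capability status reasons could be represented is sketched below. The capability names and reason strings are illustrative assumptions, not the shipped status payload; only the four degraded areas themselves come from the list above.

```typescript
type RuntimeMode = "normal" | "redis_degraded";

interface CapabilityStatus {
  name: string;
  degraded: boolean;
  reason?: string;
}

// Capability names and reason strings are illustrative, not the shipped payload.
const REDIS_DEPENDENT_CAPABILITIES: ReadonlyArray<{ name: string; reason: string }> = [
  { name: "redis_event_bridge", reason: "joelclaw:notify:* ingestion paused while Redis is unavailable" },
  { name: "durable_replay", reason: "message store Redis streams unreachable" },
  { name: "telegram_poll_lease", reason: "using direct polling with 409 backoff instead of the Redis lease" },
  { name: "queue_operations", reason: "queue inspection and mutation require Redis" },
];

// Every Redis-dependent capability is listed in both modes; a reason is
// attached only when the capability is actually degraded.
function capabilityReport(mode: RuntimeMode): CapabilityStatus[] {
  return REDIS_DEPENDENT_CAPABILITIES.map(({ name, reason }) =>
    mode === "redis_degraded" ? { name, degraded: true, reason } : { name, degraded: false },
  );
}
```

Listing the capabilities in normal mode as well gives operators a stable set of keys to watch, rather than entries that appear only during an incident.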

4) Self-healing recovery loop

Gateway must keep retrying Redis attach in background (bounded exponential backoff). On successful reattach:

- Transition redis_degraded -> normal.
- Resume Redis-backed integrations.
- Emit recovery telemetry.

No daemon restart should be required for recovery.
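
The recovery loop might look roughly like this. It is a sketch, assuming a 500 ms base delay and a 30 s cap (neither value is specified by this ADR); attach, onRecovered, and sleep are hypothetical injection points, and jitter is omitted for brevity.

```typescript
// Delay doubles per attempt and is capped at maxMs. The defaults are assumptions.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 30_000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Retries `attach` until it succeeds, then calls `onRecovered`, the point where
// the daemon would transition redis_degraded -> normal and emit recovery telemetry.
// Resolves with the number of failed attempts before success.
async function reattachLoop(
  attach: () => Promise<void>,
  onRecovered: () => void,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<number> {
  for (let attempt = 0; ; attempt++) {
    try {
      await attach();
    } catch {
      // Attach failed: wait out the bounded delay, then try again.
      await sleep(backoffDelayMs(attempt));
      continue;
    }
    onRecovered();
    return attempt;
  }
}
```

The backoff bounds the delay between attempts, not the number of attempts, so the loop keeps trying for as long as the daemon runs. Calling onRecovered outside the try block means an error in the recovery handler is not mistaken for another attach failure.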

gateway — daemon lifecycle + operator contracts
gateway-diagnose — failure-layer semantics and diagnostic UX
system-architecture — substrate vs daemon boundary and event path clarity

Files and surfaces

packages/gateway/src/daemon.ts
- Introduce runtime mode state (normal / redis_degraded).
- Make Redis channel start non-fatal; run local-only queue path when unavailable.
- Add background reconnect loop and mode transitions.
packages/gateway/src/channels/redis.ts
- Expose explicit attach/detach/retry status hooks for daemon mode management.
- Ensure failures are reportable as structured reasons.
packages/gateway/src/channels/telegram.ts
- Keep Redis lease as preferred path.
- In degraded mode, use lease fallback path intentionally and emit explicit degraded ownership telemetry.
packages/cli/src/commands/gateway.ts
- Extend gateway status payload with mode + degraded capability list.
- Ensure status remains meaningful even when Redis is down.
packages/cli/src/commands/gateway.ts (diagnose path)
- Add Redis-degraded classification distinct from full process failure.
docs/gateway.md + skills/gateway-diagnose/SKILL.md
- Document degraded-mode behavior, operator expectations, and recovery signals.
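
For the CLI status and diagnose changes, the classification could be sketched as below. Probe, Check, classify, and checksToRun are hypothetical names, and the way the CLI actually reaches the daemon health surface is not shown; the sketch only illustrates separating a Redis-degraded runtime from a full process failure.

```typescript
type RuntimeMode = "normal" | "redis_degraded";
type Classification = "healthy" | "redis_degraded" | "process_down";

// Hypothetical probe results gathered by the CLI before classifying.
interface Probe {
  daemonAlive: boolean;
  daemonMode?: RuntimeMode; // reported by the daemon health surface when reachable
  redisReachable: boolean;
}

// Daemon health takes precedence: Redis being down counts as a total outage
// only when the daemon itself is not answering.
function classify(probe: Probe): Classification {
  if (!probe.daemonAlive) return "process_down";
  if (probe.daemonMode === "redis_degraded" || !probe.redisReachable) return "redis_degraded";
  return "healthy";
}

interface Check {
  name: string;
  requiresRedis: boolean;
}

// Redis-only end-to-end checks are skipped in degraded mode rather than
// reported as failures.
function checksToRun(classification: Classification, checks: Check[]): Check[] {
  if (classification === "process_down") return [];
  if (classification === "redis_degraded") return checks.filter((c) => !c.requiresRedis);
  return checks;
}
```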

Verification criteria

- Gateway can start and process direct Telegram inbound/outbound traffic while Redis is unavailable.
- gateway status returns an explicit mode: redis_degraded with a capability degradation list.
- OTEL emits mode transition events for both entering degraded mode and recovering from it.
- A Redis reconnect transitions the runtime back to normal without a restart.
- Existing normal-mode behavior remains unchanged when Redis is healthy.

Consequences

Good

- Operators can still use the gateway during Redis incidents.
- Failures become explicit and diagnosable instead of ambiguous no-op states.
- The recovery path becomes automatic and observable.

Tradeoffs

- More runtime-state complexity in the daemon.
- Some features remain unavailable in degraded mode by design.
- Requires careful UX in CLI/diagnose to avoid false confidence.

Non-goals

- Running full Inngest/event-bus durability without Redis.
- Guaranteeing cross-instance Telegram poll ownership durability while Redis is down.
- Solving k8s/Colima substrate outages (this ADR covers only gateway behavior under those outages).