Gateway Redis-Degraded Operation Mode
2026-03-06 implementation note
The first meaningful slice is now live in joelclaw:
- gateway daemon exposes explicit runtime health with mode: normal | redis_degraded
- Redis channel emits runtime mode transition telemetry
- joelclaw gateway status prefers daemon health when Redis is down instead of reporting a fake total outage
- joelclaw gateway diagnose treats redis_degraded as a degraded runtime and skips Redis-only E2E checks in that mode
- session pressure is surfaced alongside runtime mode so operators can see when the daemon is alive but under stress
This does not mean every degraded capability is feature-complete. It means the decision is accepted and the operator-visible contract is now real.
Context
Today, Redis/Inngest substrate instability can make gateway behavior look dead even when the daemon process is alive and channel adapters are still running.
Observed operational failure shape:
- joelclaw gateway status|events|restart fail on Redis connection closure.
- Telegram can show transient typing... with no useful completion when the queue/event path is blocked.
- Current startup path treats Redis as effectively mandatory for core orchestration paths.
This violates the availability posture for gateway as an always-on operator surface. We need deterministic behavior when Redis is unavailable: reduced capability, not total functional ambiguity.
Decision
Add an explicit Redis-degraded mode for gateway runtime.
1) Runtime mode contract
Gateway mode becomes one of:
- normal: Redis channel and event bridge healthy.
- redis_degraded: Redis unavailable; gateway remains online with reduced capabilities.
Mode changes must emit OTEL events and be visible in CLI health/status output.
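The mode contract can be sketched in TypeScript as follows. This is a minimal illustration, not the daemon's actual code: `RuntimeModeTracker`, `ModeTransition`, and the injected `emit` callback (standing in for the OTEL emitter) are hypothetical names.

```typescript
// Hypothetical sketch: names below are illustrative, not the gateway's actual API.
type GatewayMode = "normal" | "redis_degraded";

interface ModeTransition {
  from: GatewayMode;
  to: GatewayMode;
  reason: string;
  at: number; // epoch milliseconds
}

class RuntimeModeTracker {
  private mode: GatewayMode = "normal";
  // Stand-in for whatever telemetry sink the daemon uses (assumed).
  private readonly emit: (t: ModeTransition) => void;

  constructor(emit: (t: ModeTransition) => void) {
    this.emit = emit;
  }

  get current(): GatewayMode {
    return this.mode;
  }

  // Returns true only when the mode actually changed, so repeated
  // failures while already degraded do not emit duplicate events.
  transition(to: GatewayMode, reason: string, now: number = Date.now()): boolean {
    if (to === this.mode) return false;
    const event: ModeTransition = { from: this.mode, to, reason, at: now };
    this.mode = to;
    this.emit(event);
    return true;
  }
}
```

Emitting only on an actual change keeps one telemetry event per transition, which is what makes "enter" and "recovery" countable in the verification criteria.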
2) Keep direct channel interaction alive without Redis
When Redis is unavailable:
- inbound channel messages (Telegram/iMessage/Slack/Discord) still enqueue locally and process through command queue + pi session.
- outbound channel replies still send normally.
- no hard fail on Redis bootstrap.
This preserves core “talk to gateway” behavior during substrate outages.
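One way to picture the intake path is a router that always enqueues locally and treats Redis persistence as best-effort. The sketch below is an assumption-laden illustration: `InboundRouter`, `DurableStore`, and `skippedPersists` are hypothetical names, and the real command queue and pi session handoff are reduced to a callback.

```typescript
// Hypothetical sketch: types and names are illustrative, not the gateway's code.
interface InboundMessage {
  channel: "telegram" | "imessage" | "slack" | "discord";
  text: string;
}

// Optional Redis-backed persistence; may throw while Redis is unavailable.
type DurableStore = (msg: InboundMessage) => void;

class InboundRouter {
  private readonly queue: InboundMessage[] = [];
  private readonly store: DurableStore | null;
  skippedPersists = 0;

  constructor(store: DurableStore | null) {
    this.store = store;
  }

  // Always enqueue locally; persistence is best-effort and never blocks intake.
  accept(msg: InboundMessage): void {
    this.queue.push(msg);
    if (this.store === null) {
      this.skippedPersists += 1;
      return;
    }
    try {
      this.store(msg);
    } catch {
      // Redis path failed; the local queue still holds the message.
      this.skippedPersists += 1;
    }
  }

  // Hands queued messages to the session handler in arrival order.
  drain(handle: (msg: InboundMessage) => void): number {
    let handled = 0;
    let next = this.queue.shift();
    while (next !== undefined) {
      handle(next);
      handled += 1;
      next = this.queue.shift();
    }
    return handled;
  }
}
```

The counter of skipped persists is one possible input for a status reason; the design point is only that a store failure never prevents the message from reaching the local queue.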
3) Degrade Redis-dependent features explicitly
In redis_degraded mode, mark these capabilities degraded (not silently broken):
- Redis event bridge ingestion (joelclaw:notify:*)
- durable replay paths dependent on message store Redis streams
- Redis-based Telegram poll-owner lease durability (fall back to direct polling + existing 409 backoff)
- Redis-backed operational commands that require queue inspection/mutation
Each degraded feature must produce a clear status reason.
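A capability report along these lines could back the "clear status reason" requirement. The capability names and reason strings here are invented for illustration; only the four feature areas come from the list above.

```typescript
// Hypothetical sketch: capability names and reason strings are illustrative only.
type GatewayMode = "normal" | "redis_degraded";

interface CapabilityStatus {
  name: string;
  status: "ok" | "degraded";
  reason?: string; // present whenever status is "degraded"
}

// Redis-dependent features, each with the reason an operator would see
// while Redis is unavailable.
const REDIS_DEPENDENT: ReadonlyArray<{ name: string; reason: string }> = [
  { name: "event_bridge_ingestion", reason: "Redis event bridge (joelclaw:notify:*) is not attached" },
  { name: "durable_replay", reason: "message store Redis streams are unreachable" },
  { name: "telegram_poll_lease", reason: "Redis lease unavailable; using direct polling with 409 backoff" },
  { name: "queue_ops_commands", reason: "queue inspection/mutation requires Redis" },
];

function capabilityReport(mode: GatewayMode): CapabilityStatus[] {
  return REDIS_DEPENDENT.map(({ name, reason }): CapabilityStatus =>
    mode === "redis_degraded"
      ? { name, status: "degraded", reason }
      : { name, status: "ok" },
  );
}
```

Keeping the reasons in one table means status output and diagnose output cannot drift apart on wording.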
4) Self-healing recovery loop
Gateway must keep retrying Redis attach in background (bounded exponential backoff). On successful reattach:
- transition redis_degraded -> normal
- resume Redis-backed integrations
- emit recovery telemetry
No daemon restart should be required for recovery.
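The "bounded exponential backoff" for the reattach loop can be sketched as a pure delay function. The 500 ms base and 30000 ms cap are assumed defaults chosen for illustration, not values taken from the gateway, and `reconnectDelayMs` is a hypothetical name.

```typescript
// Hypothetical sketch: defaults are illustrative, not values from the gateway.
interface BackoffOptions {
  baseMs: number; // delay before the first retry
  maxMs: number; // upper bound on any single delay
}

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxMs.
function reconnectDelayMs(
  attempt: number,
  opts: BackoffOptions = { baseMs: 500, maxMs: 30000 },
): number {
  const exp = Math.min(Math.max(attempt, 0), 30); // keep 2^exp within safe integer range
  return Math.min(opts.maxMs, opts.baseMs * 2 ** exp);
}

// The first `count` delays of the schedule, useful for logging or tests.
function reconnectSchedule(count: number, opts?: BackoffOptions): number[] {
  return Array.from({ length: count }, (_, attempt) => reconnectDelayMs(attempt, opts));
}
```

With the assumed defaults the delays double from 500 ms and settle at the 30000 ms cap from the seventh attempt onward. A production loop would likely add jitter on top of this so that multiple instances do not retry in lockstep.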
5) Operator-visible diagnostics
joelclaw gateway status and joelclaw gateway diagnose must surface mode and degraded capabilities as first-class fields, not inferred from log scraping.
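As a rough sketch of what "first-class fields" could mean, the snippet below separates a Redis-degraded daemon from a dead one and renders mode and degraded capabilities explicitly. `DaemonHealth`, `diagnose`, and `renderStatus` are hypothetical names, and the output format is an assumption rather than the actual CLI output.

```typescript
// Hypothetical sketch: field and function names are assumptions for illustration.
type GatewayMode = "normal" | "redis_degraded";
type Diagnosis = "healthy" | "redis_degraded" | "process_down";

interface DaemonHealth {
  mode: GatewayMode;
  degradedCapabilities: string[];
}

// `health` is null when the daemon itself did not answer the health probe.
function diagnose(health: DaemonHealth | null): Diagnosis {
  if (health === null) return "process_down";
  return health.mode === "redis_degraded" ? "redis_degraded" : "healthy";
}

// Renders mode and degraded capabilities as explicit lines.
function renderStatus(health: DaemonHealth | null): string[] {
  const diagnosis = diagnose(health);
  if (health === null) return [`diagnosis: ${diagnosis}`];
  const lines = [`diagnosis: ${diagnosis}`, `mode: ${health.mode}`];
  for (const cap of health.degradedCapabilities) lines.push(`degraded: ${cap}`);
  return lines;
}
```

The key distinction is that a missing health response maps to a process failure, while a live daemon reporting redis_degraded never does.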
Implementation Plan (proposed)
Required skills preflight
- gateway: daemon lifecycle + operator contracts
- gateway-diagnose: failure-layer semantics and diagnostic UX
- system-architecture: substrate vs daemon boundary and event path clarity
Files and surfaces
- packages/gateway/src/daemon.ts
  - Introduce runtime mode state (normal / redis_degraded).
  - Make Redis channel start non-fatal; run local-only queue path when unavailable.
  - Add background reconnect loop and mode transitions.
- packages/gateway/src/channels/redis.ts
  - Expose explicit attach/detach/retry status hooks for daemon mode management.
  - Ensure failures are reportable as structured reasons.
- packages/gateway/src/channels/telegram.ts
  - Keep Redis lease as preferred path.
  - In degraded mode, use lease fallback path intentionally and emit explicit degraded ownership telemetry.
- packages/cli/src/commands/gateway.ts
  - Extend gateway status payload with mode + degraded capability list.
  - Ensure status remains meaningful even when Redis is down.
- packages/cli/src/commands/gateway.ts (diagnose path)
  - Add Redis-degraded classification distinct from full process failure.
- docs/gateway.md + skills/gateway-diagnose/SKILL.md
  - Document degraded-mode behavior, operator expectations, and recovery signals.
Verification criteria
- Gateway can start and process direct Telegram inbound/outbound while Redis is unavailable.
- gateway status returns explicit mode: redis_degraded with capability degradation list.
- OTEL emits mode transition events for both enter and recovery.
- Redis reconnect without restart transitions runtime back to normal.
- Existing normal-mode behavior remains unchanged when Redis is healthy.
Consequences
Good
- Operator can still use gateway during Redis incidents.
- Failures become explicit and diagnosable instead of ambiguous no-op states.
- Recovery path becomes automatic and observable.
Tradeoffs
- More runtime-state complexity in daemon.
- Some features remain unavailable in degraded mode by design.
- Requires careful UX in CLI/diagnose to avoid false confidence.
Non-goals
- Running full Inngest/event-bus durability without Redis.
- Guaranteeing cross-instance Telegram poll ownership durability while Redis is down.
- Solving k8s/Colima substrate outages in this ADR (this ADR covers gateway behavior under those outages, not the outages themselves).