ADR-0218

Status: accepted

Gateway Availability, Lifecycle, and Operator QoL Improvements

Context and Problem Statement

The gateway has improved quickly, but the last week of actual usage showed the same ugly failure classes repeating:

  1. Substrate coupling still makes the gateway look dead when Redis, Colima, or localhost wiring goes sideways.
  2. Prompt-only guardrails are not enough. They help, but they do not enforce availability-first behavior.
  3. Session pressure is still too opaque. We keep learning about compaction/rotation trouble after fallback thrash starts.
  4. Operator interaction paths still fail too silently. Buttons can toast “done” while nothing actually happened.
  5. Channel lifecycle is still too bespoke. Telegram got hardening because it hurt first; the same ownership/half-dead patterns should become gateway-wide contracts.

This ADR is not a fresh brainstorm. It is a ranked response to real incidents recorded in slog, memory, ADRs, and session evidence.

Evidence

Incident and usage evidence

  • [[../../system/log/1256-incident-gateway]] — runaway gateway session: 476 minutes, 57 commits, broken deploy, zero deploy verification, no check-in cadence.
  • [[../../system/log/1306-configure-gateway]] — disabled com.joel.gateway launchd job caused queue buildup and process-layer ambiguity.
  • [[../../system/log/1307-improve-gateway-cli]] — CLI had to learn exact launchd state and self-heal disabled jobs because status/restart semantics were ambiguous.
  • [[../../system/log/1308-fix-gateway]] — three silent failures stacked in the ADR pitch callback chain; buttons looked successful while the event path was dead.
  • [[../../system/log/1347-fix-gateway]] — thread explosion produced 169k-token context blowout and fallback cascade.
  • [[../../system/log/1353-fix-gateway]] — overnight fallback thrash: 92 fallback activations, 83 timeouts, 128 model swaps.
  • [[../../system/log/1365-fix-gateway]] — Telegram 409 loop made the gateway effectively noop on human messages.
  • [[../../system/log/1366-configure-gateway]] — callback routing had to be re-architected so the gateway owns the sole polling connection.
  • [[../../system/log/1372-fix-gateway]] — fresh session was the only true reset after context bloat.
  • [[../../system/log/1374-fix-gateway]] — two-tier context management (compact at 65%, rotate at 75%) was necessary because single-tier compaction at 85% was too late.
  • [[../../system/log/1384-fix-gateway]] — stale Redis port-forward caused OTEL timeouts and gateway thrash despite Redis being healthy behind Docker port mapping.
  • [[../../system/log/1385-configure-talon]] — Talon needed raw Redis + OTEL emit probes because this class of silent degradation was otherwise invisible until the gateway was already in trouble.
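The two-tier lifecycle recorded in 1374 (compact at 65%, rotate at 75%) can be sketched as a small decision function. This is a hedged illustration only; the function name and signature are invented, not the gateway's real API:

```typescript
// Two-tier session-pressure policy per 1374: compact early, rotate late.
// Thresholds match the evidence above; everything else is illustrative.
type PressureAction = "none" | "compact" | "rotate";

const COMPACT_THRESHOLD = 0.65;
const ROTATE_THRESHOLD = 0.75;

function sessionPressureAction(usedTokens: number, contextWindow: number): PressureAction {
  const ratio = usedTokens / contextWindow;
  if (ratio >= ROTATE_THRESHOLD) return "rotate";   // compaction is too late to help
  if (ratio >= COMPACT_THRESHOLD) return "compact"; // proactive compaction window
  return "none";
}
```

The point of the two tiers is that single-tier compaction at 85% fired only after fallback thrash had already started; splitting the thresholds gives compaction room to work before rotation becomes the only safe move.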

Existing decision inventory

  • [[0103-gateway-session-isolation]] correctly established that the gateway pi session is a scarce interactive resource.
  • [[0189-gateway-guardrails]] established the right operator discipline, but much of it still lives in prompt text instead of runtime enforcement.
  • [[0213-gateway-session-lifecycle-guards]] improved compaction/rotation behavior, but evidence from 2026-03-06 shows visibility and thresholds still matter.
  • [[0214-gateway-redis-degraded-mode]] is the clearest unfinished reliability decision on the table.

External pattern library

The OpenClaw research note [[../../Projects/openclaw/gateway-lifecycle-deep-dive]] surfaces five patterns worth importing:

  1. explicit degraded states instead of fake “healthy” ambiguity
  2. service/runtime/RPC status separation
  3. graceful shutdown/restart signalling
  4. accepted → stream → final lifecycle for long agent runs
  5. channel health checks that detect half-dead state, not just crashes
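Patterns 1 and 2 can be sketched as a status shape with explicit modes and separated service/runtime/RPC views. This is a hedged sketch; every type and field name here is illustrative, not OpenClaw's or the gateway's actual API:

```typescript
// Explicit runtime modes instead of a healthy/dead boolean, with service,
// runtime, and RPC status reported separately so a sick substrate cannot
// masquerade as "healthy". All names are illustrative assumptions.
type RuntimeMode = "normal" | "redis_degraded" | "starting" | "stopping";

interface GatewayStatus {
  service: "running" | "stopped";      // supervisor-level (launchd) view
  runtime: RuntimeMode;                // what the daemon says about itself
  rpcReachable: boolean;               // can we actually talk to it?
  degradedCapabilities: string[];      // what is lost in the current mode
}

function summarize(s: GatewayStatus): string {
  if (s.service !== "running") return "stopped";
  if (!s.rpcReachable) return "unreachable";
  if (s.runtime === "redis_degraded")
    return `degraded (lost: ${s.degradedCapabilities.join(", ")})`;
  return s.runtime;
}
```

The separation matters because "launchd says running", "daemon says degraded", and "RPC unreachable" are three different operator situations that a single boolean collapses into ambiguity.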

Decision

Prioritize gateway work by availability and operator QoL, not feature breadth.

For the next gateway reliability cycle, the ranked backlog is split into Immediate and Future improvements. Immediate items are the next things to ship or finish. Future items are worth doing, but only after the immediate tier stops biting us.

Ranked Improvements — Immediate

  1. Finish Redis-degraded mode with explicit capability downgrade
    • Why now: This is the highest-leverage unresolved reliability gap. The gateway should keep direct conversation alive when Redis is sick, and say exactly what is degraded.
    • Evidence: 1384, 1385, 0214
    • Desired outcome: Gateway remains operator-usable during Redis incidents; status shows mode=redis_degraded rather than pretending everything is fine.
    • Implementation note (2026-03-06): the first slice is now shipped in code — daemon health exposes mode, transition telemetry exists, gateway status falls back to daemon health, and gateway diagnose treats redis_degraded as degraded instead of dead. Remaining work is broadening degraded-capability handling beyond status/diagnostics.

  2. Turn gateway guardrails into runtime-enforced contracts, not just prompt text
    • Why now: Prompt discipline helped, but the 2026-03-01 freight-train incident proved that guidance without enforcement is brittle.
    • Evidence: 1256, 1251, 1261, 0189
    • Desired outcome: Automatic checkpoints, action-budget caps, and deploy-verification enforcement emit OTEL and stop runaway autonomy before damage compounds.
    • Implementation note (2026-03-06): first runtime slice is now shipped — the daemon sends a forced status checkpoint when a turn exceeds the tool-action budget, auto-schedules vercel ls --yes 2>&1 | head -10 after a successful deploy-sensitive git push, exposes live guardrails state in /health and gateway status, and gateway diagnose surfaces active checkpoint / deploy-verification findings.

  3. Make session pressure first-class in gateway status, notifications, and diagnostics
    • Why now: Context pressure, compaction age, rotation threshold, and thread count still become visible too late.
    • Evidence: 1347, 1353, 1372, 1374, 0213
    • Desired outcome: gateway status/diagnose surfaces context %, last compaction, session age, thread count, fallback streak, and next rotation/compaction threshold.
    • Implementation note (2026-03-06): first slice is now shipped — gateway status exposes thread counts, fallback state/activation/failure metadata, pressure reasons, and next-threshold summaries; gateway diagnose adds a dedicated session-pressure layer; and the daemon emits OTEL + Telegram alerts when pressure escalates or recovers.
    • Implementation note (2026-03-09): the next hardening slice now preflights prompt budget before queue dispatch: if the projected prompt size would land too close to the live model context ceiling, the daemon compacts first, and if session age or projected size still says “too risky,” it rotates to a fresh session with a compression summary before sending the queued prompt. The fallback controller also now treats the live session.model as primary truth, so recovery probes stop lying when the requested gateway config and the actual runtime model diverge.
    • Implementation note (2026-03-09, follow-up): restart continuity now reconciles resumed session model state back to the Redis-configured primary before fallback control initializes, so the gateway cannot silently keep running yesterday’s fallback/manual model after a daemon restart. Session-pressure context-window resolution also now falls back to the active model registry instead of a blind 200k default when the live session object omits that field.

  4. Ship interruptibility and supersession for human turns
    • Why now: Operators need the latest message to win. /stop, /esc, and channel-specific abort patches were useful, but the core queue model is still too stale-first.
    • Evidence: 0103, 0196, plus stop/esc fixes in slog (1247, 1249, 1253)
    • Desired outcome: New human messages can cancel or supersede stale work cleanly; interruption becomes a contract, not a bag of per-channel patches.
    • Implementation note (2026-03-06): runtime slices are now shipped — direct human turns across Telegram, Discord, iMessage, and Slack invoke paths use latest-wins supersession keyed by source; stale queued prompts are dropped; the daemon aborts stale active turns; stale late responses are suppressed; a short 1.5s batching window now collapses rapid follow-ups before dispatch; gateway status exposes supersession plus batching state; and gateway diagnose adds an interruptibility layer. Passive intel/background routes still bypass the human batching path, and downstream completion acks for longer operator workflows remain open.
    • Implementation note (2026-03-09, follow-up): durable queue replay now treats stream identity as part of the latest-wins contract, so the freshest persisted human message cannot self-supersede on dequeue/replay. Supersession still drops genuinely older same-source work, but replaying the newest Telegram/Discord/iMessage/Slack human turn is no longer allowed to discard itself as stale.

  5. Add end-to-end ACK/timeout tracing for operator actions and callback chains
    • Why now: Silent failure is still the ugliest bug class. The operator should never see a button toast without a real path trace.
    • Evidence: 1308, 1366
    • Desired outcome: Every callback/command gets a trace id, hop-level logs, explicit ack, timeout, and failure surface. No more fake-success UX.
    • Implementation note (2026-03-06): runtime slices are now shipped in three steps. First, Telegram operator callback paths (cmd:*, worktree:*, pitch:*, default callback actions, and external callback-route handoffs) plus direct Telegram slash commands and native /stop /esc /kill commands began emitting trace ids with kind=callback|command, gateway status exposed canonical operatorTracing (with callbackTracing kept as a compatibility alias), gateway diagnose gained an operator-tracing layer, and timeout/failure paths started sending explicit Telegram follow-ups. Second, queued Telegram agent commands now carry their trace id through the active gateway turn and complete/fail on downstream turn completion, prompt failure, assistant error, or supersession instead of lying at enqueue time. Third, externally routed callback chains can now report completed / failed back through a Redis trace-result handoff, and the in-tree Restate Telegram route is wired to close the original trace on real downstream resolution instead of publish-time fiction. Any out-of-tree external callback consumer that does not adopt the handoff will still time out as untracked work.

  6. Generalize channel ownership and half-dead health contracts beyond Telegram
    • Why now: Telegram hurt first, so Telegram got the fixes. The same problem class can hit Slack, Discord, iMessage, and future channels.
    • Evidence: 1360, 1365, 1366, 1333, 1334, 1337
    • Desired outcome: Single-owner semantics, stale-connection detection, and health endpoints become reusable channel contracts instead of Telegram-only heroics.
    • Implementation note (2026-03-06): runtime slices are now shipped in five steps. First, daemon status exposed a canonical channels surface with reusable runtime health/ownership snapshots for Telegram, Discord, iMessage, and Slack; /health components began reflecting per-channel contract state instead of a Telegram-only boolean; and gateway diagnose gained a dedicated channel-health layer so passive/fallback/half-dead states are visible before a full outage. Second, the daemon began detecting configured channel degrade/recover transitions, emitting OTEL under daemon.channel-health, sending immediate Telegram alerts unless the channel is muted as a known issue, and surfacing summarized channelHealth alert state in gateway status / gateway diagnose. Third, channelHealth now carries explicit heal policy state (restart / manual / none), gateway diagnose adds channel-healing, and the watchdog can attempt guarded restarts for restart-eligible degraded channels while leaving ownership/lease conflicts visible as manual/operator work. Fourth, manual-policy degradations now carry explicit manualRepairRequired, manualRepairSummary, and manualRepairCommands; gateway status/gateway diagnose surface those operator steps directly; and Telegram retrying getUpdates conflicts no longer read as healthy fallback when polling is actually down. Fifth, degraded channels that are muted as known issues now also flip to manual with the mute reason as repair guidance instead of falsely advertising a restart policy that the watchdog suppresses while muted. Stricter ownership enforcement beyond Telegram and richer/native repair automation beyond CLI-guided operator steps remain open.
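The latest-wins supersession contract described in the notes above can be sketched roughly as follows. The class and method names are hypothetical, not the daemon's real queue API; the point is the strict-ordering rule that prevents replay self-supersession:

```typescript
// Hypothetical latest-wins queue keyed by source channel. A newer human
// message from the same source supersedes older queued work; strict "<"
// ensures the freshest persisted message never discards itself on replay.
class SupersessionQueue {
  private latest = new Map<string, { seq: number; prompt: string }>();
  private seq = 0;

  enqueue(source: string, prompt: string): number {
    const id = ++this.seq;
    this.latest.set(source, { seq: id, prompt }); // newest message wins
    return id;
  }

  isStale(source: string, seq: number): boolean {
    const current = this.latest.get(source);
    return current !== undefined && seq < current.seq;
  }
}
```

Using strict inequality is what the 2026-03-09 follow-up fixed: a `<=` comparison during dequeue/replay would let the newest persisted turn mark itself stale.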

Implementation note (2026-03-10, overnight spam follow-up): low-signal operator spam is now clamped harder at the gateway boundary. restate / restate/* sources count as automation for batching, internal test.gateway-e2e probes are suppressed from operator delivery, heartbeat-only / queue-dispatch-complete-only digests are dropped instead of prompting the model for pointless acknowledgements, and routine fallback swap/recovery notices stop paging Telegram during quiet hours (with recovery notices downgraded to log/OTEL-only unless some higher-signal path escalates them). This keeps ADR-0189’s routing contract honest: high-signal failure states still surface, but low-value success churn no longer burns operator attention or drives overnight provider flap loops.

Implementation note (2026-03-12, signal-pipeline follow-up): operator relay now has a canonical signal-policy surface instead of separate Slack/email vibe checks. Joel-authored non-mention Slack channel traffic is pushed into the Redis event bridge as slack.signal.received, packages/gateway/src/operator-relay.ts owns normalize → score → correlate → route heuristics, vip.email.received is immediate by policy, and lower-signal Slack/email items can batch into a correlated digest grouped by project/contact/conversation keys. Non-heartbeat operator relay also strips leaked HEARTBEAT_OK prefixes before Telegram delivery, so passive project signal stops reading like gateway logs.
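The normalize → score → correlate → route shape described above can be sketched as below. The scores, thresholds, and type names here are invented for illustration; they are not the real operator-relay heuristics, and only the "vip.email.received is immediate by policy" rule comes from the note:

```typescript
// Illustrative signal-policy pipeline: score each signal, route high-signal
// items immediately, and correlate the rest into per-project digest groups.
interface Signal { source: "slack" | "email"; kind: string; project: string; body: string; }
type Route = "immediate" | "digest";

function score(s: Signal): number {
  if (s.kind === "vip.email.received") return 1.0; // immediate by policy
  return s.source === "email" ? 0.6 : 0.3;         // assumed baseline scores
}

function route(s: Signal): Route {
  return score(s) >= 0.9 ? "immediate" : "digest"; // assumed cutoff
}

// Batch lower-signal items into a digest grouped by project key.
function correlate(signals: Signal[]): Map<string, Signal[]> {
  const groups = new Map<string, Signal[]>();
  for (const s of signals.filter((x) => route(x) === "digest")) {
    const group = groups.get(s.project) ?? [];
    group.push(s);
    groups.set(s.project, group);
  }
  return groups;
}
```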

Implementation note (2026-03-12, control-loop hysteresis follow-up): the next failure mode was not just noisy relays but noisy control loops. Session-pressure transitions now page Telegram only when critical; elevated / recovered states stay in CLI/OTEL. Session recycle notices are no longer operator-facing by default, quiet-hours direct Knowledge Watchdog Alert messages are suppressed at the gateway edge, proactive compaction now has hysteresis (30m cooldown unless context meaningfully worsens), and fallback recovery requires a minimum dwell on the fallback model before probing primary again. The contract shift is deliberate: graceful autonomous transitions are preferred over narrating every internal correction to the operator.
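The compaction hysteresis above can be sketched as a small predicate. The 30-minute cooldown is from the note; the 5-point "meaningfully worsened" delta and the function shape are assumptions:

```typescript
// Hysteresis for proactive compaction: after compacting, wait out a cooldown
// unless context pressure has meaningfully worsened since the last run.
const COOLDOWN_MS = 30 * 60 * 1000;   // 30m, per the note above
const MEANINGFUL_DELTA = 0.05;        // assumed: +5 points counts as "worse"

function shouldCompact(
  now: number,
  lastCompactionAt: number,
  ratioAtLastCompaction: number,
  ratioNow: number,
): boolean {
  if (now - lastCompactionAt >= COOLDOWN_MS) return true; // cooldown expired
  return ratioNow - ratioAtLastCompaction >= MEANINGFUL_DELTA;
}
```

The same dwell idea applies to fallback recovery: require a minimum time on the fallback model before probing primary, so one flap cannot trigger an immediate probe loop.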

Implementation note (2026-03-12, monitor-correctness follow-up): fallback timing also needed a correctness fix, not just threshold tuning. Gateway transcripts showed aborted assistant turns with zero tokens followed by absurd multi-minute prompt.latency readings on the next successful turn, which meant stale fallback timing state could survive an aborted message_end. The runtime now clears fallback timeout state immediately on empty/aborted message_end events instead of assuming a later turn_end will always clean up. This keeps ADR-0218 honest: observability should describe reality, not smear one failed turn across the next one.
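The shape of that fix can be illustrated as below. Event and field names are assumptions; the invariant is the one the note states — timing state is cleared on an empty/aborted message_end rather than surviving into the next turn:

```typescript
// Sketch: clear fallback timing state immediately on empty/aborted
// message_end events, so one failed turn's clock never smears its
// latency into the next successful turn.
interface MessageEnd { tokens: number; aborted: boolean; }

class FallbackTimer {
  private startedAt: number | null = null;

  onPromptStart(now: number): void { this.startedAt = now; }

  // Returns prompt latency for a real completion, or null for aborted/empty.
  onMessageEnd(now: number, ev: MessageEnd): number | null {
    if (ev.aborted || ev.tokens === 0) {
      this.startedAt = null; // stale timing must not survive the abort
      return null;
    }
    const latency = this.startedAt === null ? null : now - this.startedAt;
    this.startedAt = null;
    return latency;
  }
}
```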

Implementation note (2026-03-12, maintenance-accounting follow-up): graceful transitions also needed explicit maintenance state, not just lower alert volume. Compaction/rotation windows now register as first-class daemon maintenance (daemon.maintenance.started|completed|failed), gateway status exposes when the session is compacting/rotating instead of merely “waiting for turn_end,” and the idle waiter now extends in bounded 60s slices while maintenance is genuinely active before timing out at a 15m aggregate ceiling. This borrows OpenClaw’s “busy is not failure” posture: the watchdog should treat real maintenance work as work, not as proof the session is dead.
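The bounded extension logic can be sketched as a pure helper. The 60s slice and 15m aggregate ceiling are from the note above; the helper name and signature are invented:

```typescript
// "Busy is not failure": while maintenance is active, extend the idle wait
// in bounded 60s slices, but never past a 15-minute aggregate ceiling.
const SLICE_MS = 60_000;
const CEILING_MS = 15 * 60_000;

// Returns the next wait extension in ms, or 0 when waiting should stop.
function nextWaitSlice(elapsedMs: number, maintenanceActive: boolean): number {
  if (!maintenanceActive) return 0;                  // nothing to wait for
  if (elapsedMs >= CEILING_MS) return 0;             // aggregate ceiling hit
  return Math.min(SLICE_MS, CEILING_MS - elapsedMs); // bounded extension
}
```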

Implementation note (2026-03-12, idle-maintenance follow-up): maintenance also needed to be autonomous when the gateway is quiet. The previous slice modeled compaction/rotation honestly once they were already running, but an idle session could still age into compaction_gap / session_age pressure and do nothing until another turn happened. The watchdog now evaluates idle session-pressure state and triggers the same maintenance lifecycle for time-based pressure: overdue compaction runs without waiting for another message, and age-triggered rotation can happen before the next inbound turn. The contract change is explicit: daemon.maintenance.* is no longer only a post-turn story; idle runtime maintenance is part of ADR-0218’s availability behavior.

Ranked Improvements — Future

  F1. Adopt accepted → stream → final lifecycle for long-running gateway jobs
    • Why later: Valuable for richer operator UX, but it depends on the immediate availability work first.
    • OpenClaw pattern worth stealing: agent accepted + agent.wait run model

  F2. Broadcast explicit shutdown/restart events to connected clients
    • Why later: Useful once joelclaw has more first-class WS/native surfaces.
    • OpenClaw pattern worth stealing: shutdown event with restartExpectedMs

  F3. Separate service state, daemon health, channel health, and session pressure in one canonical status surface
    • Why later: Pieces exist already; future work is to make the status model unambiguous and typed end-to-end.
    • OpenClaw pattern worth stealing: gateway status split between supervisor and RPC reachability

  F4. Add a richer control-plane protocol for operator clients
    • Why later: Worth doing after we stop fighting substrate incidents.
    • OpenClaw pattern worth stealing: method/event inventory, agent.wait, node/channel namespaces

  F5. Add first-class client identity / challenge-response for richer clients
    • Why later: Future-facing for native/websocket surfaces, not today’s biggest pain.
    • OpenClaw pattern worth stealing: connect.challenge, nonce-bound device identity

  F6. Codify channel account lifecycle as a reusable manager contract
    • Why later: Important if channel count grows; lower urgency than keeping the current gateway boring and stable.
    • OpenClaw pattern worth stealing: per-account monitor lifecycle + channel health monitor

What this ADR changes right now

1. Reliability priority order

For gateway work, the order is now:

  1. availability
  2. diagnosability
  3. interruptibility
  4. operator clarity
  5. feature growth

Any proposed gateway feature that increases ambiguity, queue staleness, or hidden coupling loses priority to the immediate tier above.

2. ADR grooming decisions

  • [[0189-gateway-guardrails]] should be treated as accepted / partially implemented, not “fully shipped.”
  • [[0196-cancel-on-new-message]] is now accepted / partially implemented — the latest-wins queue/supersession contract landed for Telegram human turns, while batching windows and broader channel rollout remain open.
  • [[0214-gateway-redis-degraded-mode]] is promoted from “good idea” to top immediate reliability priority.

Implementation Plan

Required skills preflight

  • gateway — runtime behavior, daemon contracts, channel lifecycle
  • gateway-diagnose — failure-layer semantics and operator diagnostics
  • telegram — human interruptibility and callback routing on the primary mobile channel
  • system-architecture — substrate boundaries (Redis, worker, launchd, Talon, k8s)
  • o11y-logging — explicit lifecycle telemetry and silent-failure prevention
  • cli-design — status/diagnose/control-plane command UX
  • adr-skill — follow-up ADR grooming as immediate items ship

Phase 1 — close the worst ambiguity

  1. Ship [[0214-gateway-redis-degraded-mode]].
  2. Extend gateway status / gateway diagnose to report:
    • runtime mode
    • degraded capabilities
    • context/session pressure
    • channel-owner state
  3. Emit explicit OTEL mode transitions and guardrail hits.

Phase 2 — make availability-first enforceable

  1. Convert the remaining critical 0189 guardrails from prompt guidance to runtime checks.
  2. Add automatic operator-checkpoint enforcement for long delegated work.
  3. Add a hard deploy-verification tripwire for gateway-initiated web work.
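The action-budget side of these checks can be sketched as a tiny stateful guard. The budget value, callback shape, and class name are all assumptions for illustration, not the daemon's actual guardrail API:

```typescript
// Sketch of an action-budget guardrail: count tool actions per turn and
// force a single operator checkpoint once the budget is exceeded.
class ActionBudget {
  private actions = 0;
  private checkpointFired = false;

  constructor(
    private readonly budget: number,
    private readonly onCheckpoint: () => void, // e.g. forced status message + OTEL event
  ) {}

  recordAction(): void {
    this.actions += 1;
    if (this.actions > this.budget && !this.checkpointFired) {
      this.checkpointFired = true; // fire once per turn, not once per action
      this.onCheckpoint();
    }
  }

  resetTurn(): void {
    this.actions = 0;
    this.checkpointFired = false;
  }
}
```

Firing exactly once per turn is the design choice worth keeping: the checkpoint should interrupt runaway autonomy without itself becoming a spam source.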

Phase 3 — make the operator path interruptible and truthful

  1. Implement the chosen version of 0196 cancel/supersession behavior for human turns.
  2. Add callback/command trace ids with end-to-end ack + timeout semantics.
  3. Standardize stop/esc/abort behavior across Telegram, iMessage, and future channels.
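The ack/timeout contract in step 2 can be sketched as a trace table. This is a hedged sketch; the id format, states, and sweep mechanism are assumptions, not the gateway's real tracing implementation:

```typescript
// Every operator action opens a trace that must resolve to acked/failed
// before its deadline, or a sweep marks it timed_out — never silent success.
type TraceState = "pending" | "acked" | "failed" | "timed_out";

class TraceTable {
  private traces = new Map<string, { state: TraceState; deadline: number }>();
  private next = 0;

  open(now: number, timeoutMs: number): string {
    const id = `trace-${++this.next}`;
    this.traces.set(id, { state: "pending", deadline: now + timeoutMs });
    return id;
  }

  resolve(id: string, state: "acked" | "failed"): void {
    const t = this.traces.get(id);
    if (t && t.state === "pending") t.state = state; // resolutions are final
  }

  // Run on a timer: anything still pending past its deadline is surfaced.
  sweep(now: number): string[] {
    const expired: string[] = [];
    for (const [id, t] of this.traces) {
      if (t.state === "pending" && now >= t.deadline) {
        t.state = "timed_out";
        expired.push(id);
      }
    }
    return expired;
  }
}
```

Each expired id is exactly the "button toasted success but nothing happened" case: instead of vanishing, it becomes an explicit timeout the operator can see.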

Phase 4 — generalize channel lifecycle contracts

  1. Lift Telegram ownership/health semantics into reusable channel contracts.
  2. Add half-dead detection rules and restart policies for every long-lived channel.
  3. Keep Talon probes aligned with channel health endpoints.

Verification

  • gateway status can distinguish normal vs redis_degraded and list degraded capabilities.
  • Gateway direct conversation remains alive during a Redis outage.
  • Guardrail hits (runaway autonomy, deploy verification stop, forced checkpoint) emit OTEL events.
  • Session pressure fields are visible in status/diagnose before fallback thrash begins.
  • A human follow-up message can cancel/supersede stale work according to a defined contract.
  • Callback buttons and operator commands have traceable ack/timeout behavior; no fake-success toasts remain.
  • At least one non-Telegram channel uses the generalized ownership/half-dead contract.

Consequences

Positive

  • Gateway work gets prioritized around the failures we are actually having.
  • Reliability work stops competing evenly with shiny feature work.
  • Existing ADRs become easier to reason about because the top unfinished items are ranked, not vague.
  • OpenClaw remains a pattern library, not a replacement system.

Negative

  • Some attractive future features move down the queue.
  • More gateway runtime state becomes explicit and therefore has to be maintained honestly.
  • “Fix it in the prompt” stops being an acceptable answer for recurrent availability bugs.

Non-goals

  • Replacing joelclaw with OpenClaw.
  • Rewriting the gateway into a single giant OpenClaw-style monolith.
  • Shipping native-client pairing/device identity before the immediate reliability tier lands.
  • Treating every proposed gateway feature as urgent just because it is interesting.

More Information

Primary external research note:

  • [[../../Projects/openclaw/gateway-lifecycle-deep-dive]]

Primary local incident chain:

  • [[../../system/log/1256-incident-gateway]]
  • [[../../system/log/1308-fix-gateway]]
  • [[../../system/log/1353-fix-gateway]]
  • [[../../system/log/1365-fix-gateway]]
  • [[../../system/log/1384-fix-gateway]]