ADR-0213 (accepted)

Gateway Session Lifecycle Guards

Context

On 2026-03-05, the gateway entered a fallback thrash loop overnight:

  • Last compaction: 02:02 UTC
  • Thrash started: 03:28 UTC (about 1.5h after the last compaction)
  • Duration: ~11 hours continuous thrash
  • 92 fallback activations, 83 timeouts, 128 model swaps
  • Session: 3 days old (since Mar 2), 11 MB, 5969 entries
  • 0 compactions in 12 hours despite 281 new entries

Root cause chain:

  1. Session running 3 days without restart → massive JSONL history
  2. Last compaction at 02:02, then nothing for 12h (model_change entries from fallback swaps may have disrupted compaction threshold calculation)
  3. Growing context → Opus first-token latency exceeded 120s timeout
  4. Fallback activates → model_change entries added → context grows further (positive feedback loop)
  5. Heartbeat/cron events kept firing prompts into bloated session every ~30 min

The existing ADR-0141 proactive health check only triggers on token usage >85%. It has no concept of compaction freshness or session age.

Decision

Add three lifecycle guards to the gateway daemon:

1. Compaction Circuit Breaker (4-hour max gap)

After every turn_end, check the time since the last compaction. If it exceeds 4 hours, force session.compact() regardless of token count. This prevents the scenario seen in the incident, where 281 entries accumulated without a compaction.

  • Track lastCompactionAt, initialized from session history on resume
  • Updated after every successful compact() call
  • Fires in doHealthCheck() before the token-based check
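
A minimal sketch of the circuit-breaker check, assuming history entries carry a type and an epoch-millisecond timestamp. The HistoryEntry shape, the "compaction" entry type, and the helper names initLastCompactionAt and compactionOverdue are hypothetical and only illustrate the rule; the real daemon code may differ.

```typescript
// Hypothetical shape of a session history entry; the real JSONL schema may differ.
interface HistoryEntry {
  type: string; // e.g. "compaction" (assumed name), "model_change", "message"
  timestamp: number; // epoch milliseconds
}

const MAX_COMPACTION_GAP_MS = 4 * 60 * 60 * 1000; // 4-hour max gap from this ADR

// Recover lastCompactionAt from history on resume by scanning backwards for
// the most recent compaction entry. Returning the supplied fallback when no
// such entry exists is an assumption of this sketch.
function initLastCompactionAt(history: HistoryEntry[], fallback: number): number {
  for (let i = history.length - 1; i >= 0; i--) {
    if (history[i].type === "compaction") return history[i].timestamp;
  }
  return fallback;
}

// True when the gap since the last compaction exceeds the limit,
// independent of token usage.
function compactionOverdue(lastCompactionAt: number, now: number): boolean {
  return now - lastCompactionAt > MAX_COMPACTION_GAP_MS;
}
```

In doHealthCheck(), a check like compactionOverdue(lastCompactionAt, Date.now()) would run first and, when true, trigger session.compact() and update lastCompactionAt on success. With the incident's last compaction at 02:02 UTC, this sketch would first return true on a turn_end after 06:02 UTC.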

2. Session Age Limit (24-hour max)

After every turn_end, check the session age. If it exceeds 24 hours, create a new session seeded with a compression summary of recent activity.

  • Track sessionCreatedAt at daemon startup
  • Uses existing buildCompressionSummary() for context continuity
  • Alerts via Telegram (silent) when triggered
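
The age guard could be sketched as follows. Only the names buildCompressionSummary and newSession come from this ADR; their signatures here are assumptions, notifySilently is a hypothetical stand-in for the silent Telegram alert, and the dependencies are injected so the sketch stays self-contained.

```typescript
const MAX_SESSION_AGE_MS = 24 * 60 * 60 * 1000; // 24-hour max from this ADR
const HOUR_MS = 60 * 60 * 1000;

// Assumed dependency signatures; the real functions may take other arguments.
interface RotationDeps {
  buildCompressionSummary: () => Promise<string>;
  newSession: (summary: string) => Promise<void>;
  notifySilently: (text: string) => Promise<void>; // hypothetical alert helper
}

interface SessionState {
  sessionCreatedAt: number; // epoch milliseconds, set at daemon startup
}

// True when the session is older than the 24-hour limit.
function sessionExpired(state: SessionState, now: number): boolean {
  return now - state.sessionCreatedAt > MAX_SESSION_AGE_MS;
}

// Rotates to a fresh session when the limit is exceeded.
// Resolves to true if a rotation happened, false otherwise.
async function rotateIfExpired(
  state: SessionState,
  now: number,
  deps: RotationDeps,
): Promise<boolean> {
  if (!sessionExpired(state, now)) return false;
  const ageHours = Math.floor((now - state.sessionCreatedAt) / HOUR_MS);
  // Summarize recent activity first so the new session keeps continuity.
  const summary = await deps.buildCompressionSummary();
  await deps.newSession(summary);
  state.sessionCreatedAt = now; // restart the age clock for the fresh session
  await deps.notifySilently(`Session rotated after ${ageHours}h`);
  return true;
}
```

Resetting sessionCreatedAt inside the guard keeps the check from firing again on the next turn_end. Whether the alert should be sent before or after the rotation is a design choice this sketch does not settle.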

3. Quiet Hours Auto-Batching (11 PM – 7 AM PST)

During quiet hours:

  • Non-interactive events (automation, default-fallback) are batched instead of processed immediately
  • Batch digest flush is deferred until wake hours
  • Interactive events (telegram, imessage, slack, discord messages) still process immediately
  • Degradation/error events still process immediately

This prevents token burn on events nobody is watching, and reduces the prompt volume that drives context growth overnight.
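
The quiet-hours rules above can be sketched as a small triage function. The GatewayEvent shape, the isError flag, and the triage and shouldFlushDigest names are hypothetical. Using the IANA zone America/Los_Angeles is also an assumption: it follows daylight saving time, so if a fixed PST (UTC-8) offset is intended, the hour calculation would need to change.

```typescript
// Hypothetical event shape; the real gateway events may carry other fields.
interface GatewayEvent {
  source: string; // e.g. "telegram", "automation", "default-fallback"
  isError?: boolean; // assumed flag for degradation/error events
}

type Disposition = "immediate" | "batch" | "default";

const INTERACTIVE_SOURCES = new Set(["telegram", "imessage", "slack", "discord"]);

// Hour of day (0-23) in Pacific time, via the built-in Intl API.
function pacificHour(date: Date): number {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone: "America/Los_Angeles", // assumption: DST-aware Pacific time
    hour: "numeric",
    hourCycle: "h23",
  }).formatToParts(date);
  const hour = parts.find((p) => p.type === "hour");
  return Number(hour?.value ?? "0") % 24;
}

// Quiet hours run from 11 PM to 7 AM, wrapping past midnight.
function isQuietHours(date: Date): boolean {
  const h = pacificHour(date);
  return h >= 23 || h < 7;
}

// Errors and interactive messages are never delayed. Other events are
// batched during quiet hours; outside them, "default" stands in for
// whatever the existing triage rules decide.
function triage(event: GatewayEvent, now: Date): Disposition {
  if (event.isError) return "immediate";
  if (INTERACTIVE_SOURCES.has(event.source)) return "immediate";
  return isQuietHours(now) ? "batch" : "default";
}

// The batch digest flush is deferred until wake hours.
function shouldFlushDigest(now: Date): boolean {
  return !isQuietHours(now);
}
```

For example, an automation event arriving at 10:00 UTC on 2026-03-05 (2 AM Pacific) would be batched under this sketch, while a telegram message at the same moment would still be processed immediately.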

Implementation

packages/gateway/src/daemon.ts:

  • lastCompactionAt, sessionCreatedAt tracking variables
  • Compaction circuit breaker in doHealthCheck() (runs after every turn_end)
  • Session age guard in doHealthCheck() (creates fresh session via newSession())

packages/gateway/src/channels/redis.ts:

  • isQuietHours() utility (PST timezone)
  • Quiet hours triage rule in event drain loop
  • Quiet hours check in flushBatchDigest()

Consequences

  • Compaction gap is bounded: maximum 4 hours between compactions, preventing context bloat
  • Session freshness is bounded: maximum 24 hours, preventing multi-day JSONL growth
  • Overnight token burn reduced: quiet hours batching prevents unnecessary prompts
  • Interactive messages always process: human messages are never delayed by quiet hours
  • Existing token-based health check preserved: ADR-0141 still fires for acute spikes