Gateway Session Lifecycle Guards
Context
On 2026-03-05, the gateway entered a fallback thrash loop overnight:
- Last compaction: 02:02 UTC
- Thrash started: 03:28 UTC (1.5h later)
- Duration: ~11 hours continuous thrash
- 92 fallback activations, 83 timeouts, 128 model swaps
- Session age: 3 days (since Mar 2), 11 MB, 5969 entries
- 0 compactions in 12 hours despite 281 new entries
Root cause chain:
- Session running 3 days without restart → massive JSONL history
- Last compaction at 02:02, then nothing for 12h (model_change entries from fallback swaps may have disrupted compaction threshold calculation)
- Growing context → Opus first-token latency exceeded the 120s timeout
- Fallback activates → model_change entries added → context grows further (positive feedback loop)
- Heartbeat/cron events kept firing prompts into the bloated session every ~30 min
The existing ADR-0141 proactive health check only triggers on token usage >85%. It has no concept of compaction freshness or session age.
Decision
Add three lifecycle guards to the gateway daemon:
1. Compaction Circuit Breaker (4-hour max gap)
After every turn_end, check time since last compaction. If >4 hours, force session.compact() regardless of token count. This prevents the scenario where 281 entries accumulate without compaction.
- Track lastCompactionAt, initialized from session history on resume
- Updated after every successful compact() call
- Fires in doHealthCheck() before the token-based check
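The circuit breaker logic can be sketched roughly as follows. This is a minimal illustration, not the daemon's actual code: the SessionEntry shape, the "compaction" entry type, and the helper names findLastCompactionAt and compactionOverdue are assumptions made for the example. Only the 4-hour limit and the idea of recovering lastCompactionAt from history come from this ADR.

```typescript
// Hypothetical entry shape; the real JSONL schema may differ.
interface SessionEntry {
  type: string;      // e.g. "message", "model_change", "compaction" (assumed names)
  timestamp: number; // epoch milliseconds
}

// 4-hour maximum gap between compactions, as decided in this ADR.
const MAX_COMPACTION_GAP_MS = 4 * 60 * 60 * 1000;

// On resume, recover lastCompactionAt by scanning history from the newest
// entry backwards. If no compaction entry exists, use the supplied fallback
// (for example, the session start time).
function findLastCompactionAt(entries: SessionEntry[], fallback: number): number {
  for (let i = entries.length - 1; i >= 0; i--) {
    if (entries[i].type === "compaction") {
      return entries[i].timestamp;
    }
  }
  return fallback;
}

// True when the time since the last compaction exceeds the limit.
// Token usage is deliberately not an input here.
function compactionOverdue(lastCompactionAt: number, now: number): boolean {
  return now - lastCompactionAt > MAX_COMPACTION_GAP_MS;
}
```

Because the check depends only on elapsed time, extra model_change entries cannot suppress it the way they may have affected the threshold calculation during the incident.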
2. Session Age Limit (24-hour max)
After every turn_end, check session age. If >24 hours, create a new session with a compression summary of recent activity.
- Track sessionCreatedAt at daemon startup
- Uses existing buildCompressionSummary() for context continuity
- Alerts via Telegram (silent) when triggered
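A sketch of the age guard, under stated assumptions: SessionLike and rotateIfExpired are hypothetical names, and the summarize callback stands in for buildCompressionSummary(), whose real signature is not given in this ADR. The Telegram alert is omitted.

```typescript
// 24-hour maximum session age, as decided in this ADR.
const MAX_SESSION_AGE_MS = 24 * 60 * 60 * 1000;

// Hypothetical minimal view of a session for this example.
interface SessionLike {
  createdAt: number;          // epoch milliseconds
  seedSummary: string | null; // summary carried over from the previous session
}

// Returns the same session while it is within the age limit. Otherwise
// returns a fresh session seeded with a summary of recent activity.
// `summarize` is a stand-in for buildCompressionSummary().
function rotateIfExpired(
  current: SessionLike,
  now: number,
  summarize: () => string
): SessionLike {
  if (now - current.createdAt <= MAX_SESSION_AGE_MS) {
    return current;
  }
  return { createdAt: now, seedSummary: summarize() };
}
```

Note that the fresh session carries its own createdAt, so the age clock restarts on rotation rather than only at daemon startup.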
3. Quiet Hours Auto-Batching (11 PM – 7 AM PST)
During quiet hours:
- Non-interactive events (automation, default-fallback) are batched instead of immediate
- Batch digest flush is deferred until wake hours
- Interactive events (telegram, imessage, slack, discord messages) still process immediately
- Degradation/error events still process immediately
This prevents token burn on events nobody is watching, and reduces the prompt volume that drives context growth overnight.
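The triage rule above could look something like the sketch below. Several details are assumptions for illustration: the GatewayEvent shape, the isDegradation flag, the triageDuringQuietHours name, and the use of the America/Los_Angeles zone (which follows daylight saving time, whereas the ADR says PST). Outside quiet hours the sketch returns "default" to mean that the existing triage rules apply unchanged.

```typescript
type Triage = "immediate" | "batch" | "default";

// Hypothetical event shape for this example.
interface GatewayEvent {
  source: string;          // e.g. "telegram", "automation", "default-fallback"
  isDegradation?: boolean; // assumed flag for degradation/error events
}

// Interactive sources named in this ADR.
const INTERACTIVE_SOURCES = ["telegram", "imessage", "slack", "discord"];

// Hour of day (0-23) in the given IANA time zone.
function localHour(date: Date, timeZone: string): number {
  const text = date.toLocaleString("en-US", {
    timeZone: timeZone,
    hour: "numeric",
    hour12: false,
  });
  // Some runtimes render midnight as "24"; normalize it to 0.
  return parseInt(text, 10) % 24;
}

// Quiet hours run from 11 PM (inclusive) to 7 AM (exclusive) local time.
function isQuietHours(date: Date, timeZone: string = "America/Los_Angeles"): boolean {
  const hour = localHour(date, timeZone);
  return hour >= 23 || hour < 7;
}

// Degradation and interactive events are never delayed; everything else
// is batched during quiet hours.
function triageDuringQuietHours(event: GatewayEvent, date: Date): Triage {
  if (event.isDegradation) return "immediate";
  if (INTERACTIVE_SOURCES.indexOf(event.source) !== -1) return "immediate";
  return isQuietHours(date) ? "batch" : "default";
}
```

The same isQuietHours() predicate can gate the batch digest flush, so a digest assembled overnight is held until 7 AM.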
Implementation
packages/gateway/src/daemon.ts:
- lastCompactionAt, sessionCreatedAt tracking variables
- Compaction circuit breaker in doHealthCheck() (runs after every turn_end)
- Session age guard in doHealthCheck() (creates fresh session via newSession())
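One possible ordering of the guards inside doHealthCheck() is sketched below as a pure decision function. The HealthInputs shape, the action names, and decideHealthAction itself are hypothetical. Placing the age guard first is also an assumption (a fresh session makes compaction of the old one moot); the ADR only states that the circuit breaker runs before the token-based check.

```typescript
type HealthAction = "rotate_session" | "force_compact" | "token_check" | "none";

// Hypothetical snapshot of the values doHealthCheck() would consult.
interface HealthInputs {
  now: number;              // epoch milliseconds
  sessionCreatedAt: number; // epoch milliseconds
  lastCompactionAt: number; // epoch milliseconds
  tokenUsageRatio: number;  // 0.0 to 1.0
}

const AGE_LIMIT_MS = 24 * 60 * 60 * 1000; // session age limit from this ADR
const GAP_LIMIT_MS = 4 * 60 * 60 * 1000;  // compaction gap limit from this ADR
const TOKEN_THRESHOLD = 0.85;             // ADR-0141 trigger point

function decideHealthAction(s: HealthInputs): HealthAction {
  // Assumed order: age guard first, then the circuit breaker,
  // then the existing token-based check.
  if (s.now - s.sessionCreatedAt > AGE_LIMIT_MS) return "rotate_session";
  if (s.now - s.lastCompactionAt > GAP_LIMIT_MS) return "force_compact";
  if (s.tokenUsageRatio > TOKEN_THRESHOLD) return "token_check";
  return "none";
}
```

Keeping the decision separate from the side effects (newSession(), compact(), alerts) makes the guard ordering easy to unit test.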
packages/gateway/src/channels/redis.ts:
- isQuietHours() utility (PST timezone)
- Quiet hours triage rule in event drain loop
- Quiet hours check in flushBatchDigest()
Consequences
- Compaction gap is bounded: maximum 4 hours between compactions, preventing context bloat
- Session freshness is bounded: maximum 24 hours, preventing multi-day JSONL growth
- Overnight token burn reduced: quiet hours batching prevents unnecessary prompts
- Interactive messages always process: human messages are never delayed by quiet hours
- Existing token-based health check preserved: ADR-0141 still fires for acute spikes