Opus Token Reduction Across joelclaw
Status
accepted — Phase 0 (emergency model fix + enforcement) implemented 2026-02-20.
Incident: $1000 in 2 Days
The gateway daemon was configured with PI_MODEL="claude-opus-4-20250514" — a dated snapshot ID resolving to Opus 4 pricing ($15/$75 per MTok). The correct model is claude-opus-4-6 ($5/$15 per MTok). This 5x output cost multiplier, combined with 50-100 events/day through Opus, burned ~$1000 in approximately 2 days.
Immediate Fixes (2026-02-20)
- gateway-start.sh: Fixed model to
claude-opus-4-6, addedALLOWED_MODELSallowlist — gateway refuses to start if model not on list - vip-email-received.ts:
opus-4-1→opus-4-6(was defaulting to $15/$75 tier) - batch-review.ts: bare
claude-haiku→claude-haiku-4-5(explicit version) - lib/models.ts: Centralized
MODELregistry withassertAllowedModel()guard, pricing comments, semantic aliases (MODEL.OPUS,MODEL.SONNET,MODEL.HAIKU,MODEL.CODEX,MODEL.O4_MINI,MODEL.O3)
Model Enforcement Rules
- No dated snapshot IDs (e.g.
claude-opus-4-20250514) — these silently resolve to expensive legacy tiers - No bare aliases (e.g.
claude-haiku) — always specify version - All models must be in
ALLOWED_MODELSor the gateway won’t start - Use
MODEL.*constants fromlib/models.tsin all Inngest functions - OpenAI models included:
gpt-5.3-codex,o4-mini,o3— used by codex exec, agent loops, friction-fix
Context
The joelclaw system burns Opus tokens at an alarming rate. The central gateway daemon runs claude-opus-4-6 with thinking:low and every inbound event — heartbeat, media notification, email triage result, Todoist echo, content sync — gets turned into a prompt that hits Opus for inference.
Current Token Burn Points
| Component | Model | Trigger | Frequency | Token Impact |
|---|---|---|---|---|
| Gateway daemon | Every non-suppressed event | ~50-100/day | HUGE — full session context + event prompt (5x cheaper after fix) | |
| Gateway heartbeat | Hourly cron | 24/day | Large — reads HEARTBEAT.md checklist | |
| VIP email analysis | Inbound VIP email | ~5-10/day | Medium — full email analysis (5x cheaper after fix) | |
| Media vision-describe | Haiku | Photo/video received | ~5/day | Small — Haiku is cheap |
| Observer (memory) | Haiku | Session compaction | ~5-10/day | Small |
| Reflector (memory) | Haiku | After observation | ~5-10/day | Small |
| Batch review (memory) | Hourly | 24/day | Small | |
| Email triage | Sonnet 4.6 | Hourly check | 24/day | Medium |
| Task triage | Sonnet 4.6 | Hourly heartbeat | 24/day | Medium |
| Friction analysis | Sonnet 4.6 | After reflect | ~5/day | Small |
| Meeting analysis | Sonnet | After Granola sync | ~2/day | Medium |
| Daily digest | Sonnet | Daily | 1/day | Medium |
| Email cleanup | Haiku | On demand | Rare | Small |
The Core Problem
The gateway daemon is a pi session running Opus. Every event gets buildPrompt() → enqueue() → session.prompt(). This means:
- Session context grows — pi maintains conversation history, so each prompt carries all prior context
- Opus processes everything — a simple “media processed” notification costs the same as a complex triage decision
- No routing by complexity — a heartbeat OK and a VIP email analysis both go through the same Opus pipeline
- Suppression is the only control — events are either fully processed by Opus or fully dropped. No middle ground.
Decision Drivers
- Opus is ~15x more expensive than Haiku per token
- Most gateway events need routing/formatting, not reasoning
- The gateway session context balloons over time (no compaction for daemon)
- Joel wants the system autonomous but cost-aware
Considered Options
Option A: Tiered Model Routing in Gateway
Route events to different models based on complexity:
- Pass-through (no LLM): Media descriptions, task echoes, content sync — just format and forward to Telegram. No inference needed.
- Haiku: Heartbeat checks, simple triage, acknowledgments
- Sonnet: Email summaries, meeting analysis, multi-event digests
- Opus: Only for complex decisions requiring full context — VIP emails, ambiguous requests, multi-step reasoning
Implementation: Add a routeEvent() function before buildPrompt() that classifies events and either (a) formats directly for Telegram without LLM, (b) uses a lightweight pi session, or (c) falls through to the Opus session.
Option B: Gateway as Dumb Router + Smart Workers
Strip all LLM inference from the gateway daemon. Make it a pure event router:
- Gateway receives events, formats them, forwards to Telegram — no
session.prompt()at all - Complex decisions stay in Inngest functions (which already use appropriate models)
- Telegram user messages → fire Inngest event → function handles with right model → gateway.notify() delivers response
This is the logical endpoint of “heartbeat as pure fan-out” (existing pattern).
Option C: Session Compaction for Gateway
Keep Opus but add session compaction to the gateway daemon to prevent context growth. Reduce heartbeat frequency. Batch more events.
Proposed Decision
Option B (Gateway as Dumb Router) with elements of Option A as a migration path.
Rationale
The gateway daemon already has the “bias-to-action triangle” (IMMEDIATE / BATCHED / SUPPRESSED). The missing tier is PASS-THROUGH — events that need delivery but not inference.
Most gateway.notify() calls from Inngest functions already contain fully-formed messages (the media-process function formats the description, the email triage formats the summary). The gateway just needs to deliver them to Telegram, not re-analyze them with Opus.
Migration Path
-
Phase 1: Pass-through delivery — Events with a
payload.messagefield get formatted and sent directly to Telegram. Nosession.prompt(). This coversgateway.notify(),gateway.progress(),gateway.alert(). -
Phase 2: Heartbeat demotion — Heartbeat system checks already run as independent Inngest functions. Gateway heartbeat becomes: receive results, format summary, send to Telegram. Haiku at most.
-
Phase 3: Telegram → Inngest — User messages from Telegram fire an Inngest event instead of hitting
session.prompt(). A dedicated function handles the conversation with the appropriate model. Response flows back viagateway.notify(). -
Phase 4: Remove Opus from gateway — Gateway runs without an LLM session. Pure event router + Telegram delivery.
What Stays on Opus
- Interactive pi sessions (this terminal, SSH sessions) — user is present, Opus quality matters
- VIP email analysis (already a separate function, can stay Opus or drop to Sonnet)
What Moves to Cheaper Models
- All gateway event processing → pass-through (no LLM)
- Heartbeat → pass-through or Haiku
- Telegram conversations → Sonnet (good enough for most interactions)
Consequences
Positive
- Dramatic cost reduction (estimated 80-90% of gateway token spend eliminated)
- Gateway becomes faster (no waiting for Opus inference per event)
- Gateway becomes more reliable (fewer failure modes, no session state issues)
- Clear separation: routing ≠ reasoning
Negative
- Telegram conversations lose Opus-level reasoning (mitigated: can escalate to Opus for complex queries)
- More code in gateway for formatting/routing logic
- Migration requires careful testing of each event type
Risks
- Some events may need more intelligence than pass-through — need escape hatch to LLM
- Telegram conversation quality may noticeably degrade on Sonnet vs Opus
Research: OpenClaw Token Optimization Patterns
OpenClaw (~/Code/openclaw/openclaw) manages token costs across a multi-channel agent platform. Relevant patterns from their codebase and git history (credit: Peter Steinberger / OpenClaw contributors):
1. Heartbeat Model Override (heartbeat.model)
OpenClaw lets you run heartbeats on a different, cheaper model than the session primary. Config: agents.defaults.heartbeat.model: "anthropic/claude-haiku". The main session keeps Opus for user interactions; heartbeats use Haiku. This is exactly what joelclaw needs — the gateway’s hourly heartbeat doesn’t need Opus.
Commit: 4200782a5 — “fix(heartbeat): honor heartbeat.model config for heartbeat turns”
2. Heartbeat Transcript Pruning
When a heartbeat run produces HEARTBEAT_OK (no action needed), OpenClaw truncates the transcript back to pre-heartbeat size. Zero-information exchanges are pruned from history, preventing context window pollution from routine checks.
Commit: e9f2e6a82 — “fix(heartbeat): prune transcript for HEARTBEAT_OK turns”
joelclaw equivalent: The gateway daemon never prunes its session. Every heartbeat acknowledgment accumulates in context, making subsequent Opus calls progressively more expensive.
3. Session Pruning (Cache-TTL Aware)
OpenClaw prunes old tool results from in-memory context before each LLM call. Two levels:
- Soft-trim: Keep head + tail of oversized tool results, insert
... - Hard-clear: Replace entire tool result with
[Old tool result content cleared]
Configurable per-tool (tools.allow/tools.deny), with TTL-aware triggering to avoid re-caching full history when Anthropic prompt cache expires.
Defaults: keepLastAssistants=3, softTrim maxChars=4000, hardClear enabled
joelclaw equivalent: The gateway session never trims tool results. Long bash outputs, file reads, and API responses persist in full across the entire session lifetime.
4. Image Dimension Reduction
OpenClaw reduced default image max dimension from 2000px to 1200px. Vision tokens scale with image size — smaller images = fewer tokens with sufficient detail for most use cases.
Commit: 5ee79f80e — “fix: reduce default image dimension from 2000px to 1200px”
joelclaw equivalent: media-process sends images to Haiku (already cheap), but if the gateway ever processes images directly, this matters.
5. Skill Path Compaction
Replace absolute home directory in skill <location> tags with ~. Saves ~5-6 tokens per path × 90+ skills = 400-600 tokens per system prompt.
Commit: 4f2c57eb4 — “feat(skills): compact skill paths with ~ to reduce prompt tokens”
joelclaw equivalent: The gateway’s pi session loads all skills into the system prompt with full absolute paths. Direct savings opportunity.
6. Bootstrap File Truncation
System prompt workspace files (AGENTS.md, MEMORY.md, etc.) are truncated at configurable limits:
- Per-file:
bootstrapMaxChars(default: 20,000) - Total:
bootstrapTotalMaxChars(default: 150,000)
joelclaw equivalent: AGENTS.md alone is ~15K chars. MEMORY.md is growing. No truncation in gateway.
7. Compaction Reserve Tokens Floor
Minimum 20,000 tokens reserved for compaction summaries. Prevents compaction from producing too-small summaries that lose critical context.
8. Cron Usage Tracking
OpenClaw has scripts/cron_usage_report.ts — parses JSONL run logs to produce per-job, per-model token usage reports with input/output/cache breakdowns. Essential for identifying which cron jobs are most expensive.
joelclaw equivalent: No per-function token tracking. We can’t currently measure which Inngest functions or gateway events cost the most.
9. Active Hours Window
agents.defaults.heartbeat.activeHours: { start: "09:00", end: "22:00" } — heartbeats only run during waking hours. No overnight token burn for a system nobody’s watching.
joelclaw equivalent: Heartbeat runs 24/7. Sleep mode exists but is manual. Active hours would eliminate ~8 hours of heartbeat token spend automatically.
10. Cache Warming via Heartbeat Interval
If Anthropic cache TTL is 1h, setting heartbeat to 55min keeps the prompt cache warm — subsequent requests are cache reads (cheap) instead of cache writes (expensive). OpenClaw documents this explicitly in their token-use guide.
joelclaw equivalent: Not leveraging prompt caching at all. The gateway daemon doesn’t configure cacheRetention or align heartbeat timing with cache TTL.
11. Model Fallback Chain
OpenClaw has a full model failover system — on 429/503/timeout, it falls to the next model in the chain. Subagent spawns can override the model. Context overflow errors are specifically excluded from fallback (switching to a smaller-context model on overflow would be counterproductive).
Commit: b8f66c260 — “Agents: add nested subagent orchestration controls and reduce subagent token waste”
Summary: Applicable Techniques for joelclaw
| Technique | Effort | Impact | Phase |
|---|---|---|---|
| Pass-through delivery (no LLM for formatted events) | Medium | Huge | 1 |
| Heartbeat model demotion (Opus → Haiku) | Low | High | 1 |
| Active hours window (no overnight heartbeat) | Low | Medium | 1 |
| Heartbeat transcript pruning (HEARTBEAT_OK → prune) | Medium | Medium | 2 |
| Session pruning (trim old tool results) | Medium | Medium | 2 |
| Skill path compaction (~ instead of /Users/joel) | Low | Small | 1 |
| Bootstrap file truncation | Low | Small | 1 |
| Per-function token tracking | Medium | Diagnostic | 2 |
| Prompt cache warming (align heartbeat with TTL) | Low | Medium | 2 |
| Gateway → dumb router (full Opus removal) | High | Huge | 3 |