Context Windows Are Budgets — Prompt Caching Is the Discount
Both vault-reader context injection and inference-router model selection depend on understanding per-model context limits and prompt caching boundaries.
Claude’s context windows are large — 200K tokens across the modern model family — but “large” doesn’t mean “infinite,” and that distinction is where real architecture decisions live. The Anthropic context window docs make this explicit: context is a resource, and like all resources, it gets interesting when you have to manage it intentionally rather than just fill it up and hope for the best.
The big unlock buried in the context window story is prompt caching. When your system repeatedly injects the same large context — system prompts, document collections, Vault snippets — caching lets you pay the processing cost once and amortize it across many calls. The economics flip from “per-call expensive” to “first-call expensive, subsequent calls cheap.” That changes the viable design space significantly. Systems that felt cost-prohibitive to run with rich context suddenly become workable.
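To make the "first-call expensive, subsequent calls cheap" flip concrete, here is a back-of-envelope cost model. The multipliers are assumptions based on Anthropic's published pricing at the time of writing (cache writes cost roughly 1.25x the base input rate, cache reads roughly 0.1x); check the current pricing page before relying on exact numbers.

```typescript
// ASSUMED pricing multipliers -- verify against Anthropic's pricing page.
const CACHE_WRITE_MULT = 1.25; // writing the prefix into the cache
const CACHE_READ_MULT = 0.1;   // reading it back on later calls

/** Total input-token cost (in base-rate units) for `calls` requests sharing
 *  a `prefixTokens` cached prefix, each adding `suffixTokens` of fresh input. */
function cachedCost(prefixTokens: number, suffixTokens: number, calls: number): number {
  const firstCall = prefixTokens * CACHE_WRITE_MULT + suffixTokens;
  const laterCalls = (calls - 1) * (prefixTokens * CACHE_READ_MULT + suffixTokens);
  return firstCall + laterCalls;
}

/** Same workload with no caching: the full prefix is reprocessed every call. */
function uncachedCost(prefixTokens: number, suffixTokens: number, calls: number): number {
  return calls * (prefixTokens + suffixTokens);
}
```

For a 50K-token cached prefix, a 1K-token per-call suffix, and 20 calls, the uncached cost is 1,020,000 token-units while the cached cost is 177,500: roughly a 5.7x reduction, and the gap widens with every additional call against the same prefix.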
For something like @joelclaw/vault-reader, which injects Obsidian content into agent calls, this is concrete. The system prompt and Vault snippets can be cached, so the gateway only pays full token cost on cold calls. But it requires structuring injected context so the cache boundary is clean — cached prefix has to be stable and contiguous. Structure matters.
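A minimal sketch of what a clean cache boundary looks like. The names here (`VaultSnippet`, `assemblePrompt`) are illustrative, not vault-reader's actual API: the point is that the cacheable prefix must be assembled deterministically, because any byte-level difference between calls invalidates the cache.

```typescript
// Illustrative types -- not vault-reader's real interface.
interface VaultSnippet { path: string; content: string; }

function assemblePrompt(systemPrompt: string, snippets: VaultSnippet[], userQuery: string) {
  // Sort snippets deterministically so the same set always yields the same
  // byte-for-byte prefix; a different ordering would break the cache hit.
  const stable = [...snippets]
    .sort((a, b) => a.path.localeCompare(b.path))
    .map((s) => `<doc path="${s.path}">\n${s.content}\n</doc>`)
    .join("\n");

  return {
    cachedPrefix: `${systemPrompt}\n\n${stable}`, // the stable, contiguous part
    volatileSuffix: userQuery,                    // varies per call; never cached
  };
}
```

The key design choice is that nothing call-specific leaks into `cachedPrefix`: timestamps, request IDs, or the user's query anywhere before the boundary would make every call a cold call.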
The “lost in the middle” problem is still real even at 200K tokens. Models attend better to content at the beginning and end of the context window; material buried in the center gets less reliable retrieval. Position is an architecture decision, not just a formatting choice. If you’re injecting Vault context via vault-reader, the order you assemble that context shapes what the model actually uses.
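One way to act on this is to interleave ranked snippets so the most relevant land at the edges of the assembled context and the rest fill the middle. This is a sketch under the assumption that some upstream ranker supplies a relevance score; note that relevance-based reordering only makes sense for the volatile portion of the prompt, since reordering cached content would invalidate the cache.

```typescript
// Place the highest-relevance items at the edges of the context, where
// attention is strongest, and bury the weakest in the middle.
// `relevance` is assumed to come from an upstream ranker.
function orderForAttention<T>(items: T[], relevance: (item: T) => number): T[] {
  const ranked = [...items].sort((a, b) => relevance(b) - relevance(a));
  const front: T[] = [];
  const back: T[] = [];
  // Alternate inward: best item first, second-best last, and so on.
  ranked.forEach((item, i) => (i % 2 === 0 ? front.push(item) : back.unshift(item)));
  return [...front, ...back];
}
```

Given scores `[1, 2, 3, 4, 5]`, this yields `[5, 3, 1, 2, 4]`: the two strongest items sit at the start and end, and the weakest ends up mid-context where degraded retrieval costs the least.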
Key Ideas
- Context as budget: every token costs money and counts against a hard limit — knowing per-model limits is required for correct routing in @joelclaw/inference-router
- Prompt caching changes the economics of long-context calls — Anthropic’s caching lets you pay once for stable prefixes (system prompts, injected documents)
- Cache boundaries must be architecturally deliberate — the cached prefix must be stable and contiguous; ad-hoc context assembly breaks caching
- “Lost in the middle” — attention degrades for content in the center of very long contexts; position is not neutral
- Different Claude models have different limits — Haiku/Sonnet/Opus all share 200K but vary on price/speed, so routing logic in the inference-router has to account for this
- Token counting is not optional — staying within limits and estimating costs requires explicit counting before calls, not after failures
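The "count before you send" point can be sketched as a pre-flight budget check. The ~4-characters-per-token heuristic below is a loudly labeled assumption suitable only as a guard rail; for exact numbers you would call Anthropic's token-counting endpoint before dispatching the request.

```typescript
const CONTEXT_LIMIT = 200_000; // shared limit across the modern Claude family

// ASSUMPTION: ~4 characters per token, a rough average for English prose.
// Use Anthropic's token-counting API for exact counts in production.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

/** True if the prompt plus the reserved output budget fits in the window. */
function fitsBudget(prompt: string, maxOutputTokens: number, limit = CONTEXT_LIMIT): boolean {
  return estimateTokens(prompt) + maxOutputTokens <= limit;
}
```

Routing logic can use this to fail fast or pick a different assembly strategy before a request is sent, rather than discovering the overflow as an API error.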
Links
- Anthropic: Context Windows — source
- Anthropic: Prompt Caching — the economic unlock
- Anthropic: Token Counting — counting before you send
- “Lost in the Middle” paper (Liu et al., 2023) — the academic basis for position mattering
- ADR-0140: Inference Router — where model selection and context limits intersect in joelclaw
- Anthropic Model Overview — per-model context limits and pricing