Context Windows Are a Lie (Sort Of)
Every new model announcement leads with the same headline: bigger context window. 100K tokens. 200K tokens. A million tokens. The implication is clear — more context means better performance. Just throw everything in and let the model figure it out.
I bought into this for a while. Then I actually tried it.
The Needle in a Haystack Problem
There's a well-known benchmark called "needle in a haystack" that tests whether a model can find a specific piece of information buried in a large context. Most modern models pass this test with flying colors. Great. But here's what the benchmark doesn't test: can the model use that information correctly in combination with everything else in the context?
Finding a needle is easy. Threading it while juggling is hard.
When I was building Mimir with a single-agent architecture, I had everything in one context: system prompt, tool definitions, conversation history, memory retrievals, session state. The model could technically "see" all of it. But its ability to act on any specific piece of information degraded as the total context grew. Not because the information wasn't there — because the signal-to-noise ratio collapsed.
A 200K context window doesn't mean the model can effectively use 200K tokens. It means it can hold 200K tokens while effectively using maybe 20% of them.
What I Actually Observed
Three specific failure patterns showed up as context grew:
- Instruction amnesia. Instructions at the top of the system prompt got ignored when the context was long enough. The model would follow instructions from the last few thousand tokens and forget rules defined 50,000 tokens earlier. This matches well-documented findings on long-context attention: models weight recent tokens heavily, and reliability degrades for instructions far from the end of the context.
- Tool confusion. With 80+ tool definitions in context, the model would occasionally call the wrong tool or pass parameters meant for one tool to another. Not because it couldn't read the definitions — because it was processing too many similar-looking definitions simultaneously.
- Context bleed. Information from one part of the context would leak into responses about something else. I'd ask about a calendar event and get details mixed in from an email that happened to be in the conversation history. The model wasn't hallucinating — it was cross-referencing things it shouldn't have been.
The Architecture Fix
The solution wasn't to use less context. It was to use the right context at the right time. This is the principle behind the three-agent pipeline I described in an earlier article, but it applies more broadly.
Every stage of processing should receive only the context it needs:
- The Planner gets full context because it needs to understand the big picture
- The Worker gets minimal context because it needs to execute precisely
- The Synthesizer gets results plus conversation tone, not the full history
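This stage-scoped allocation can be sketched in code. Everything here is illustrative — the dataclass fields and plan structure are assumptions for the sketch, not Mimir's actual API:

```python
# Sketch of per-stage context assembly. The RequestContext fields and the
# plan dict shape are hypothetical stand-ins for whatever the real system uses.
from dataclasses import dataclass

@dataclass
class RequestContext:
    system_prompt: str
    history: list    # full conversation history
    tools: list      # all tool definitions, e.g. [{"name": ...}, ...]
    memories: list   # retrieved memory snippets

def planner_context(ctx: RequestContext) -> dict:
    # The Planner sees everything: it needs the big picture to decide.
    return {
        "system": ctx.system_prompt,
        "history": ctx.history,
        "tools": ctx.tools,
        "memories": ctx.memories,
    }

def worker_context(ctx: RequestContext, plan: dict) -> dict:
    # The Worker gets only what the plan selected: minimal, precise.
    selected = {t["name"] for t in plan["tools"]}
    return {
        "system": plan["instructions"],
        "tools": [t for t in ctx.tools if t["name"] in selected],
        "memories": plan.get("memories", []),
    }

def synthesizer_context(ctx: RequestContext, results: list) -> dict:
    # The Synthesizer gets results plus a few recent turns for tone,
    # not the full history.
    return {
        "system": ctx.system_prompt,
        "history": ctx.history[-4:],
        "results": results,
    }
```

The point of the three functions is that no stage ever receives context by default — each one gets a deliberately constructed subset.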
But beyond the pipeline, I also implemented aggressive context management at every level:
Conversation history pruning. Not every message in the conversation is equally important. System messages, tool call details, and routine confirmations get summarized or dropped. The actual user messages and key assistant responses are preserved verbatim.
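A minimal sketch of that pruning policy, assuming a simple role-tagged message format. The one-line placeholder stands in for real summarization, which would use a model:

```python
# Hypothetical history pruner: recent turns and user messages survive
# verbatim; older non-user messages (tool calls, confirmations) collapse
# into a single summary marker.
def prune_history(messages, keep_last=6):
    """messages: list of {"role": ..., "content": ...} dicts."""
    head, tail = messages[:-keep_last], messages[-keep_last:]
    pruned, dropped = [], 0
    for msg in head:
        if msg["role"] == "user":
            pruned.append(msg)  # user intent is always preserved verbatim
        else:
            dropped += 1        # routine assistant/tool traffic is collapsed
    if dropped:
        pruned.append({"role": "system",
                       "content": f"[{dropped} routine messages summarized]"})
    return pruned + tail
```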
Dynamic tool loading. Instead of sending all 80+ tool definitions in every request, the Planner identifies which tools are likely needed and only those definitions get sent to the Worker. A calendar request doesn't need to see the code execution tool definitions.
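As a sketch, tool selection can be as simple as a domain-to-tools map the Planner's output indexes into. The group names and tool names below are invented for illustration; the article doesn't specify how the real Planner routes:

```python
# Hypothetical tool registry grouped by domain. A real Planner would emit
# the needed domains (or tool names) as part of its plan.
TOOL_GROUPS = {
    "calendar": ["calendar.create_event", "calendar.list_events"],
    "email": ["email.search", "email.send"],
    "code": ["code.execute"],
}

def select_tools(plan_domains, all_tools):
    """Return only the definitions for tools the plan says it needs."""
    wanted = set()
    for domain in plan_domains:
        wanted.update(TOOL_GROUPS.get(domain, []))
    return [t for t in all_tools if t["name"] in wanted]
```

A calendar request routed this way ships two tool definitions to the Worker instead of 80+.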
Memory injection budgets. Each request gets a maximum number of memory tokens. If semantic search returns 15 relevant memories, they get ranked and only the top N that fit within the budget are injected. The rest are available if the Planner explicitly asks for more.
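A budget-aware injection step might look like the following sketch. Word count stands in for token count here; a real implementation would use the model's tokenizer:

```python
# Hypothetical memory budgeter: rank retrieved memories by relevance score,
# inject the top ones that fit, and stop once the budget is exhausted.
def inject_memories(memories, budget_tokens):
    """memories: list of (score, text) pairs from semantic search."""
    injected, used = [], 0
    for score, text in sorted(memories, key=lambda m: m[0], reverse=True):
        cost = len(text.split())  # crude token estimate: whitespace words
        if used + cost > budget_tokens:
            break                 # over budget; rest stays retrievable on demand
        injected.append(text)
        used += cost
    return injected
```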
The goal isn't to maximize context. It's to maximize the ratio of useful context to total context.
Practical Takeaways
If you're building AI applications and relying on large context windows to solve your problems, here's what I'd suggest:
- Measure effective context, not total context. How much of what you're sending is actually being used in the response? If you're sending 50K tokens and the model is only meaningfully engaging with 5K, you have a 90% waste problem.
- Put critical instructions at the end. Due to recency bias in attention, instructions closer to the actual query get more weight. If you have non-negotiable rules, repeat them near the end of the context.
- Separate concerns into separate calls. If a task requires both understanding context and executing precisely, those should be different LLM calls with different context.
- Budget your context like you budget compute. Every token has a cost — not just in dollars, but in attention. Treat context as a scarce resource, not an infinite buffer.
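Two of these takeaways — budgeting the context and repeating critical rules near the end — can be combined in one assembly step. This is a sketch under simplifying assumptions (string messages, word-count tokens), not a production prompt builder:

```python
# Hypothetical prompt assembler: fill memories and recent history within a
# fixed budget, then place non-negotiable rules and the query last, where
# recency bias gives them the most weight.
def assemble_prompt(system, history, memories, rules, query, budget=8000):
    """All inputs are plain strings; 'tokens' are approximated by word count."""
    def cost(text):
        return len(text.split())

    parts = [system]
    # Reserve room up front for the parts that must always be present.
    used = cost(system) + cost(rules) + cost(query)
    for mem in memories:
        if used + cost(mem) > budget:
            break
        parts.append(mem)
        used += cost(mem)
    for msg in history[-10:]:  # recent turns only
        if used + cost(msg) > budget:
            break
        parts.append(msg)
        used += cost(msg)
    parts.append(rules)  # repeat critical rules near the end
    parts.append(query)
    return "\n\n".join(parts)
```

Note the ordering: rules go after history and memories, immediately before the query, rather than relying solely on their copy at the top of the system prompt.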
The models will keep getting bigger context windows. That's great. But architecture will always matter more than capacity. A well-structured 10K context will outperform a poorly structured 200K context every single time.
Part of a series on building reliable AI applications. Previous: AI Memory Systems.
Also worth reading: Planner, Worker, Synthesizer — the architecture that solves the context noise problem described here.