Context Window

Also known as: context length, context size, token limit

TL;DR

The context window is the maximum number of tokens an LLM can process at once. Modern LLMs span 8K to 1M+ tokens, but the effective window, where attention quality stays high, is usually much smaller.

The context window is the maximum number of tokens an LLM can process in a single call, input plus output combined. It’s the firmest limit in the LLM stack: send more, and the API either rejects the request or silently truncates it.
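Because input and output share one window, a pre-flight check can catch an oversized request before it reaches the API. The sketch below is a minimal illustration, not any provider's API: the function names are hypothetical, and the token counts are assumed to come from the model's own tokenizer.

```python
def fits_context(input_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """Return True if the input plus the reserved output fits in the window."""
    return input_tokens + max_output_tokens <= context_window


def max_input_tokens(context_window: int, max_output_tokens: int) -> int:
    """Largest input that still leaves room for the reserved output."""
    return max(0, context_window - max_output_tokens)


# Example: a 200K window with 8K tokens reserved for the model's answer.
print(fits_context(190_000, 8_000, 200_000))
print(max_input_tokens(200_000, 8_000))
```

Reserving the output budget up front matters because a prompt that fits on its own can still overflow once the model starts generating.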

Stated context is a hard ceiling; effective context is a soft one, and production systems are constrained by the soft ceiling, not the hard one.

Why it’s bounded

The constraint is the attention mechanism’s O(n²) compute and memory cost in sequence length. Every doubling of the window quadruples the attention work. Models with very long context (Gemini 1.5 Pro at 1M tokens, Claude with 200K, GPT-4 Turbo at 128K) use various tricks, such as sparse attention, sliding-window patterns, and KV-cache optimizations, to make long context tractable, but the underlying scaling is unforgiving.
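To make the quadratic scaling concrete, the sketch below compares the growth of the attention term with the growth of the parts of the model that scale linearly in sequence length. It is simplified arithmetic under the assumption of plain dense attention; the function names are illustrative, and real systems with sparse or sliding-window attention will land somewhere below the quadratic figure.

```python
def attention_cost_ratio(n_tokens: int, baseline_tokens: int) -> float:
    """Relative cost of the O(n^2) attention term versus a baseline length."""
    return (n_tokens / baseline_tokens) ** 2


def linear_cost_ratio(n_tokens: int, baseline_tokens: int) -> float:
    """Relative cost of the per-token (linear) work versus a baseline length."""
    return n_tokens / baseline_tokens


# Doubling the window quadruples the attention work.
print(attention_cost_ratio(2_000, 1_000))    # 4.0

# Going from 1K to 200K tokens: 200x linear work, 40,000x attention work.
print(linear_cost_ratio(200_000, 1_000))     # 200.0
print(attention_cost_ratio(200_000, 1_000))  # 40000.0
```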

Stated vs effective context

The number you see in the API docs is the stated maximum. The effective context, where the model actually pays close attention to what’s there, is usually much smaller. This degradation, broadly called context rot, was prominently quantified in the Lost in the Middle paper (Liu et al., 2023), which found a U-shaped pattern: models use information near the start and end of their context most reliably (as a rough rule of thumb, the first and last ~10%) and progressively miss things in the middle.

So a 200K-window model in practice has maybe ~20K tokens of high-fidelity attention. Beyond that, you’re running on degraded recall — the information is technically present, but the model is not reliably using it.

The middle-position component of context rot is partly architectural and partly distributional. On the architectural side, every transformer attention head sees the full sequence in principle — the math is uniform across positions. But the training distribution isn’t uniform: pretraining and instruction-tuning data weights early and late tokens more heavily because tasks tend to put the question or the answer at the boundaries.

Positional encoding contributes too. Modern RoPE-based encodings extrapolate imperfectly past the training context length, and the failure modes concentrate in the middle of long sequences rather than at the ends. RoPE-extension techniques (NTK-aware scaling, YaRN, dynamic NTK) push the point at which middle-attention degrades further out, but they don’t eliminate it.
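As an illustration of what a RoPE-extension technique does, here is a minimal sketch of one commonly cited formulation of NTK-aware scaling, which enlarges the RoPE base so that low frequencies are stretched while the highest frequency is left alone. The base of 10000 and head dimension of 128 are assumptions chosen for the example, not values taken from any particular model, and YaRN and dynamic NTK are not shown.

```python
def rope_inv_freq(head_dim: int, base: float = 10000.0) -> list[float]:
    """Standard RoPE inverse frequencies: base^(-2i/d) for i in [0, d/2)."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def ntk_scaled_base(base: float, scale: float, head_dim: int) -> float:
    """NTK-aware base adjustment: base * scale^(d / (d - 2))."""
    return base * scale ** (head_dim / (head_dim - 2))


head_dim = 128   # assumed per-head dimension
scale = 4.0      # assumed context-extension factor

original = rope_inv_freq(head_dim)
scaled = rope_inv_freq(head_dim, ntk_scaled_base(10000.0, scale, head_dim))

# Highest frequency (i = 0) is unchanged; lowest frequency is divided by `scale`.
print(original[0], scaled[0])
print(original[-1] / scaled[-1])
```

The design intent is that nearby-token resolution (high frequencies) is preserved while the slowest-rotating dimensions are interpolated to cover the longer range.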

The practical implication is that the position of a fact in your prompt matters as much as its inclusion. Ordering documents by reranker score before you concatenate them, with the highest-scoring at the top, is a cheap quality lift: once the scores exist, the sort adds no tokens to the prompt and negligible latency.

Why this matters for RAG

Two practical implications:

Long context doesn’t replace RAG. The intuition “I’ll just dump all my documents into a 1M-token window and let the LLM figure it out” runs into both cost (per-call price scales with token count) and quality (context rot). Filtering to the relevant ~20K tokens via retrieval and reranking outperforms cramming.
Position in the prompt matters. Order your retrieved documents by reranker score, with the most relevant at the top of the prompt. The LLM will pay them the most attention.
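The ordering step described above can be sketched in a few lines. This is an illustrative example only: `ScoredDoc` and `build_context` are hypothetical names, and the scores are assumed to come from whatever reranker you already run, with higher meaning more relevant.

```python
from dataclasses import dataclass


@dataclass
class ScoredDoc:
    text: str
    score: float  # hypothetical reranker relevance score, higher is better


def build_context(docs: list[ScoredDoc], separator: str = "\n\n") -> str:
    """Concatenate documents with the highest-scoring one at the top."""
    ordered = sorted(docs, key=lambda d: d.score, reverse=True)
    return separator.join(d.text for d in ordered)


docs = [
    ScoredDoc("Refund policy overview", 0.21),
    ScoredDoc("Refunds are issued within 14 days", 0.93),
    ScoredDoc("Contact support for refunds", 0.55),
]
print(build_context(docs))
```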

How to budget the window in production

A typical RAG prompt structure that respects the context window:

Token budget for a typical RAG call

System prompt (~500 tokens): instructions, tone, output format.
Retrieved context (variable): the chunked, reranked, compressed documents. Aim for 5K-15K tokens of high-relevance material.
Conversation history (~1K-5K tokens): prior turns if it’s a chat app.
User query + reasoning headroom (a few hundred to a few thousand tokens for the model’s output).
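One way to turn the budget above into code is to compute what is left for retrieved context after the fixed parts are reserved, then greedily pack relevance-ordered chunks into it. This is a sketch under stated assumptions: the function names are hypothetical, the 15K cap mirrors the upper end of the range suggested above, and `count_tokens` is a crude whitespace stand-in that should be replaced with the model's real tokenizer.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in: whitespace word count. Swap in the model's tokenizer."""
    return len(text.split())


def retrieval_budget(context_window: int, system_tokens: int, history_tokens: int,
                     query_tokens: int, output_reserve: int,
                     target_cap: int = 15_000) -> int:
    """Tokens available for retrieved context, capped at a quality target."""
    remaining = (context_window - system_tokens - history_tokens
                 - query_tokens - output_reserve)
    return max(0, min(remaining, target_cap))


def pack_chunks(chunks: list[str], budget: int) -> tuple[list[str], int]:
    """Greedily keep chunks (already sorted by relevance) that fit the budget."""
    selected, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            continue  # too large; a smaller, later chunk may still fit
        selected.append(chunk)
        used += cost
    return selected, used


budget = retrieval_budget(200_000, 500, 3_000, 200, 2_000)
print(budget)
print(pack_chunks(["a b c", "d e f g", "h"], 4))
```

Capping the retrieval budget below what the window technically allows is the point: the cap is chosen for attention quality and cost, not for fit.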

If you’re at 30K tokens of input on a 200K window, you have plenty of stated room, but you should still question whether all of that input is genuinely useful. Compression saves you money and improves quality.

When the window changes mid-deployment

LLM providers periodically extend the maximum context. The constraint to watch isn’t whether your prompts fit; it’s whether your prompts are cheap and your information sits in the parts the model actually attends to. Both are independent of the stated maximum.

Go further

If a model has a 200K context window, why do I still need RAG?

Two reasons. First, [attention](/concepts/attention/) is O(n²): going from 1K to 200K tokens multiplies the linear per-token cost by 200× and the quadratic attention term by up to 40,000×. Second, attention quality degrades well before the stated limit, often within the first few tens of thousands of tokens; this is the [context rot](/concepts/context-rot/) phenomenon. RAG sends just the relevant slices, which is both cheaper and more accurate than sending everything.

Related: RAG, Context compression, Attention

What's context rot and how do I fight it?

Empirically, LLMs pay closest attention to tokens near the start and end of their context and miss things in the middle — the canonical case is the U-shaped position bias documented by Liu et al. (2023) under the name 'lost in the middle.' A relevant fact buried at position 50,000 of a 100,000-token context can be functionally invisible. Reranking helps by ensuring the most relevant docs are at the top of the prompt, where attention concentrates.

Related: Context rot, Reranker, Score calibration

Are 1M-token windows actually useful?

Sometimes — for tasks where the relevant information genuinely is spread across the full corpus and synthesis is the bottleneck. More commonly the 'long context replaces RAG' framing is misleading: even when content fits, the per-call cost and attention quality argue for filtering down to the relevant slice anyway. Long context is a tool, not a substitute for retrieval.

Related: Context compression, RAG
