Context Window

Also known as: context length, context size, token limit

TL;DR

The context window is the maximum number of tokens an LLM can process at once. Modern LLMs span 8K to 1M+ tokens, but the effective window, where attention quality stays high, is usually much smaller.

The context window is the maximum number of tokens an LLM can process in a single call, input plus output combined. It’s the hardest limit in the LLM stack: try to send more, and the API either rejects the request or silently truncates the input.
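Because input and output share the same cap, a pre-flight check can catch oversized requests before they are sent. The following is a minimal sketch, assuming a rough heuristic of about four characters per token; real tokenizers vary by model, and `estimate_tokens` and `fits_window` are hypothetical helper names, not part of any provider SDK.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. ASSUMPTION: ~4 characters per token."""
    return math.ceil(len(text) / chars_per_token)


def fits_window(prompt: str, max_output_tokens: int, context_window: int) -> bool:
    """True if estimated input tokens plus reserved output tokens fit the window."""
    return estimate_tokens(prompt) + max_output_tokens <= context_window
```

In a real system the estimate would be replaced by the tokenizer that matches the target model, since the heuristic can be off by a wide margin for code or non-English text.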

[Figure: Context window, a rolling cap on what the model can see at once. Panels: a 32-token window scrolling left as new tokens arrive; lost in the middle; attention cost O(n²) (2n tokens → 4× attention work, 4n → 16×, 8n → 64×; each doubling of n quadruples the work); the landscape, relative to a 4K-token baseline: 4K tokens (GPT-3 era), 32K (GPT-4 / Claude 2, attention ∝ 64×), 200K (Claude 3+, ∝ ~3K×), 1M (Gemini 1.5 Pro, ∝ ~63K×).]

Stated context is a hard ceiling; effective context is a soft one, and production systems are limited by the soft ceiling, not the hard one.

Why it’s bounded

The constraint is the attention mechanism’s O(n²) compute and memory cost in sequence length: every doubling of the window quadruples the work. Models with very long context (Gemini 1.5 Pro at 1M tokens, Claude 3 at 200K, GPT-4 Turbo at 128K) use various tricks (sparse attention, sliding-window patterns, KV-cache optimizations) to make long context tractable, but the underlying scaling is unforgiving.
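To make the quadratic scaling concrete, here is a small sketch that computes attention work relative to a baseline window. It assumes pure O(n²) scaling with no sparse-attention or caching optimizations, and the 4K baseline and the function name are illustrative choices.

```python
def relative_attention_work(n_tokens: int, baseline_tokens: int = 4_000) -> float:
    """Attention work relative to the baseline, assuming pure O(n^2) scaling."""
    return (n_tokens / baseline_tokens) ** 2


for n in (4_000, 32_000, 200_000, 1_000_000):
    print(f"{n:>9,} tokens -> {relative_attention_work(n):,.0f}x")
```

Against a 4K baseline, this gives 64× for 32K tokens, 2,500× for 200K, and 62,500× for 1M, which is why long-context models rely on the optimizations listed above.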

Stated vs effective context

The number you see in the API docs is the stated maximum. The effective context, where the model actually pays close attention to what’s there, is usually much smaller. This degradation, broadly called context rot, was first quantified in the Lost in the Middle paper (Liu et al., 2023): models use information near the start and end of their context most reliably (as a rough rule of thumb, the first ~10% and last ~10%) and progressively miss things in the middle.

So a 200K-window model in practice has maybe ~20K tokens of high-fidelity attention. Beyond that, you’re running on degraded recall — the information is technically present, but the model is not reliably using it.

The middle-position component of context rot is partly architectural and partly distributional. On the architectural side, every transformer attention head sees the full sequence in principle: the math is uniform across positions. But the training distribution isn’t uniform: pretraining and instruction-tuning data weight early and late tokens more heavily because tasks tend to put the question or the answer at the boundaries.

Positional encoding contributes too. Modern RoPE-based encodings extrapolate imperfectly past the training context length, and the failure modes concentrate in the middle of long sequences rather than at the ends. RoPE-extension techniques (NTK-aware scaling, YaRN, dynamic NTK) push the point at which middle-attention degrades further out, but they don’t eliminate it.
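As an illustration of what a RoPE-extension technique changes, the sketch below applies the NTK-aware base adjustment as it is commonly stated: the rotary base is multiplied by scale^(dim / (dim - 2)). The head dimension of 64, the base of 10,000, and the 4× scale are assumed example values, not settings taken from any particular model.

```python
def rope_inv_freq(dim: int, base: float = 10_000.0) -> list[float]:
    """Rotary inverse frequencies base^(-2i/dim) for i = 0 .. dim/2 - 1."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]


def ntk_scaled_base(base: float, scale: float, dim: int) -> float:
    """NTK-aware adjustment: grow the base so the lowest frequency shrinks by `scale`."""
    return base * scale ** (dim / (dim - 2))


dim, scale = 64, 4.0  # ASSUMPTION: example head dimension and context-extension factor
original = rope_inv_freq(dim)
extended = rope_inv_freq(dim, ntk_scaled_base(10_000.0, scale, dim))

# The highest frequency is unchanged; the lowest is stretched by the scale factor.
print(extended[0] / original[0], extended[-1] / original[-1])
```

The design idea is that high-frequency dimensions, which encode fine local position, are left nearly untouched, while low-frequency dimensions are interpolated to cover the longer range.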

The practical implication is that the position of a fact in your prompt matters as much as its inclusion. If you already rerank documents, concatenating them in score order, with the highest-scoring at the top, is a quality lift that adds no cost at inference.

Why this matters for RAG

Two practical implications:

  1. Long context doesn’t replace RAG. The intuition “I’ll just dump all my documents into a 1M-token window and let the LLM figure it out” runs into both cost (per-call price scales with token count) and quality (context rot). Filtering to the relevant ~20K tokens via retrieval and reranking outperforms cramming.
  2. Position in the prompt matters. Order your retrieved documents by score, with the most relevant at the top of the prompt, where the LLM pays the most attention.
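The ordering step in the second point can be sketched in a few lines. This assumes each retrieved document already carries a relevance score from a reranker; `ScoredDoc` and `build_context` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScoredDoc:
    text: str
    score: float  # relevance score from a reranker (higher means more relevant)


def build_context(docs: list[ScoredDoc], separator: str = "\n\n") -> str:
    """Concatenate documents with the highest-scoring one first."""
    ranked = sorted(docs, key=lambda d: d.score, reverse=True)
    return separator.join(d.text for d in ranked)


docs = [ScoredDoc("b", 0.2), ScoredDoc("a", 0.9), ScoredDoc("c", 0.5)]
print(build_context(docs))
```

The sort is stable, so documents with equal scores keep their retrieval order.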

How to budget the window in production

A typical RAG prompt structure that respects the context window:

Token budget for a typical RAG call
  • System prompt (~500 tokens) — instructions, tone, output format.
  • Retrieved context (variable) — the retrieved, reranked documents. Aim for 5K-15K tokens of high-relevance material.
  • Conversation history (~1K-5K tokens) — prior turns if it’s a chat app.
  • User query + reasoning headroom (a few hundred to a few thousand tokens for the model’s output).
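The budget above can be enforced with a simple greedy packer. This is a sketch under stated assumptions: tokens are estimated at roughly four characters each, documents arrive already sorted by relevance, and `pack_context` and its parameter names are hypothetical. The 15K default cap mirrors the upper end of the retrieved-context range in the list.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. ASSUMPTION: ~4 characters per token."""
    return math.ceil(len(text) / chars_per_token)


def pack_context(
    ranked_docs: list[str],
    window: int,
    system_tokens: int,
    history_tokens: int,
    query_tokens: int,
    output_reserve: int,
    context_cap: int = 15_000,
) -> list[str]:
    """Keep top-ranked docs until the retrieved-context budget is used up."""
    available = window - system_tokens - history_tokens - query_tokens - output_reserve
    budget = max(0, min(available, context_cap))
    picked: list[str] = []
    used = 0
    for doc in ranked_docs:
        cost = estimate_tokens(doc)
        if used + cost > budget:
            break  # stop at the first doc that does not fit, preserving rank order
        picked.append(doc)
        used += cost
    return picked
```

Capping retrieved context below the stated window is deliberate: the cap reflects the effective context, not the hard limit.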

If you’re at 30K tokens of input on a 200K window, you have plenty of stated room, but you should still question whether all of that input is genuinely useful. Compression saves money and improves quality.

When the window changes mid-deployment

LLM providers periodically extend the maximum context. The constraint to watch isn’t whether your prompts fit; it’s whether your prompts are cheap and your information sits in the parts the model actually attends to. Both are independent of the stated maximum.

Go further

If a model has a 200K context window, why do I still need RAG?

Two reasons. First, [attention](/concepts/attention/) is O(n²): a 200K-token prompt has 200× the tokens of a 1K-token prompt, so per-token pricing alone makes it about 200× more expensive, and the attention computation itself grows by up to 200² = 40,000×. Second, attention quality degrades well before the stated limit (roughly past the first ~10K-20K tokens), the [context rot](/concepts/context-rot/) phenomenon. RAG sends just the relevant slices, which is both cheaper and more accurate than sending everything.

What's context rot and how do I fight it?

Empirically, LLMs pay closest attention to tokens near the start and end of their context and miss things in the middle — the canonical case is the U-shaped position bias documented by Liu et al. (2023) under the name 'lost in the middle.' A relevant fact buried at position 50,000 of a 100,000-token context can be functionally invisible. Reranking helps by ensuring the most relevant docs are at the top of the prompt, where attention concentrates.

Are 1M-token windows actually useful?

Sometimes — for tasks where the relevant information genuinely is spread across the full corpus and synthesis is the bottleneck. More commonly the 'long context replaces RAG' framing is misleading: even when content fits, the per-call cost and attention quality argue for filtering down to the relevant slice anyway. Long context is a tool, not a substitute for retrieval.
