Autoregressive Generation

Also known as: autoregressive decoding, AR generation, token-by-token generation

TL;DR

Autoregressive generation is the token-by-token loop that decoder LLMs use to produce text: predict the next token from everything generated so far, sample, append, repeat.

Autoregressive generation is how every decoder LLM produces text. The model predicts a probability distribution over the next token given everything seen so far, samples one token from that distribution, appends it to the sequence, and repeats. It stops on an end-of-sequence token or a length limit. It’s a structurally simple loop, and it is the main reason LLM latency, cost, and throughput look the way they do.

[Figure: Autoregressive generation, one token at a time, conditioned on everything so far. A 5-token prompt ("The cat sat on the") is followed by the next-token distribution at step 1, p(· | x₁ … x₅): "mat" 0.42, "floor" 0.18, "sofa" 0.13, "rug" 0.09, "roof" 0.06. Sampled token: "mat" (p = 0.42). The loop: context x₁ … xₙ → logits zₙ ∈ ℝ^|V| → softmax p = σ(zₙ) → sample xₙ₊₁ ~ p → append. One pass, one token; latency ∝ output length.]

The loop

Given a prompt x₁ … xₙ, the model generates output as follows:

  1. Run the prompt through the model. The final layer produces logits zₙ ∈ ℝ^|V| over the vocabulary at position n.
  2. Convert to a probability distribution: p = softmax(zₙ).
  3. Sample xₙ₊₁ from this distribution (or take its argmax for greedy decoding).
  4. Append xₙ₊₁ to the sequence.
  5. Run the model again on the extended sequence to get logits zₙ₊₁ at the new last position.
  6. Sample xₙ₊₂, append, repeat.

Stop when the model produces an end-of-sequence token or a maximum length is reached.
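The loop can be sketched in a few lines of Python. This is a minimal illustration, not a real model: `toy_logits` is a hypothetical stand-in for a transformer forward pass, and the five-word vocabulary is invented for the example.

```python
import math
import random

VOCAB = ["mat", "floor", "sofa", "rug", "<eos>"]

def toy_logits(tokens):
    """Hypothetical stand-in for a model forward pass: one logit per vocab entry."""
    # The "<eos>" logit grows with sequence length so the loop eventually stops.
    return [1.0, 0.5, 0.2, 0.0, 0.3 * len(tokens) - 1.0]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt, max_new_tokens=10, greedy=False, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = softmax(toy_logits(tokens))          # steps 1-2: logits -> distribution
        if greedy:
            idx = max(range(len(probs)), key=probs.__getitem__)
        else:
            idx = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]  # step 3
        if VOCAB[idx] == "<eos>":                    # stop condition
            break
        tokens.append(VOCAB[idx])                    # step 4: append, then loop again
    return tokens

print(generate(["The", "cat", "sat", "on", "the"], greedy=True))
```

Every iteration calls the (toy) model once and yields at most one token, which is the property the rest of this page builds on.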

Why it’s the bottleneck

Reading the prompt and generating each new token are computationally very different:

  • Prefill processes all prompt tokens in parallel — a few large matrix multiplies. GPUs love this.
  • Decode generates each new token sequentially: token t+1 requires token t’s output. No parallelism within a request.

This is why a 1000-token prompt with a 50-token answer takes much less time than a 50-token prompt with a 1000-token answer, even though the total work looks similar. Per-token decode latency is roughly constant; total latency scales with output length.
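A back-of-the-envelope latency model makes the asymmetry concrete. The per-token costs below are hypothetical placeholders chosen only for illustration; real values depend on the model, hardware, and batch size.

```python
def request_latency_ms(prompt_tokens, output_tokens,
                       prefill_ms_per_token=0.2,   # assumed: prompt tokens are processed in parallel
                       decode_ms_per_token=20.0):  # assumed: one sequential step per output token
    prefill = prompt_tokens * prefill_ms_per_token
    decode = output_tokens * decode_ms_per_token
    return prefill + decode

long_prompt = request_latency_ms(prompt_tokens=1000, output_tokens=50)
long_answer = request_latency_ms(prompt_tokens=50, output_tokens=1000)
print(f"1000-in / 50-out: {long_prompt:.0f} ms")
print(f"50-in / 1000-out: {long_answer:.0f} ms")
```

Under these assumed numbers both requests touch 1050 tokens, but the decode-heavy one is more than an order of magnitude slower because its cost is dominated by the sequential term.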

The KV cache, briefly

Naive autoregressive generation re-runs the full transformer over the entire sequence at every step: O(n²) token-level forward computations to generate n tokens. The KV cache avoids this by storing the keys and values from previous positions; each new step only computes Q, K, V for the new token and reads cached K, V for prior positions. This drops generation to O(n) token-level computations in total (each new token still attends over the cached prefix), which is the reason inference is even feasible.
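To see the gap, one can simply count how many token positions pass through the transformer layers in each scheme. The sketch below is a counting exercise under that simplification, not an implementation of a KV cache, and the function names are made up for the example.

```python
def naive_token_passes(prompt_len, new_tokens):
    # Step i re-processes the whole sequence: the prompt plus the i tokens generated so far.
    return sum(prompt_len + i for i in range(new_tokens))

def cached_token_passes(prompt_len, new_tokens):
    # Prefill processes the prompt once; every later step processes only the newest token.
    if new_tokens == 0:
        return 0
    return prompt_len + (new_tokens - 1)

for n_new in (5, 100, 1000):
    print(n_new, naive_token_passes(10, n_new), cached_token_passes(10, n_new))
```

The naive count grows quadratically with the number of generated tokens while the cached count grows linearly. The cached steps still read all stored keys and values for attention, which is why decode remains bound by memory bandwidth.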

Sampling strategies

[Figure: Sampling strategies applied to one shared next-token distribution over the tokens mat, sofa, floor, rug, bed, roof, chair, table, desk, lap. Greedy (argmax): deterministic, can loop. Temperature (T = 1.0): every token has a chance. Top-p (p = 0.7): cut where cumulative mass = p. Top-k (k = 5): keep the k highest, regardless of mass. Highlighted bars are kept; grey bars are rejected before the draw.]

The choice of how to sample from the next-token distribution shapes the output:

  • Greedy. Always pick the argmax. Deterministic, but often dull or repetitive.
  • Temperature. Scale logits by 1/T before softmax. T < 1 sharpens (more confident, less diverse); T > 1 flattens.
  • Top-k. Restrict to the k highest-probability tokens before sampling.
  • Top-p (nucleus). Restrict to the smallest set of tokens whose probabilities sum to at least p.
  • Beam search. Maintain b candidate sequences, expand each, keep the best b by joint probability. Better for translation and code; rarely used for open-ended chat.
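The temperature, top-k, and top-p rules above can be written as small filters over a probability list. This is a plain-Python sketch for illustration; production inference stacks implement the same ideas on tensors, and the helper names here are not taken from any library.

```python
import math
import random

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]   # scale logits by 1/T
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def renormalize(probs, keep):
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])         # keep the k most likely tokens

def top_p_filter(probs, p):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:                               # smallest set whose mass reaches p
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return renormalize(probs, keep)

def sample(probs, rng):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = softmax([2.0, 1.0, 0.5, 0.0, -1.0], temperature=0.7)
print(sample(top_p_filter(probs, 0.9), random.Random(0)))
```

The filters compose: a common setup applies temperature first, then top-k or top-p, then draws one token from what is left.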

For RAG and retrieval-adjacent uses, low-temperature or greedy decoding is typical — you want determinism, not creative variation.

Implications for production

Per-token cost is the structural reality of LLMs. It’s why output-length budgeting matters in prompts, why a retrieval step that surfaces the right 5K tokens out of 1M wins (the LLM emits a short, on-target answer instead of plowing through everything), and why specialized small models that return a single score or vector in one parallel forward pass, with no loop, are so much faster than asking an LLM to do the same task.

In speculative decoding, a small “draft” model generates 4 candidate tokens autoregressively (4 × 5ms = 20ms). The big “target” model then runs one parallel forward pass over those 4 tokens and produces logits for each position, costing roughly the same as generating one token alone (~50ms).

Compare the draft tokens against the target’s distribution at each position. Where they agree, accept; at the first disagreement, take the target’s choice and discard the rest. Best case: all 4 tokens are accepted for one big-model pass plus the draft time, 70ms instead of 4 × 50ms = 200ms (about 2.9×, approaching 4× as the draft gets cheaper). Worst case: 1 token, with the 20ms of draft work wasted.

The win depends on draft-target agreement, which depends on distillation quality. Production stacks (vLLM, llama.cpp, NVIDIA Triton) typically get 2-3× wall-clock speedups. The technique pays off because generation is memory-bandwidth-bound: a parallel forward pass over a few tokens uses the same bandwidth as a single-token decode while producing multiple tokens of progress.
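The accept-or-reject step and its timing can be sketched as follows. This is a simplified, greedy-matching version that assumes the draft is accepted only where it equals the target’s own choice; the function names are illustrative, and the default timings reuse the 5ms and 50ms figures from the example above.

```python
def accept_draft(draft_tokens, target_tokens):
    """Keep draft tokens until the first mismatch, then take the target's token there."""
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)   # the target's choice replaces the rejected draft token
            break                # everything after the mismatch is discarded
    return accepted

def speedup(tokens_gained, draft_len=4, draft_ms=5.0, target_ms=50.0):
    round_ms = draft_len * draft_ms + target_ms   # draft loop plus one target pass
    baseline_ms = tokens_gained * target_ms       # same tokens from the target alone
    return baseline_ms / round_ms

tokens = accept_draft(["on", "the", "mat", "."], ["on", "the", "rug", "today"])
print(tokens, round(speedup(len(tokens)), 2))
```

With the draft time included, a fully accepted round of 4 tokens takes 70ms against a 200ms baseline, while a round that yields only 1 token is slower than plain decoding. The average acceptance rate therefore decides whether the technique pays off.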

T=0 (greedy) always picks the argmax. This sounds like the highest-probability sequence but isn’t — it’s only locally optimal. The globally most-probable sequence might need a slightly lower-probability token at step 5 to unlock a much higher-probability continuation.
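A two-step toy distribution shows the gap between locally and globally optimal choices. The probabilities are invented for the example; any model where a slightly less likely first token leads to a much more confident continuation behaves the same way.

```python
# Hypothetical two-step model: first-token probabilities, then conditional continuations.
FIRST = {"A": 0.6, "B": 0.4}
SECOND = {
    "A": {"x": 0.55, "y": 0.45},
    "B": {"x": 0.90, "y": 0.10},
}

def greedy_sequence():
    first = max(FIRST, key=FIRST.get)                    # locally best first token
    second = max(SECOND[first], key=SECOND[first].get)   # locally best continuation
    return (first, second), FIRST[first] * SECOND[first][second]

def best_sequence():
    # Exhaustive search over all two-token sequences (feasible only for toy sizes).
    candidates = {(f, s): FIRST[f] * SECOND[f][s] for f in FIRST for s in SECOND[f]}
    best = max(candidates, key=candidates.get)
    return best, candidates[best]

print("greedy:", greedy_sequence())
print("best:  ", best_sequence())
```

Greedy commits to "A" because 0.6 > 0.4 and ends with joint probability 0.6 × 0.55 = 0.33, while the sequence starting with "B" reaches 0.4 × 0.9 = 0.36. Beam search narrows this gap by keeping several prefixes alive.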

For constrained generation (structured output, code, math), argmax can paint the model into a corner where the only valid continuation has near-zero probability and the model loops or emits garbage. Low temperature (T=0.1 to 0.3) escapes these dead ends while staying close to deterministic.

T=0 also breaks self-consistency and similar majority-vote techniques: N samples become N identical answers. Use T=0 for RAG, classification, fact-extraction. Use T=0.5-0.7 for open-ended generation. Test your workload.

Go further

Why is generation so much slower than reading the prompt?

Reading the prompt (the prefill phase) processes all input tokens in parallel — one big matrix multiply. Generation (the decode phase) is fundamentally sequential: each token depends on the previous one. Parallelism vanishes; you're paying per token. The [KV cache](/concepts/kv-cache/) makes each step cheap, but you can't skip steps.

What sampling strategies sit on top of autoregressive generation?

Greedy (always take the argmax), temperature sampling (scale [logits](/concepts/logits/) by 1/T before softmax, then sample), top-k (sample from the k highest-prob tokens), top-p / nucleus (sample from the smallest set of top tokens whose cumulative probability reaches p). Beam search keeps multiple partial candidates for higher-quality output at extra compute cost.

Can you parallelize autoregressive generation?

Partially. Speculative decoding has a small fast model propose several tokens, then the big model verifies them in one parallel forward pass — accepting all that match. This trades a draft model's quality for parallelism in the big model and yields ~2-3× speedups in practice.
