Large Language Model (LLM)

Also known as: LLM, foundation model, frontier model

TL;DR

A large language model is a transformer-based neural network trained on vast text corpora to predict the next token. Modern LLMs (GPT, Claude, Gemini) are general-purpose reasoning engines.

A large language model (LLM) is a neural network trained to predict the next token in a sequence, given everything before. Train one on trillions of tokens of books, code, web pages, and papers and the network develops a startlingly general capability for language understanding, reasoning, and generation.

[Figure: Large language model, next-token probability. Tokens in, distribution over the next token out. Pipeline: input tokens → embedding E[token id] → stack of N transformer blocks (each with multi-head attention and a feed-forward MLP) → unembedding (h → logits) → p(next token). Early blocks: surface form, token-id features, simple syntax, basic coreference. Middle blocks: phrase composition, NER, semantic roles, factual recall. Late blocks: intent and answer planning, next-token routing. The final residual x_L at the last position is a single d-dimensional vector summarizing "everything so far"; z = W_U · LN(x_L), followed by softmax. Example for "The cat sat on the": "mat" 0.62, "rug" 0.18, "floor" 0.08, "roof" 0.05, "lap" 0.04, "sofa" 0.03; argmax "mat" is appended to the sequence. At frontier scale: N ≈ 32–120 blocks, d ≈ 4,096–16,384, vocab ≈ 100K–250K, context ≈ 8K–1M tokens. Training objective: maximize log p(x_t | x_<t) (next-token cross-entropy) over ≈ 1–15T training tokens.]

The “large” matters. Small language models existed for decades; what makes LLMs qualitatively different is scale. At ~10B+ parameters trained on ~1T+ tokens, models cross thresholds where they handle tasks they were never explicitly trained on — translation, code generation, multi-step reasoning, instruction following — purely as a side effect of next-token prediction at scale.

How they work, in one paragraph

LLMs are almost universally transformer architectures. Input text is broken into tokens, each token is embedded into a high-dimensional vector, and stacked transformer layers progressively transform those vectors via self-attention so each position can read information from other positions (in autoregressive models, the ones before it). The final layer produces a probability distribution over the next token; sampling repeatedly produces text. Training is self-supervised next-token prediction on huge corpora, optionally followed by fine-tuning for specific behaviors and RLHF for alignment.
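The "distribution out, sample, append, repeat" loop can be sketched in a few lines of Python. This is a minimal illustration, not a real model: `toy_logits` is a hypothetical stand-in for the transformer forward pass, and it always returns the same six scores, chosen to match the example probabilities in the figure above. `VOCAB`, `softmax`, `sample_next`, and `generate` are likewise names invented for this sketch.

```python
import math
import random

# Six-word toy vocabulary, taken from the figure's example distribution.
VOCAB = ["mat", "rug", "floor", "roof", "lap", "sofa"]


def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution (temperature must be > 0)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def toy_logits(context):
    """Hypothetical stand-in for a transformer forward pass.

    A real model would compute these scores from `context`; here they are fixed.
    """
    return [math.log(p) for p in (0.62, 0.18, 0.08, 0.05, 0.04, 0.03)]


def sample_next(logits, rng, temperature=1.0):
    """Draw one token index from the softmax distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate(prompt_tokens, steps, seed=0):
    """Autoregressive loop: score, sample, append, repeat."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(steps):
        idx = sample_next(toy_logits(tokens), rng)
        tokens.append(VOCAB[idx])
    return tokens


if __name__ == "__main__":
    print(generate(["The", "cat", "sat", "on", "the"], steps=3))
```

Lowering `temperature` sharpens the distribution toward the most likely token; greedy decoding (always taking the argmax) is the limiting case.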

[Figure: Large language model, why scale matters. Loss falls as a power law in parameters. Log-log plot of validation loss (about 1.1 to 2.5) against parameter count (1B to 1T), with labeled points GPT-2 (1.5B params), Chinchilla-optimal (~70B params, 1.4T tokens), GPT-3 (175B params), and GPT-4-class (~1.8T params, MoE). Compute-optimal rule of thumb: tokens ≈ 20 × params. Fitted form: L(N) ≈ E + A · N^(−α) with α ≈ 0.3, where L → E (the irreducible loss) as N → ∞. Each doubling of params shaves roughly the same fraction off loss until data limits kick in.]
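To make the power-law form from the figure concrete, here is a small sketch that evaluates L(N) ≈ E + A · N^(−α) and the tokens ≈ 20 × params rule of thumb. Only α ≈ 0.3 and the factor of 20 come from the figure; the values of `E_ASSUMED` and `A_ASSUMED` are illustrative assumptions picked to land in the plotted range, not fitted constants.

```python
ALPHA = 0.3        # exponent quoted in the figure
E_ASSUMED = 1.0    # assumption: irreducible loss, illustrative only
A_ASSUMED = 750.0  # assumption: scale constant, illustrative only


def power_law_loss(n_params, irreducible=E_ASSUMED, scale=A_ASSUMED, alpha=ALPHA):
    """Evaluate L(N) = E + A * N**(-alpha)."""
    return irreducible + scale * n_params ** (-alpha)


def compute_optimal_tokens(n_params, tokens_per_param=20):
    """Rule of thumb from the figure: training tokens ~ 20 x parameters."""
    return tokens_per_param * n_params


if __name__ == "__main__":
    for n in (10**9, 10**10, 10**11, 10**12):
        loss = power_law_loss(n)
        tokens = compute_optimal_tokens(n)
        print(f"{n:>16,d} params  loss ~ {loss:.2f}  tokens ~ {tokens:,d}")
```

Under this form, each doubling of N multiplies the reducible part of the loss, L − E, by 2^(−0.3) ≈ 0.81, which is the "same fraction per doubling" behavior the caption describes.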

What LLMs are good at

  • Open-ended reasoning over long passages
  • Following instructions in natural language
  • Synthesizing across domains the user didn’t have to specify
  • Code generation, translation, summarization

What they’re bad at

  • Anything requiring precise recall of facts not in their context: they confidently hallucinate
  • Cheap, high-volume narrow tasks where the per-call cost dominates
  • Calibrated probability outputs (an LLM saying “I’m 70% sure” doesn’t usually mean 70%)
  • Latency-sensitive applications — frontier LLMs run at hundreds-of-ms minimum per call, often seconds

The frontier-LLM landscape today

  • GPT family (OpenAI) — GPT-5 and successors; closed-weight, premium pricing.
  • Claude (Anthropic) — Sonnet and Opus; closed-weight, strong on long context and reasoning.
  • Gemini (Google) — multimodal-first; closed-weight; deep integration with Google’s stack.
  • Llama (Meta) — open-weight; the basis for most fine-tuned and self-hosted production deployments.
  • Qwen (Alibaba) and DeepSeek — open-weight; competitive on many benchmarks at much lower cost than frontier closed models.

Why the production AI stack isn’t all LLM

Production AI systems wrap LLMs in retrieval, reranking, validation, and routing layers. The LLM does the parts only it can do (open-ended reasoning over context); everything else gets handled by specialized components — many of them small models trained on the exact narrow task they’re doing. The long-term cost-and-quality balance favors more of the stack moving to specialized models, not less.
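As a rough sketch of that layering, the toy pipeline below runs a cheap retrieval pass, a reranking pass, a single LLM call, and a validation check. Everything here is an assumption made for illustration: word overlap stands in for a real retriever, Jaccard similarity stands in for a small trained reranker, and `call_llm` is a hypothetical placeholder rather than any vendor's API.

```python
def retrieve(query, corpus, k=3):
    """Cheap first pass: rank documents by number of words shared with the query."""
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return ranked[:k]


def rerank(query, docs, k=1):
    """Stand-in for a small specialized reranker: Jaccard similarity of word sets."""
    q = set(query.lower().split())

    def score(doc):
        words = set(doc.lower().split())
        return len(q & words) / len(q | words)

    return sorted(docs, key=score, reverse=True)[:k]


def call_llm(prompt):
    """Hypothetical placeholder for a real LLM call."""
    return f"[LLM answer based on {len(prompt)} prompt characters]"


def validate(reply):
    """Minimal validation layer: reject empty replies."""
    return bool(reply.strip())


def answer(query, corpus):
    """Route the query through retrieval and reranking before the one LLM call."""
    context = rerank(query, retrieve(query, corpus))
    prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + query
    reply = call_llm(prompt)
    if not validate(reply):
        raise ValueError("LLM reply failed validation")
    return reply
```

The point of the shape is that only `call_llm` needs the expensive generalist; the surrounding functions are the slots where specialized components would go.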

“Bigger is always better” comes from broad-coverage benchmarks — MMLU, BIG-bench, agentic harnesses — where the generalist must handle every possible task. On a narrow task, the generalist pays for capacity it doesn’t need: most of those 70B parameters encode Latin grammar, Python AST shapes, organic chemistry, and a thousand other things irrelevant to “is this document relevant to this query”.

A 0.5B model has just enough capacity for the narrow task plus the linguistic competence to read inputs. Trained on millions of task-specific examples, every parameter does useful work for the task. Effective task-specific capacity ends up higher than the generalist’s, even at 100× smaller total size, because the generalist amortizes parameters across thousands of competing demands.

The economics compound. The 0.5B serves at sub-50ms latency on commodity GPUs at tens of dollars per million inferences instead of thousands. For a workflow doing 1B retrievals per month, that gap is the entire infrastructure budget.
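The arithmetic behind that claim is simple enough to write down. The prices below are assumptions chosen from inside the ranges stated above ("tens" and "thousands" of dollars per million inferences); substitute real quotes to get a real budget.

```python
def monthly_cost(calls_per_month, dollars_per_million_calls):
    """Monthly spend given call volume and a price per million calls."""
    return calls_per_month / 1_000_000 * dollars_per_million_calls


CALLS_PER_MONTH = 1_000_000_000  # 1B retrievals per month, as in the text
SMALL_MODEL_PRICE = 30           # assumption: $30 per million inferences
FRONTIER_PRICE = 3_000           # assumption: $3,000 per million inferences

small = monthly_cost(CALLS_PER_MONTH, SMALL_MODEL_PRICE)
frontier = monthly_cost(CALLS_PER_MONTH, FRONTIER_PRICE)

print(f"small model: ${small:,.0f} per month")        # small model: $30,000 per month
print(f"frontier LLM: ${frontier:,.0f} per month")    # frontier LLM: $3,000,000 per month
print(f"ratio: {frontier / small:.0f}x")              # ratio: 100x
```

With these assumed prices the same workload costs $30,000 versus $3,000,000 a month, a 100× gap.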

The future of production AI is a constellation of fine-tuned specialists wrapped around frontier LLMs, not one giant LLM doing everything.

Go further

Why pair an LLM with specialized small models?

An LLM has to relearn the task from instructions on every call — it's a generalist that's expensive to run at scale. For high-volume narrow tasks (retrieval, reranking, classification, query rewriting) a small specialized model trained on the exact task runs 10–100× faster at a fraction of the cost, and beats the LLM on the narrow thing it was trained for.

What's a 'frontier' model vs a regular LLM?

Frontier means at-or-near the state of the art in capability — typically the largest models from OpenAI, Anthropic, and Google. Most LLMs in production aren't frontier; they're smaller, fine-tuned, or specialized variants. The distinction matters for cost: frontier models charge 10–100× per token over open-weight or smaller closed models.

What sets the maximum size of input an LLM can handle?

The context window — the maximum number of tokens the model can attend to at once. Modern LLMs span 8K to 1M+ tokens, but quality degrades sharply past the first ~10K even when the nominal window is larger.
