Large Language Model (LLM)

Also known as: LLM, foundation model, frontier model

TL;DR

A large language model is a transformer-based neural network trained on vast text corpora to predict the next token. Modern LLMs (GPT, Claude, Gemini) are general-purpose reasoning engines.

A large language model (LLM) is a neural network trained to predict the next token in a sequence, given everything before it. Train one on trillions of tokens of books, code, web pages, and papers, and the network develops a startlingly general capability for language understanding, reasoning, and generation.

The “large” matters. Small language models existed for decades; what makes LLMs qualitatively different is scale. At ~10B+ parameters trained on ~1T+ tokens, models cross thresholds where they handle tasks they were never explicitly trained on — translation, code generation, multi-step reasoning, instruction following — purely as a side effect of next-token prediction at scale.

How they work, in one paragraph

LLMs are almost universally transformer architectures. Input text is broken into tokens, each token is embedded into a high-dimensional vector, and stacked transformer layers progressively transform those vectors via attention so each position can read information from any other. The final layer produces a probability distribution over the next token; sampling repeatedly produces text. Training is self-supervised next-token prediction on huge corpora, optionally followed by fine-tuning for specific behaviors and RLHF for alignment.
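The last step of the paragraph above (turn scores into a probability distribution, then sample repeatedly) can be sketched in a few lines. This is a toy illustration, not how any particular model is implemented: the vocabulary and the `TOY_LOGITS` table are made up, and the stand-in "model" looks only at the previous token, whereas a real LLM conditions on the whole prefix.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores (logits) into a probability distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, rng, temperature=1.0):
    """Sample one token id from the distribution implied by the logits."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical toy "model": fixed logits keyed by the previous token id.
VOCAB = ["the", "cat", "sat", "<eos>"]
TOY_LOGITS = {
    0: [0.1, 3.0, 0.5, 0.1],  # after "the", "cat" is most likely
    1: [0.1, 0.1, 3.0, 0.5],  # after "cat", "sat" is most likely
    2: [0.5, 0.1, 0.1, 3.0],  # after "sat", "<eos>" is most likely
    3: [0.0, 0.0, 0.0, 0.0],
}

def generate(start_id, max_tokens=10, seed=0):
    """Sample tokens one at a time until <eos> or the length limit."""
    rng = random.Random(seed)
    ids = [start_id]
    while len(ids) < max_tokens and ids[-1] != 3:
        ids.append(sample_next(TOY_LOGITS[ids[-1]], rng))
    return [VOCAB[i] for i in ids]

print(generate(0))
```

Lowering `temperature` sharpens the distribution toward the highest-scoring token; raising it makes the output more varied.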

What LLMs are good at

Open-ended reasoning over long passages
Following instructions in natural language
Synthesizing across domains the user didn’t have to specify
Code generation, translation, summarization

What they’re bad at

Anything requiring precise recall of facts not in their context — they hallucinate confidently
Cheap, high-volume narrow tasks where the per-call cost dominates
Calibrated probability outputs (an LLM saying “I’m 70% sure” doesn’t usually mean 70%)
Latency-sensitive applications, since frontier LLMs take hundreds of milliseconds per call at minimum, and often seconds
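The calibration point in the list above can be made concrete. One common way to measure it is expected calibration error: group predictions by stated confidence and compare each group's average confidence with how often it was actually right. The sketch below is a minimal version with equal-width bins; the function name and the sample data are illustrative only.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# Made-up example: ten answers all stated at 70% confidence, only four correct.
stated = [0.7] * 10
was_right = [True] * 4 + [False] * 6
print(expected_calibration_error(stated, was_right))
```

A well-calibrated system scores near zero; in the made-up example, the gap between 70% stated and 40% observed shows up directly in the result.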

The frontier-LLM landscape today

GPT family (OpenAI) — GPT-5 and successors; closed-weight, premium pricing.
Claude (Anthropic) — Sonnet and Opus; closed-weight, strong on long context and reasoning.
Gemini (Google) — multimodal-first; closed-weight; deep integration with Google’s stack.
Llama (Meta) — open-weight; the basis for most fine-tuned and self-hosted production deployments.
Qwen (Alibaba) and DeepSeek — open-weight; competitive on many benchmarks at much lower cost than frontier closed models.

Why the production AI stack isn’t all LLM

Production AI systems wrap LLMs in retrieval, reranking, validation, and routing layers. The LLM does the parts only it can do (open-ended reasoning over context); everything else gets handled by specialized components — many of them small models trained on the exact narrow task they’re doing. The long-term cost-and-quality balance favors more of the stack moving to specialized models, not less.
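A rough sketch of that layering, under heavy simplification: the retriever and reranker here are word-overlap stand-ins for what would be specialized models in production, routing is omitted, `llm` is any caller-supplied function from prompt to text, and every name is hypothetical.

```python
from typing import Callable, List

def retrieve(query: str, corpus: List[str], k: int = 3) -> List[str]:
    """Cheap first stage: rank documents by words shared with the query."""
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return ranked[:k]

def rerank(query: str, docs: List[str], k: int = 2) -> List[str]:
    """Stand-in for a small specialized reranker: overlap normalized by length."""
    q = set(query.lower().split())

    def score(doc: str) -> float:
        words = doc.lower().split()
        return len(q & set(words)) / max(len(words), 1)

    return sorted(docs, key=score, reverse=True)[:k]

def validate(answer: str) -> bool:
    """Minimal output check: non-empty and within a length budget."""
    return 0 < len(answer) <= 500

def answer_query(query: str, corpus: List[str], llm: Callable[[str], str]) -> str:
    """Retrieve, rerank, call the LLM on the surviving context, then validate."""
    context = rerank(query, retrieve(query, corpus))
    prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + query
    answer = llm(prompt)
    return answer if validate(answer) else "No valid answer."
```

The LLM is called once, on a short prompt; the high-volume work of narrowing the corpus happens in the cheaper stages around it.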

“Bigger is always better” comes from broad-coverage benchmarks (MMLU, BIG-bench, agentic harnesses) where the generalist must handle every possible task. On a narrow task, the generalist pays for capacity it doesn’t need: most of the parameters in, say, a 70B model encode Latin grammar, Python AST shapes, organic chemistry, and a thousand other things irrelevant to “is this document relevant to this query?”.

A 0.5B model has just enough capacity for the narrow task plus the linguistic competence to read inputs. Trained on millions of task-specific examples, every parameter does useful work for the task. Effective task-specific capacity can end up higher than the generalist’s, even at roughly 100× smaller total size, because the generalist amortizes its parameters across thousands of competing demands.

The economics compound. The 0.5B model serves at sub-50ms latency on commodity GPUs, at tens of dollars per million inferences instead of thousands. For a workflow doing 1B retrievals per month, that gap is the entire infrastructure budget.
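To put numbers on that gap, here is the arithmetic with two assumed price points: $20 per million inferences for the small model and $2,000 per million for the large one. Both figures are placeholders chosen to match "tens" and "thousands", not quoted prices.

```python
def monthly_cost(calls_per_month: int, dollars_per_million: float) -> float:
    """Monthly spend given call volume and a per-million-call price."""
    return calls_per_month / 1_000_000 * dollars_per_million

CALLS_PER_MONTH = 1_000_000_000  # 1B retrievals per month, as in the text

small_model = monthly_cost(CALLS_PER_MONTH, 20.0)     # assumed price
large_model = monthly_cost(CALLS_PER_MONTH, 2_000.0)  # assumed price

print(f"small model: ${small_model:,.0f}/month")  # small model: $20,000/month
print(f"large model: ${large_model:,.0f}/month")  # large model: $2,000,000/month
print(f"ratio: {large_model / small_model:.0f}x")  # ratio: 100x
```

Under these assumptions the difference is about $1.98M per month at 1B calls.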

The future of production AI is a constellation of fine-tuned specialists wrapped around frontier LLMs, not one giant LLM doing everything.

Go further

Why pair an LLM with specialized small models?

An LLM has to relearn the task from instructions on every call — it's a generalist that's expensive to run at scale. For high-volume narrow tasks (retrieval, reranking, classification, query rewriting) a small specialized model trained on the exact task runs 10–100× faster at a fraction of the cost, and beats the LLM on the narrow thing it was trained for.

Related: Reranker, Fine-tuning, Knowledge distillation

What's a 'frontier' model vs a regular LLM?

Frontier means at or near the state of the art in capability, typically the largest models from OpenAI, Anthropic, and Google. Most LLMs in production aren't frontier; they're smaller, fine-tuned, or specialized variants. The distinction matters for cost: frontier models typically cost 10–100× more per token than open-weight or smaller closed models.

Related: Knowledge distillation, zELO

What sets the maximum size of input an LLM can handle?

The context window sets it: this is the maximum number of tokens the model can attend to at once. Modern LLMs offer windows from 8K to 1M+ tokens, but quality degrades sharply past the first ~10K tokens even when the nominal window is larger.
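In practice the window is a hard budget that an application has to enforce before calling the model. A minimal sketch, assuming the input is already tokenized and that dropping the oldest tokens is acceptable (one policy among several; the function name is hypothetical):

```python
def fit_to_window(tokens, window, reserve_for_output=0):
    """Keep the most recent tokens that fit, leaving room for the model's reply."""
    budget = window - reserve_for_output
    if budget <= 0:
        raise ValueError("no room left for input tokens")
    if len(tokens) <= budget:
        return tokens
    return tokens[-budget:]  # drop the oldest tokens first

# Ten input tokens, an 8-token window, 2 tokens reserved for output.
print(fit_to_window(list(range(10)), window=8, reserve_for_output=2))
# [4, 5, 6, 7, 8, 9]
```

Reserving output space matters because generated tokens count against the same window as the prompt.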

Related: Context window, Tokenization, Context compression
