A Working Reference for Modern AI
Definitions, formulas, and animated diagrams for the concepts behind LLMs, embeddings, retrieval, agents, and the systems they live in — cross-linked and scannable.
New here? Start with the playbooks — job-shaped recipes that thread back through these concepts as they're needed.
- Activation Function
An activation function is the elementwise nonlinearity sandwiched between the linear layers of a neural network. Without it, the whole network collapses to a single linear map.
- ANOVA
Analysis of variance — the statistical test that asks 'do these groups differ more than within-group noise would predict?' Partitions total variation in the data into a between-group component and a within-group component, then compares them via the F-statistic.
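As a rough illustration of that partition, the sketch below computes a one-way F-statistic from scratch. The function name `one_way_f` and the sample data are invented for this example.

```python
def one_way_f(groups):
    """Return the one-way ANOVA F-statistic for a list of groups (lists of floats)."""
    all_values = [x for g in groups for x in g]
    n_total = len(all_values)
    k = len(groups)
    grand_mean = sum(all_values) / n_total

    # Between-group sum of squares: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of each observation around its own group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)        # k - 1 degrees of freedom
    ms_within = ss_within / (n_total - k)    # n - k degrees of freedom
    return ms_between / ms_within


groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = one_way_f(groups)
```

A large F means the group means are spread further apart than the within-group noise alone would suggest.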
- Backpropagation
Backpropagation is the chain-rule application that computes the gradient of the loss with respect to every parameter in a neural network. A forward pass produces predictions.
- Batch Normalization
Batch normalization standardizes each activation across the batch dimension to zero mean and unit variance, then applies a learned affine transform. Introduced by Ioffe and Szegedy in 2015, it dominated vision for years.
- Batch Size
Batch size is the number of training examples averaged into a single gradient step. Larger batches give cleaner gradients but can generalize worse; smaller batches are noisier but regularize implicitly.
- Bayes' Rule
Bayes' rule $P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$ is the math of updating beliefs given evidence. Posterior ∝ likelihood × prior.
- Beta Distribution
The Beta distribution $\mathrm{Beta}(\alpha, \beta)$ is a continuous distribution on $[0, 1]$ and the conjugate prior to the Bernoulli/Binomial.
- Bias-Variance Tradeoff
The bias-variance tradeoff is the classical decomposition of prediction error into three additive parts: squared bias, variance, and irreducible noise.
- Brownian Motion
A continuous-time stochastic process with independent Gaussian increments. The continuous-state, continuous-time limit of a random walk — and the foundational object for stochastic calculus, diffusion models, and the noise terms in modern stochastic processes.
- Cohen's d
The standardized mean difference between two groups. Cohen's d expresses the gap between two group means in units of how spread-out each group is — the most-cited effect size for two-sample comparisons.
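One common way to compute it uses the pooled sample standard deviation as the unit of spread. The sketch below assumes that variant; the function name `cohens_d` and the data are illustrative only.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(a), len(b)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


d = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
```

Here both groups have a sample standard deviation of 2 and the means differ by 1, so the gap is half a standard deviation.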
- Conjugate Prior
A prior distribution is conjugate to a likelihood when multiplying them produces a posterior in the same family as the prior — so Bayesian updates reduce to arithmetic on the prior's parameters instead of an integral.
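The Beta-Bernoulli pair is the textbook case. In the sketch below (the function name is hypothetical), updating a Beta prior on coin-flip data is just adding counts to the two parameters.

```python
def beta_bernoulli_update(alpha, beta, observations):
    """Posterior Beta parameters after observing a list of 0/1 outcomes."""
    successes = sum(observations)
    failures = len(observations) - successes
    # Conjugacy: the posterior is Beta(alpha + successes, beta + failures).
    return alpha + successes, beta + failures


# Start from a uniform Beta(1, 1) prior and observe three successes and one failure.
alpha_post, beta_post = beta_bernoulli_update(1.0, 1.0, [1, 0, 1, 1])
posterior_mean = alpha_post / (alpha_post + beta_post)
```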
- Cross-Entropy Loss
Cross-entropy loss is $H(p, q) = -\sum_x p(x) \log q(x)$: the average number of nats it costs to encode samples from the true distribution $p$ using the model's predicted distribution $q$.
- Double Descent
Double descent is the empirical phenomenon where test error, plotted against model size, first goes down (classical regime), then up (peaking near the interpolation threshold), then down again (modern regime).
- Dropout
Dropout randomly zeroes a fraction of activations during training, forcing the network to spread its representations across many redundant paths instead of co-adapting onto a few. It is switched off at inference.
- Early Stopping
Early stopping halts training when validation loss starts climbing, even though training loss is still falling. It is among the cheapest regularizers available: one patience setting, no extra compute, no extra parameters.
- Effect Size
The complement to p-values. A p-value tells you whether a difference is unlikely under the null; an effect size tells you how big it is. For multi-group designs, the F-statistic and η² are the workhorses.
- Eigenvalue and Eigendecomposition
Eigenvalues and eigenvectors decompose a square matrix into directions of pure scaling. The resulting spectral decomposition $A = Q \Lambda Q^{-1}$ underpins SVD, PCA, Markov mixing time, and the low-rank circuit analyses used in mechanistic interpretability.
- Entropy
Entropy $H(p) = -\sum_x p(x) \log p(x)$ is the average number of nats (or bits) needed to encode samples from $p$. It is the unit of uncertainty.
- Epoch
An epoch is one full pass over the training set. Classic deep learning trains for tens to hundreds of epochs; modern LLM pretraining is often sub-1-epoch, so each token is seen at most once. Fine-tuning typically runs 1-10 epochs.
- Feedforward Network
The feedforward network — the MLP — is the per-position sub-layer that sits next to attention in every transformer block. Two linear layers with an activation in between, applied independently to each token's hidden state.
- GELU
GELU is x · Φ(x), where Φ is the standard-normal CDF. A smooth, differentiable-everywhere relative of ReLU that BERT popularized and that many major transformers have used since.
- Gradient Clipping
Gradient clipping caps the norm of the gradient before applying the optimizer step, preventing rare but catastrophic large gradients from blowing up training. The modern default is global-norm clipping at threshold 1.0.
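A minimal sketch of global-norm clipping on plain Python lists follows. The function name and the nested-list gradient layout are assumptions made for illustration; real frameworks ship their own utilities for this.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """grads: list of per-parameter gradient lists. Returns (clipped_grads, norm_before)."""
    # One norm over every gradient entry, not a separate norm per parameter.
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # Scale down only when the norm exceeds the threshold; otherwise leave untouched.
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [[g * scale for g in grad] for grad in grads], total_norm


clipped, norm_before = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
```

Because every entry is multiplied by the same factor, the gradient keeps its direction and only its length changes.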
- Gradient Descent
Gradient descent is the iterative optimization procedure that powers virtually all of deep learning. Compute the gradient of the loss with respect to parameters, take a small step in the opposite direction, repeat.
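The loop is short enough to write out in full. This sketch minimizes a one-dimensional quadratic; the function name, learning rate, and step count are arbitrary choices for the example.

```python
def gradient_descent(grad_fn, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad_fn(x)
    return x


# f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
```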
- Grokking
Grokking is the training-dynamics phenomenon where a model first memorizes the training set, then much later — often suddenly — learns to generalize to held-out data.
- Hankel Matrix
A matrix whose anti-diagonals are constant: each entry depends only on the sum of its indices, $H_{ij} = h_{i+j}$. The natural data structure for turning a 1D time series into a 2D matrix you can apply SVD to.
- KL Divergence
KL divergence $D_{\mathrm{KL}}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$ measures how far one distribution is from another, in nats. It is asymmetric, non-negative, and zero only when the two distributions are identical.
- Learning Rate
The learning rate is the scalar η in the gradient-descent update: how big a step to take in the direction of the negative gradient. Too high diverges, too low stalls, and it is arguably the single most important hyperparameter in training.
- Markov Chain
A stochastic process where the next state depends only on the current state, not the history that led to it. The 'memoryless' property — encoded in a single transition matrix — turns multi-step prediction into matrix multiplication.
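To make the matrix-multiplication point concrete, here is a small sketch with a made-up two-state transition matrix. Each step multiplies the current state distribution (a row vector) by the matrix.

```python
def step(dist, P):
    """One step of the chain: row vector dist times transition matrix P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def n_step(dist, P, n):
    """Distribution after n steps, obtained by applying the one-step update n times."""
    for _ in range(n):
        dist = step(dist, P)
    return dist


# Rows sum to 1: P[i][j] is the probability of moving from state i to state j.
P = [[0.9, 0.1],
     [0.5, 0.5]]
after_two = n_step([1.0, 0.0], P, 2)
long_run = n_step([1.0, 0.0], P, 200)
```

For this matrix the long-run distribution settles near (5/6, 1/6) regardless of the starting state.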
- Matrix Factorization
Writing a matrix as a product of smaller or more structured matrices. SVD, NMF, QR, LU, Cholesky, eigendecomposition — same general idea under different structural constraints. Underlies essentially every low-rank method in modern machine learning.
- Maximum Likelihood Estimation
MLE picks the parameters $\hat{\theta}$ that maximize the probability of the observed data under the model; equivalently, that minimize negative log-likelihood. Cross-entropy training is MLE under a categorical model.
- Mutual Information
Mutual information $I(X; Y)$ is the reduction in uncertainty about $X$ once you observe $Y$. It is the symmetric, information-theoretic measure of how much two variables share.
- Normal Distribution
The Gaussian $\mathcal{N}(\mu, \sigma^2)$ is the bell-curve density $\frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$. It shows up everywhere because of the central limit theorem.
- Optimizer
An optimizer is the wrapper around vanilla gradient descent that decides how each parameter actually gets updated. Adam, AdamW, and SGD-with-momentum are the workhorses.
- Overfitting
Overfitting is the failure mode where a model memorizes its training set instead of learning patterns that generalize. It's the central concern of classical statistical learning.
- Pearson Correlation
Pearson's $r$ measures the strength of a linear relationship between two variables, on $[-1, 1]$.
- Principal Component Analysis (PCA)
PCA rotates a dataset to align with its directions of maximum variance, then projects onto the top $k$ components. Computed via SVD of the centered data matrix.
- ReLU
ReLU is max(0, x) — pass positive inputs through, clamp negatives to zero. The cheap, sharp nonlinearity that made training deep networks finally work, and the dominant hidden-layer activation from 2012 until transformers switched to GELU.
- Sigmoid
The sigmoid σ(x) = 1/(1 + e⁻ˣ) squashes any real number into the open interval (0, 1). It was the default neural-network nonlinearity for decades and still survives wherever you need a probability or a gate.
- SiLU
SiLU is x · σ(x): the input gated by its own sigmoid. Also known as Swish, now standard in Llama, Mistral, and most modern open-weight transformers. Practically indistinguishable from GELU.
- Singular Spectrum Analysis (SSA)
SSA is the time-series analog of PCA. Embed a 1-D series into a Hankel trajectory matrix, SVD it, group eigentriples into trend / oscillatory / noise components, and reconstruct.
- Singular Value Decomposition (SVD)
The SVD $A = U \Sigma V^\top$ says every real matrix decomposes into a rotation, an axis-aligned stretch, and another rotation. It is the single most-used matrix factorization in ML, powering PCA, LoRA, low-rank attention, embedding quantization, SSA, and the spectral analysis of any linear map.
- Softmax
Softmax maps a vector of real numbers to a probability distribution: each output is exp(xᵢ) divided by the sum of exp(xⱼ). It is the function that turns logits into next-token probabilities and attention scores into weights.
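A short sketch of the formula follows. Subtracting the maximum logit before exponentiating is a standard numerical-stability step that leaves the result unchanged; the function name is illustrative.

```python
import math


def softmax(logits):
    """Map real-valued scores to a probability distribution."""
    m = max(logits)                            # shift so the largest exponent is 0
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.0])
stable = softmax([1000.0, 1000.0])             # would overflow without the shift
```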
- Spearman Correlation
Spearman's $\rho$ is Pearson correlation computed on ranks instead of raw values. It captures any monotone relationship, linear or curved, and is the correct correlation for ranking and retrieval evaluation, where what matters is order.
- Tanh
Tanh maps any real number into the open interval (−1, 1). A zero-centered sibling of sigmoid that ruled hidden layers before ReLU, and that still lives in RNN cells, attention temperature tricks, and GELU's tanh approximation.
- Tensor
A tensor is a multidimensional array: the rank-N generalization of scalars (rank 0), vectors (rank 1), and matrices (rank 2). In ML, 'tensor' means an n-dimensional array with a shape, dtype, and device.
- Type Systems
A type system is the contract that says 'this variable holds an integer, that function returns a User.' Static type systems (Rust, TypeScript, mypy) catch violations of that contract before the code runs.
- Vector
A vector is an ordered list of numbers — the universal data shape in modern AI. Every embedding, every layer activation, every gradient, every prediction is a vector under the hood.
- Weight Decay
Weight decay is L2 regularization on model parameters: add λ ||θ||² to the loss to penalize large weights. It biases the optimizer toward simpler functions and is the dominant regularizer in modern LLM training.
- Common Crawl
Common Crawl is the largest open web crawl — roughly 250 billion pages across 100+ monthly snapshots, distributed as WARC files. It's the raw material almost every public LLM corpus is built from.
- Data Augmentation
Synthetic perturbation of training examples to expand a dataset's effective size — paraphrase and back-translation for text, rotation and crop for images, in-batch contrastive views for embeddings.
- Data Contamination
When evaluation data leaks into training data, inflating benchmark scores without improving real capability. Detected via n-gram match against eval sets, log-probability attacks, or membership inference.
- Data Curation
The umbrella discipline of preparing a corpus for training: filtering, deduplication, quality scoring, language ID, and classifier-based selection.
- Data Engineering at Scale
What changes when your dataset doesn't fit on one machine, doesn't fit in RAM, and takes hours per pass. The thresholds where ad-hoc Python stops working: single-file → sharded, RAM → streaming, single-node → distributed.
- Data Formats
Data formats are the contracts between memory and disk: how a structured record turns into bytes that can be read back later. Choice of format determines whether you can scan a TB in a second or an hour, and whether schema evolution breaks readers.
- Data Labeling
Human-in-the-loop annotation of training data — crowdsourced (Mechanical Turk, Scale, Surge), expert (domain specialists), and gold-standard sets. Distinct from RLHF preferences.
- Data Mixing
The ratio decisions in a pretraining corpus — what fraction of web vs code vs math vs books vs scientific papers. Second-most-important choice in pretraining after corpus selection itself.
- Dataset Cards
Structured metadata describing a dataset's provenance, license, size, intended use, limitations, and ethical considerations. The HuggingFace dataset-card schema is the de facto standard, and every shipped dataset should have one.
- Deduplication
Removing exact and near-duplicate documents from a training corpus. Exact dedup uses content hashing; near-dedup uses MinHash, SimHash, or embedding-based similarity.
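The two regimes can be sketched with the standard library alone. The helper names, the word-level 3-gram shingles, and the use of seeded SHA-256 as the hash family are all assumptions for illustration; production pipelines typically use faster hash functions and add locality-sensitive hashing on top.

```python
import hashlib


def content_hash(text):
    """Exact dedup key: hash of the case- and whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def shingles(text, k=3):
    """Set of overlapping k-word windows."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash value over the set."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big")
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy cat"
doc_c = "an entirely unrelated sentence about training large models"
sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
near = estimated_jaccard(sig_a, sig_b)
far = estimated_jaccard(sig_a, sig_c)
```

Documents whose estimated similarity exceeds a chosen threshold are treated as near-duplicates; the threshold itself is a tuning decision.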
- Distributed Data Processing
Spark, Ray Data, Beam, Dask — the frameworks that turn N nodes into one logical compute. The map-reduce mental model still rules: per-partition compute is free, cross-partition compute requires a shuffle, and shuffle is the dragon.
- FineWeb
FineWeb is a 15-trillion-token English web corpus released by HuggingFace in 2024, distilled from 96 Common Crawl snapshots through an aggressive filtering and deduplication recipe.
- JSONL
JSONL — JSON Lines, also called NDJSON — is one JSON object per line. Brutally simple, ubiquitous in ML datasets and log shipping, friendly to streaming and append-only writes.
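Reading and writing the format takes only a few lines. The sketch below uses an in-memory buffer in place of a real file, and the helper names are made up for the example.

```python
import io
import json


def write_jsonl(records, fp):
    """Write one JSON object per line."""
    for record in records:
        fp.write(json.dumps(record) + "\n")


def read_jsonl(fp):
    """Yield one parsed object per non-empty line, without loading the whole file."""
    for line in fp:
        line = line.strip()
        if line:
            yield json.loads(line)


buffer = io.StringIO()
write_jsonl([{"id": 1, "text": "hello"}, {"id": 2, "text": "world"}], buffer)
buffer.seek(0)
records = list(read_jsonl(buffer))
```

Because each line is independent, appending a record never requires rewriting earlier ones, which is what makes the format friendly to streaming.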
- Parquet
Parquet is a columnar on-disk format that has become the only reasonable way to store multi-TB datasets. Rows are grouped into chunks; each chunk stores columns separately, compressed, with statistics.
- Pretraining Corpus
The trillion-token text mixture that an LLM consumes during pretraining. Composition (web, code, books, math, scientific papers) and mixing ratios are the two highest-leverage choices in pretraining.
- Streaming Datasets
When your dataset doesn't fit on local disk, you stream it from object storage as tar or Parquet shards. Sequential reads of large objects are 10-100x faster than random access on S3, so streaming formats favor large sequential shards.
- Weak Supervision
Programmatic labeling: write rules, heuristics, and labeling functions, aggregate them into noisy labels for a model to denoise. The Snorkel paradigm.
- Web Scraping
The engineering pipeline for harvesting text data from the public web — crawlers, robots.txt, JS rendering, deduplication-as-you-go, rate limits, and politeness.
- Attention
Attention is the mechanism that lets a token in a sequence dynamically read from any other token's representation, weighted by a learned similarity.
- Autoregressive Generation
Autoregressive generation is the token-by-token loop that decoder LLMs use to produce text: predict the next token from everything generated so far, sample, append, repeat.
- Causal Masking
Causal masking is the lower-triangular attention mask that prevents each token from seeing tokens to its right. It is the architectural commitment that makes a transformer autoregressive — the load-bearing difference between encoder and decoder attention.
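The mask and its effect on attention weights can be sketched in plain Python. Function names are illustrative, and the scores are all zero so that the mask is the only thing shaping the weights.

```python
import math


def causal_mask(n):
    """Lower-triangular mask: position i may attend to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax_row(scores, allowed):
    """Softmax over one row of scores, with disallowed positions set to -inf."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    m = max(masked)                      # finite, since the diagonal is always allowed
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_mask(3)
weights = [masked_softmax_row([0.0, 0.0, 0.0], row) for row in mask]
```

The first token can only attend to itself, the second splits its weight over two positions, and no row places any weight to its right.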
- Context Rot
Context rot is the empirical degradation of an LLM's effective recall and instruction-following as its context window fills. The canonical case is the U-shaped position bias first quantified by Liu et al. (2023) as 'lost in the middle' — facts near the start and end of a long prompt are used, facts buried in the middle are often ignored — but the phenomenon generalizes to attention dilution and instruction drift across long contexts.
- Context Window
The context window is the maximum number of tokens an LLM can process at once. Modern LLMs span 8K to 1M+, but the effective window, where attention quality stays high, is often much shorter.
- Decoder-Only Model
A decoder-only model is a transformer that generates text autoregressively, one token at a time, with causal self-attention so each position only sees prior tokens.
- Encoder Model
An encoder model is a transformer that reads a sequence with bidirectional attention and produces a contextual representation for each token — typically pooled into a single vector.
- Encoder-Decoder Model
An encoder-decoder model is a transformer with two stacks: an encoder reads the input bidirectionally, then a decoder generates the output autoregressively while cross-attending to the encoder's representations.
- FlashAttention
FlashAttention is an I/O-aware attention kernel that tiles the computation in SRAM and fuses the softmax, avoiding the need to materialize the N×N attention matrix in HBM.
- Grouped-Query Attention (GQA)
Grouped-Query Attention shares a single key/value head across a group of query heads, shrinking the KV cache by the group factor with negligible quality loss.
- Hallucination
Hallucination is when an LLM generates a confident-sounding statement that's factually wrong or unsupported by the input. It's the load-bearing failure mode of LLMs in production.
- KV Cache
The KV cache stores the key and value tensors from previous tokens during autoregressive generation, so each new token only computes attention over its own query against cached keys and values — not a full re-computation.
- Large Language Model (LLM)
A large language model is a transformer-based neural network trained on vast text corpora to predict the next token. Modern LLMs (GPT, Claude, Gemini) are general-purpose reasoning engines.
- Layer Normalization
Layer normalization rescales each layer's activations to zero mean and unit variance per token, then applies a learned affine transform. It stabilizes deep transformer training and is what lets modern LLMs reach hundreds of layers without diverging.
- Logits
Logits are the raw, pre-softmax score vector a language model outputs at each position: one real-valued score per vocabulary token. They're the currency of decoding: every sampling strategy and calibration trick operates on them.
- Mamba State-Space Model
Mamba is a linear-time sequence model that replaces attention with a selective state-space recurrence. It runs in O(N) instead of attention's O(N²) and carries a fixed-size state, so memory stays constant however long the context grows.
- Mechanistic Interpretability
Reverse-engineering neural networks at the level of circuits — small subgraphs of attention heads and MLP neurons that implement specific, identifiable computations.
- Mixture of Experts (MoE)
An architecture that replaces the dense feed-forward layer in a transformer with a sparse routing layer over many expert subnetworks — each token activates only a few experts.
- Multi-Head Attention
Multi-head attention splits the attention computation into $h$ parallel heads, each with its own learned projections. Heads specialize on different relations (syntactic, semantic, positional), and their outputs are concatenated and projected. The default attention pattern in every modern transformer.
- Perplexity
Perplexity is the standard intrinsic metric for evaluating language models: the exponentiated average per-token cross-entropy loss on held-out text. Lower is better.
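In code, the definition is a two-liner. The sketch below takes the probabilities a model assigned to each actual next token; the function name and inputs are illustrative.

```python
import math


def perplexity(token_probs):
    """Exponentiated average negative log-probability of the observed tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# A model that is always uniformly unsure among 4 options has perplexity 4.
ppl_uniform = perplexity([0.25, 0.25, 0.25, 0.25])
ppl_certain = perplexity([1.0, 1.0])
```

One way to read the number: the model is, on average, as uncertain as if it were choosing uniformly among that many tokens.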
- Positional Encoding
Positional encoding gives a transformer a sense of token order — necessary because raw self-attention is permutation-equivariant and would treat 'dog bites man' and 'man bites dog' identically.
- Pretraining
Pretraining is the initial massive next-token-prediction phase that trains a language model on trillions of tokens of generic text. It's where an LLM acquires its broad capability — grammar, world knowledge, reasoning, code.
- Reasoning Model
A reasoning model is an LLM trained to spend test-time compute on internal chain-of-thought before answering. The post-o1 paradigm: pretraining + SFT + RL on verifier-checkable problems, with hidden 'thinking' tokens as the substrate.
- Residual Connection
A residual connection adds a layer's input to its output, so each block computes an update on top of a running 'residual stream' rather than transforming the representation from scratch.
- RoPE (Rotary Positional Embedding)
RoPE encodes token position by rotating pairs of dimensions in the query and key vectors by an angle proportional to position. The dot product between query and key then becomes a function of their relative position.
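The relative-position property can be checked numerically. This sketch assumes the commonly used frequency schedule with base 10000 and rotates adjacent dimension pairs; the function names and test vectors are invented for the example.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate each adjacent pair of dimensions by an angle proportional to position."""
    d = len(vec)                                  # assumed even
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)       # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 0.0, 0.5, -0.5]
k = [0.3, 0.7, -0.2, 0.9]
# Both pairs of positions are 2 apart, so the scores should agree.
score_near_start = dot(rope(q, 5), rope(k, 3))
score_shifted = dot(rope(q, 12), rope(k, 10))
```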
- Scaling Laws
Scaling laws are empirical power-law relationships between compute, parameter count, training tokens, and language-model loss. The best-known example is Chinchilla's 2022 result: train on roughly 20 tokens per parameter for compute-optimal performance.
- Sliding-Window Attention
Sliding-window attention restricts each token to attend only to the past $w$ tokens (typically 4K-8K) instead of the full context, trading global receptive field for $O(N \cdot w)$ instead of $O(N^2)$ compute. Used in Mistral, Gemma, Phi, and most long-context efficient designs.
- Sparse Autoencoders
A wide, sparsely-activated autoencoder trained on transformer activations. The learned dictionary recovers monosemantic features — directions that fire for a single human-understandable concept rather than the polysemantic mush of raw neurons.
- Subword Tokenization (BPE, WordPiece, SentencePiece)
Subword tokenization is the family of algorithms that learn a vocabulary of subword units from a corpus. BPE (byte-pair encoding) merges the most frequent adjacent pairs; WordPiece and Unigram are variants. tiktoken, SentencePiece, and tokenizers are the standard libraries.
- Test-Time Compute
Test-time compute trades inference budget for accuracy by spending more tokens, samples, or search steps per query. Self-consistency, best-of-N, reasoning chains, and tree search are all instances. It's the substrate behind the o1 / R1 reasoning-model paradigm.
- Tokenization
Tokenization is how raw text becomes numerical input for a language model — the input is sliced into tokens (sub-word units, typically 3–5 characters each), each token mapped to an integer ID.
- Transformer
The transformer is the neural-network architecture behind every modern LLM, embedding model, and reranker. Its defining feature is self-attention.
- Audio Embeddings
Audio embeddings map a waveform or spectrogram into a fixed-size vector space where similar-sounding clips land near each other. Wav2Vec 2.0, HuBERT, and BGE-Audio set the modern recipe.
- CLIP
CLIP (Contrastive Language-Image Pretraining, Radford et al. 2021) is a dual-encoder model that embeds images and text into a shared vector space. It is trained contrastively on 400M (image, caption) pairs scraped from the web.
- Diffusion Model
A diffusion model generates images by iteratively denoising pure Gaussian noise. The forward process gradually adds noise to a real image; the reverse process is a learned neural network that removes it step by step.
- Flow Matching
A generative-modeling objective that learns a continuous vector field transporting noise to data along straight or curved probability paths. Generalizes and often replaces diffusion: simpler training, faster sampling, and the substrate behind SD3, Flux, and Veo.
- Image Encoder
An image encoder maps a raw image into a sequence of patch embeddings or a pooled vector. Modern multimodal stacks use Vision Transformer (ViT) encoders that tokenize the image into 16x16 or 14x14 patches.
- Multimodal RAG
Multimodal RAG is retrieval-augmented generation where the query, the documents, or both span multiple modalities — PDFs with figures, screenshots, voice queries, or image-grounded answers.
- OCR (Optical Character Recognition)
OCR converts image regions containing text into machine-readable strings. Classical pipelines (Tesseract, Google Cloud Vision, AWS Textract) detect text regions then recognize them via CNN+LSTM.
- SigLIP
SigLIP (Zhai et al. 2023) replaces CLIP's softmax contrastive loss with a per-pair sigmoid loss, decoupling each (image, text) pair from the rest of the batch.
- Text-to-Image
Text-to-image is the generation capability where a natural-language prompt produces an image. The dominant architecture is a CLIP-conditioned latent diffusion model.
- Vision Transformer (ViT)
The Vision Transformer applies a standard transformer to image patches instead of words. An image is cut into a grid of 16×16 patches, each linearly embedded into a token, fed to a transformer encoder with positional encodings.
- Vision-Language Model (VLM)
A vision-language model is an LLM that can see. Image patch embeddings are projected into the LLM's token space and concatenated with text tokens; the model treats them as a uniform sequence and generates text autoregressively.
- Visual Question Answering
Visual question answering (VQA) is the task of producing a natural-language answer to a question about an image. It is the canonical benchmark for vision-language models because it forces grounding.
- Whisper ASR
Whisper (OpenAI, 2022) is an encoder-decoder transformer for automatic speech recognition trained on 680K hours of weakly-supervised multilingual audio.
- Chain-of-Thought
Chain-of-thought (CoT) prompting asks the model to produce intermediate reasoning steps before its final answer. The intermediate tokens act as a scratchpad.
- Constrained Decoding
Constrained decoding restricts an LLM's next-token distribution to only tokens that keep the partial output valid against a grammar or schema.
- Few-Shot Prompting
Few-shot prompting is the technique of including 2-5 input/output examples in the prompt to demonstrate the desired behavior. It works dramatically better than describing the rule in words because the model picks up on format and edge cases.
- In-Context Learning
In-context learning (ICL) is the empirical phenomenon that LLMs can adapt to new tasks from examples in the prompt — without any weight updates.
- Prompt Caching
Prompt caching is the API-side feature that lets you reuse a model provider's KV cache for a stable prompt prefix across requests. You mark the prefix as cacheable, the provider keeps its KV cache warm.
- Prompt Engineering
Prompt engineering is the practice of writing inputs to an LLM that reliably produce the outputs you want. It includes structure (system prompts, few-shot examples) and reasoning patterns (chain-of-thought).
- Prompt Injection
Prompt injection is adversarial input that hijacks an LLM's instruction-following — making the model treat attacker text as if it came from the developer.
- Prompt Template
A prompt template is a parameterized, reusable prompt — variables filled in at runtime from request data. In production systems, prompt templates are first-class artifacts: versioned, tested, and A/B-deployed like any other code.
- ReAct Prompting
ReAct (Reasoning + Acting) is the prompting pattern where the model alternates between thoughts, tool actions, and observations in a loop. It's the foundational structure behind nearly every modern LLM agent.
- Self-Consistency
Self-consistency samples N independent chain-of-thought reasoning paths and majority-votes the final answer. It's the cheapest test-time-compute trick.
- Structured Output
Structured output is the practice of forcing an LLM to produce machine-parseable output — JSON, XML, or any schema-conforming format — instead of free-form text.
- System Prompt
The system prompt is the privileged instruction channel — separate from user input — that sets a model's overall behavior, persona, and constraints.
- Temperature Sampling
Temperature is a scalar that divides the logits before softmax, controlling how peaked or flat the next-token distribution is. Temperature 0 is greedy decoding (always pick the argmax); higher temperatures sample more diversely.
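A small sketch shows the effect on a fixed logit vector. The function name is hypothetical, and temperature 0 is handled as a special case because dividing by zero is undefined.

```python
import math


def apply_temperature(logits, temperature):
    """Return the next-token distribution after temperature scaling."""
    if temperature == 0:
        # Greedy decoding: all probability mass on the argmax.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
greedy = apply_temperature(logits, 0)
sharp = apply_temperature(logits, 0.5)
flat = apply_temperature(logits, 2.0)
```

Lower temperatures concentrate mass on the top token; higher temperatures spread it toward the tail.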
- Top-p (Nucleus) Sampling
Top-p sampling restricts each step's sampling to the smallest set of tokens whose cumulative probability is at least p. Unlike top-k's fixed cutoff, the nucleus adapts to the distribution's shape.
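The filtering step can be sketched as follows. The function name is illustrative, and the example stops at renormalizing the nucleus rather than drawing a sample.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    # Renormalize so the kept tokens form a proper distribution to sample from.
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


nucleus = top_p_filter([0.5, 0.3, 0.15, 0.05], p=0.9)
tight = top_p_filter([0.5, 0.3, 0.15, 0.05], p=0.5)
```

With a peaked distribution the nucleus may be a single token; with a flat one it can span much of the vocabulary.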
- Tree-of-Thought
Tree-of-thought (ToT) generalizes chain-of-thought by exploring a search tree of reasoning paths instead of a single linear chain. The model branches into multiple candidate next steps, evaluates them, and backtracks when a branch goes wrong.
- Zero-Shot Prompting
Zero-shot prompting is asking a model to do a task with no examples — only a description of what you want. It works surprisingly well on common tasks because the model's training distribution already contains analogous patterns.
- Agent
An agent is an LLM placed in a perception/decision/action loop — it reads context, picks an action (often a tool call), observes the result, and iterates until the goal is met.
- Agent Guardrails
Agent guardrails are the input/output filters, tool-call validators, and allow-lists that bound what an agent can do and say. Defense-in-depth: layered checks at the prompt boundary and the tool boundary.
- Agent Loop
The agent loop is the execution scaffold that wraps an LLM into an agent: perceive → think → act → observe → repeat. It's the trajectory primitive.
- Agent Memory
Agent memory is how an agent persists information across turns and sessions. Short-term memory lives in the context window; long-term memory lives in an external store (vector DB, structured records, files).
- Agent Orchestration
Agent orchestration is the routing layer that decides which agent or model handles each step. The dominant patterns are workflow orchestration (a deterministic graph of agents) and autonomous orchestration (a supervisor delegating to sub-agents).
- Agentic RAG
Agentic RAG is RAG where the model decides what to retrieve, reformulates queries, and iterates — instead of a single pre-baked query going to the index.
- Function Calling
Function calling is the structured-API mechanism that providers (OpenAI, Anthropic, Google) expose for tool use: you give the model a JSON schema describing each function, and the model responds with a typed call request the runtime can execute.
- MCP (Model Context Protocol)
MCP is Anthropic's open standard for connecting LLMs to tools and data sources. An MCP server exposes a catalog of tools, resources, and prompts; any MCP-aware client can use them.
- Multi-Agent Systems
Multi-agent systems use multiple specialized agents — different roles, tools, or models — coordinating to solve a task. Patterns range from a coordinator dispatching to specialists to debate setups where agents argue toward a better answer.
- Planning and Decomposition
Planning and decomposition is the agent pattern of breaking a complex goal into ordered sub-tasks and executing them, instead of trying to one-shot the whole thing.
- Reflection and Critique
Reflection is the agent self-evaluation pattern: produce an answer, evaluate it against the goal or known criteria, refine if needed. It catches errors that one-shot generation misses, at the cost of extra tokens and latency.
- Tool Use
Tool use is the pattern where an LLM emits a structured request to call an external function — a search API, a code runner, a database query — and the runtime executes it and returns the result.
- Approximate Nearest Neighbor (ANN)
ANN algorithms — HNSW, IVF, ScaNN — find the closest vectors to a query without scanning all of them. They give up a small slice of recall in exchange for orders-of-magnitude speedup.
- BM25
BM25 is a classical lexical retrieval algorithm that scores documents by how well their term frequencies match a query, with corrections for document length and rare-term importance.
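A compact scoring sketch follows. It assumes one common variant of the formula (the IDF with 1 added inside the logarithm, and the conventional defaults k1 = 1.5 and b = 0.75); implementations differ in these details, and the function name and toy corpus are invented.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document (list of terms) against the query terms."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Rare terms get a larger IDF weight.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term frequency saturates via k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [["cat", "sat", "mat"], ["dog", "sat", "log"], ["cat", "cat", "hat"]]
scores = bm25_scores(["cat"], docs)
```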
- Chunking
Chunking is the process of splitting long documents into smaller passages that fit cleanly inside an embedding model's context window — and that align with semantic boundaries so each chunk is independently retrievable.
- Dense Retrieval
Dense retrieval finds documents by comparing their embeddings to a query embedding via cosine or dot product, served from an approximate-nearest-neighbor index.
- FAISS
FAISS (Facebook AI Similarity Search) is the C++ library for efficient similarity search and clustering of dense vectors. It implements the canonical ANN algorithms — flat, IVF, HNSW, PQ, and combinations — with CPU and GPU backends.
- First-Pass Retrieval
First-pass retrieval is the initial wide-net stage of a production search pipeline that surfaces a few hundred candidate documents per query out of millions. It optimizes for recall and speed; precision-at-the-top is left to a reranker downstream.
- Grounded Generation
Grounded generation is the pattern of forcing an LLM's output to be derivable from a supplied set of retrieved sources, with citations attached. The standard defense against hallucination in RAG pipelines.
- HNSW
HNSW (Hierarchical Navigable Small World) is the dominant graph-based ANN algorithm. A multi-layer proximity graph supports log-time approximate search by greedy walks at each layer.
- Hybrid Search
Hybrid search combines lexical retrieval (BM25) with dense retrieval (embeddings) into one ranked candidate set. Each method catches what the other misses, so the union is more recall-complete than either alone.
- Inverted Index
An inverted index maps each term to the list of documents (and positions) where it appears. The classical data structure behind keyword search — sub-millisecond lookups over billions of documents and the substrate every BM25 implementation builds on.
- IVF Clustering
IVF (Inverted File Index) is the cluster-based ANN algorithm: K-means partitions the corpus into a few thousand cells, each query is matched to its nearest centroids, and the search then runs exhaustively within only those cells.
- Parent-Document Retrieval
Parent-document retrieval splits the index granularity from the context granularity: embed and retrieve over small chunks for precision, but return the larger parent document to the LLM. Fixes the chunk-boundary problem in RAG.
- Product Quantization
Product quantization (PQ) compresses a vector by splitting it into M sub-vectors and quantizing each independently against a small codebook learned via K-means.
- Query Expansion
Augmenting the original query with synonyms, paraphrases, or hypothetical answers before retrieval. The classical IR technique that LLMs reinvented as HyDE. Sometimes a clean win, sometimes drift that hurts more than it helps.
- Query Rewriting
Query rewriting transforms a user's raw query into one or more reformulated versions tuned for retrieval — expanding abbreviations, decomposing multi-part questions, or fixing the syntax expected by an underlying search API.
- RAG (Retrieval-Augmented Generation)
RAG is the pattern of retrieving relevant documents and feeding them into an LLM as context, so the LLM can answer with grounded, citeable information instead of guessing from its training data.
- Reciprocal Rank Fusion
Reciprocal rank fusion (RRF) is the boring, parameter-free way to merge multiple ranked lists into one. Sum 1/(k + rank) across lists with k = 60, and you have the default fusion method in production hybrid-search stacks.
- Semantic Search
Semantic search is the umbrella term for retrieval that goes beyond surface keyword matching to capture meaning — most often via dense embeddings, but also via learned-sparse models, query rewriting, and reranking.
- Sparse Retrieval
Sparse retrieval is the family of methods that represent queries and documents as high-dimensional sparse vectors over a vocabulary — including BM25 and modern learned-sparse models like SPLADE and uniCOIL.
- SPLADE
SPLADE (SParse Lexical AnD Expansion) is a learned sparse retrieval model: a transformer produces a sparse term-weight vector over the BERT vocabulary for each query and document, scored by dot product on an inverted index.
- TF-IDF
TF-IDF weighs a term by how often it appears in a document (term frequency) times how rare it is across the corpus (inverse document frequency).
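The product above can be sketched in a few lines. This is a minimal illustration, assuming raw counts for term frequency and the unsmoothed log(N / df) form for IDF (real systems use several smoothed variants); the toy corpus and the `tf_idf` helper are hypothetical names, not part of any library.

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """TF-IDF of `term` in `doc`; documents are lists of tokens."""
    tf = Counter(doc)[term]                    # raw count in this document
    df = sum(1 for d in corpus if term in d)   # number of documents containing the term
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)     # unsmoothed IDF (assumption)

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ate", "the", "fish"],
]
common = tf_idf("the", corpus[2], corpus)   # in every document, so IDF = log(1) = 0
rare = tf_idf("fish", corpus[2], corpus)    # in one of three documents
```

A term that appears everywhere scores zero no matter how often it occurs, which is the "rare-term importance" effect that BM25 also builds on.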
- 2-norm (Euclidean Length)
The 2-norm of a vector is its Euclidean length — the square root of the sum of squared components. Normalizing a vector to 2-norm = 1 makes it a unit vector.
- Bi-Encoder
A bi-encoder embeds the query and the document separately into vectors, then compares them with a dot product or cosine. Fast and cacheable — the basis of every dense retrieval system.
- Contrastive Learning
The training paradigm behind almost every modern embedding model. Pull positive pairs (query, relevant document) close in vector space; push negatives far apart.
- Cosine Similarity
Cosine similarity is the cosine of the angle between two vectors — equivalently, their dot product divided by the product of their magnitudes. It's the standard way to compare embedding vectors for relevance.
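A minimal sketch of that definition in plain Python, assuming dense vectors given as equal-length lists of floats; the function name is illustrative.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the two Euclidean lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # perpendicular vectors
```

When both vectors are already normalized to unit length, the denominator is 1 and cosine similarity reduces to a plain dot product.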
- Cross-Lingual Retrieval
Cross-lingual retrieval is finding documents in one language that answer a query in another. A multilingual embedding or reranker maps text from any language into the same vector space, so a French query can retrieve English documents.
- Curse of Dimensionality
In high-dimensional spaces, distance and similarity behave counterintuitively — random points become nearly equidistant, volume concentrates near the surface of any region, and naive nearest-neighbor search loses much of its discriminative power.
- Embedding
An embedding is a fixed-size vector representation of a piece of text (or image, audio, etc) that places semantically similar inputs near each other in a high-dimensional space. The basis of dense retrieval, semantic search, and most modern RAG.
- Embedding Quantization
Quantization compresses each dimension of an embedding from 32-bit floats down to smaller representations — typically int8 (4× smaller) or single-bit binary (32× smaller) — to shrink index size and speed up similarity search.
- Hard-Negative Mining
The training-data trick that makes embedders actually competitive: source negatives that look similar to the positive but aren't actually relevant.
- In-Batch Negatives
The simplest way to scale contrastive training: treat every other example in the same batch as a negative for the current positive pair. Free supervision, no extra forward passes. The reason embedder training cares about batch size.
- InfoNCE Loss
InfoNCE is the contrastive loss objective behind almost every modern embedder. For each positive pair, softmax-normalize the similarities of (positive, negatives) and treat it as N+1-way classification.
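The N+1-way classification view can be sketched for a single positive pair as follows. This is an illustration only, assuming precomputed similarity scores and a temperature of 0.05 (a hypothetical default chosen for the example); the function name is not from any library.

```python
import math

def info_nce(pos_sim, neg_sims, temperature=0.05):
    """Cross-entropy of picking the positive out of (positive + negatives)."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    m = max(logits)                                   # subtract max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[0]                    # -log softmax probability of the positive

easy = info_nce(0.9, [0.1, 0.0, -0.2])   # positive clearly ahead of the negatives
tied = info_nce(0.5, [0.5, 0.5, 0.5])    # positive indistinguishable from 3 negatives
```

When the positive is tied with N negatives, the loss equals log(N + 1), the loss of a uniform guess over N + 1 classes.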
- Johnson-Lindenstrauss Lemma
A 1984 result that says you can reduce a high-dimensional vector to a much lower dimension via random projection while approximately preserving pairwise distances. The mathematical reason aggressive dimension truncation works for embeddings.
- Matryoshka Representation Learning (MRL)
Matryoshka representation learning trains an embedding model so that prefixes of its output vector are themselves valid embeddings — letting you truncate from 2048 to 1024 to 512 dimensions at inference time without retraining.
- Multimodal Embeddings
An embedding space shared across modalities — text, image, audio, video — so a query in one modality retrieves content in another. CLIP-style contrastive training is the dominant recipe. Doing it well is far harder than doing it at all.
- Multiple Negatives Ranking Loss
MNRL is a contrastive ranking loss that scores a query against one positive and many negatives, then trains the positive to score highest. Popularized by sentence-transformers, it's the workhorse loss for fine-tuning bi-encoders on labeled pairs.
- Orthogonality Concentration
In high dimensions, two random vectors are almost always nearly orthogonal — their cosine similarity concentrates sharply around 0. The reason untrained embeddings give noise and why training has to actively fight the geometry.
- Cascade Rerankers
A cascade reranker stacks multiple rerankers from cheap-and-fast to expensive-and-accurate, with each stage filtering candidates before passing a smaller set to the next.
- ColBERT
A late-interaction retrieval architecture: encode each token of query and document into its own vector, score pairs by maxsim. Sits between bi-encoder (one vector per text, fast) and cross-encoder (full attention, accurate but slow).
- Cross-Encoder
A cross-encoder takes a (query, document) pair as a single joint input and produces one relevance score. It captures token-level interactions between query and document — much more accurate than embedding them separately, at higher cost per pair.
- Instruction-Following Reranker
An instruction-following reranker accepts an explicit instruction or context alongside the (query, document) pair, and reranks accordingly. Lets you inject business rules, user preferences, or domain context per call without retraining.
- Listwise Reranking
Listwise reranking processes the entire candidate list as a single input and produces a permutation, rather than scoring each (query, document) pair independently. More expressive but more expensive — typically powered by an LLM.
- Pairwise Reranker
A reranker that scores by comparing two candidate documents head-to-head: model(query, doc_A, doc_B) → which is more relevant. More accurate than pointwise (transitivity arbitrage, calibration-free) but O(n²) in the number of candidates at inference.
- Pointwise Scoring
Pointwise scoring evaluates each (query, document) pair independently, producing one relevance score per pair. The dominant pattern for cross-encoder rerankers because it's simple, parallelizable, and produces calibrated scores.
- Reranker
A reranker is a second-stage retrieval model that takes a candidate set from first-pass retrieval and reorders it by relevance. It's how production search systems get high precision without paying full LLM cost on every query.
- Score Calibration (Rerankers)
A calibrated reranker outputs scores whose absolute value is meaningful — 0.8 means roughly 80% relevance consistently across queries and domains, so you can threshold and filter reliably. Most rerankers are not calibrated.
- BEIR Benchmark
BEIR is a heterogeneous benchmark of 18 retrieval datasets across domains — biomedical, news, finance, scientific QA, fact-checking — designed to test zero-shot retrieval. The standard reference for whether a retriever generalizes beyond MS MARCO.
- Calibration-Discrimination Analysis
When you compare two scoring systems on the same items — index score vs reranker score, model score vs ground-truth grade — the residuals from a regression line tell you where they disagree.
- Citation Extraction
Citation extraction maps each claim in an LLM-generated answer back to the supporting span in the source documents. Distinct from generation — often a small specialized model — and what makes RAG outputs auditable.
- Classical Test Theory
The 100-year-old psychometric toolkit that ML eval research mostly ignored. Decompose every observed score into X = T + E (true score + error).
- Cohen's Kappa
Cohen's κ: observed agreement minus chance agreement, normalized. The standard inter-annotator agreement metric. Raw % agreement is misleading on imbalanced classes; kappa is the honest version.
- Cronbach's Alpha
Cronbach's α: the aggregate internal-consistency reliability number that falls out of CTT.
- Eval Set Quality
A practical diagnostic checklist for "is this benchmark actually any good?" Layer four measurement-theory tools, including CTT for per-item pathologies and Cronbach's α for aggregate reliability.
- F1 Score
The F1 score is the harmonic mean of precision and recall — a single number that punishes lopsided performance. Standard for classification, rare in retrieval, where ranked metrics like NDCG@K are usually the better choice.
- Faithfulness
Faithfulness is whether each claim in an LLM's answer is actually supported by the retrieved context. Distinct from relevance and from accuracy.
- Graded Relevance LLM Judge
An LLM-as-judge configured to emit graded relevance — typically a 0-3 scale (irrelevant / marginal / relevant / highly relevant) rather than a binary yes/no.
- Isotonic Regression
Isotonic regression fits a non-parametric monotone function from raw scores to calibrated probabilities. More flexible than Platt scaling — handles any monotone miscalibration shape — at the cost of needing more labels and being prone to overfitting at the score-distribution tails.
- LLM-as-judge
Using a frontier LLM to score outputs — relevance, faithfulness, answer quality — at scale where human raters can't keep up. Powerful for graded labels, but introduces position bias, verbosity bias, model bias.
- MAP (Mean Average Precision)
Mean Average Precision averages precision at each rank where a relevant document appears, then averages across queries. The older sibling of NDCG — comparable for binary relevance, weaker for graded relevance.
- MRR (Mean Reciprocal Rank)
Mean Reciprocal Rank is the average of 1/rank across queries, where rank is the position of the first relevant document. Heavily front-loaded — only the top result really matters.
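A short sketch of the computation, assuming 1-based ranks and the common convention that a query with no relevant result contributes 0 (that convention and the function name are assumptions for illustration).

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """Ranks are 1-based; None means no relevant document was returned."""
    reciprocal = [0.0 if r is None else 1.0 / r for r in first_relevant_ranks]
    return sum(reciprocal) / len(reciprocal)

# Four queries: hit at rank 1, hit at rank 2, no hit, hit at rank 4.
mrr = mean_reciprocal_rank([1, 2, None, 4])   # (1 + 0.5 + 0 + 0.25) / 4
```

Moving a hit from rank 2 to rank 1 adds 0.5 to that query's term, while moving it from rank 10 to rank 9 adds about 0.011, which is what "front-loaded" means in practice.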
- MS MARCO
MS MARCO is Microsoft's web-search dataset of ~1M Bing queries paired with passages and human relevance judgments. The standard training corpus for retrievers and rerankers, and the starting point for most modern dense retrievers.
- MTEB
Massive Text Embedding Benchmark — a public benchmark covering 50+ datasets across retrieval, classification, clustering, and more. The de facto leaderboard for embedding models, despite some well-documented limitations in its retrieval portion.
- NDCG@K
Normalized Discounted Cumulative Gain at K is a ranking quality metric that rewards relevant documents appearing high in the result list, with logarithmic discounting for lower positions. The standard top-of-list quality metric for rerankers.
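A minimal sketch of the metric, assuming linear gain (the relevance grade itself rather than 2^rel - 1) and an ideal ordering computed from the returned list only; both are simplifying assumptions, and evaluation toolkits differ on them.

```python
import math

def dcg_at_k(relevances, k):
    """Sum of graded relevance discounted by log2(position + 1), positions 1-based."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG of the actual order divided by DCG of the best possible order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

perfect = ndcg_at_k([3, 2, 1], k=3)    # already in ideal order
reversed_ = ndcg_at_k([1, 2, 3], k=3)  # best document ranked last
```

The reversed list still scores well above zero because every document is somewhat relevant; NDCG penalizes ordering, not just presence.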
- Platt Scaling
Platt scaling fits a logistic sigmoid on top of a model's raw scores to produce calibrated probabilities. Cheap, two parameters, the standard first-resort calibration method for SVMs, classifiers, and uncalibrated rerankers.
- Precision@K
Precision@K is the fraction of the top-K returned documents that are relevant. The classical IR metric retrieval moved away from in favor of NDCG, but still the right choice when every position in the result list carries equal weight.
- Recall@K
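Recall@K is the fraction of a query's relevant documents that appear in the top-K results; with one relevant document per query, it reduces to the fraction of queries whose relevant document appears anywhere in the top-K. It measures whether retrieval found the right document at all, which makes it the silent ceiling on every downstream stage.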
- Statistical Significance in Retrieval Evals
Retrieval evals report metrics like NDCG@10 averaged across queries — but each query is one sample, and most public benchmarks have hundreds, not thousands. A '+0.5 NDCG' difference is often noise.
- Catastrophic Forgetting
When fine-tuning a pre-trained model on a new task erases capabilities the base model originally had. The classical neural-network failure mode that dominates fine-tuning practice — and the reason LoRA, mixed-data training, and rehearsal exist.
- Constitutional AI
Constitutional AI replaces human pairwise preference labels with a written constitution — a list of natural-language rules — and uses an LLM to critique and revise its own outputs against those rules.
- DPO (Direct Preference Optimization)
DPO is the closed-form alternative to RLHF: optimize the LLM directly on pairwise preferences, with no separate reward model and no reinforcement learning loop. Simpler, more stable, and the default alignment recipe in 2026.
- Elo Score
Elo is a continuous skill rating recovered from pairwise win/loss outcomes — originally for chess, now repurposed in retrieval to convert pairwise document preferences into pointwise relevance scores.
- Ensemble Learning
Combining the predictions of multiple models — bagging, boosting, stacking — to get a single output more accurate than any individual member.
- Entropy Regularization
Adding an entropy bonus to a training objective to keep the model's output distribution from collapsing too sharply. Used in policy-gradient RL (PPO, SAC, A3C) to encourage exploration.
- Fine-Tuning
Fine-tuning is the process of further training a pre-trained model on task-specific or domain-specific data. It's how a generalist becomes a specialist.
- Information Bottleneck
The information bottleneck principle frames learning as a compression problem: find a representation T of input X that throws away every bit of X that is not informative about the target Y. Formally, maximize I(T; Y) while minimizing I(X; T).
- Instruction Tuning
Instruction tuning is fine-tuning a pre-trained language model on (instruction, response) pairs so it learns to follow directions. The step that turns 'GPT-base' into 'GPT-instruct'.
- Knowledge Distillation
Training a small (student) model to mimic the outputs of a larger (teacher) model — getting most of the teacher's quality at a fraction of the cost. The basis of essentially every production deployment of small specialized models.
- Learning-Rate Scheduler
A learning-rate scheduler is the function that changes the learning rate over training. Linear warmup followed by cosine decay is the modern default; WSD (warmup-stable-decay) is the 2024 successor. Picking the schedule is as load-bearing as picking the peak LR.
- LoRA and Parameter-Efficient Fine-Tuning (PEFT)
LoRA injects tiny low-rank adapter matrices into a frozen base model and trains only those — typically ~1% of the parameters. Results match or beat full fine-tuning on most narrow tasks at a fraction of the memory and storage cost.
- Pairwise Preference
Pairwise preference is the supervision signal where, for a query and two candidate documents, an annotator (or LLM) picks which one is more relevant.
- PPO (Proximal Policy Optimization)
A clipped policy-gradient algorithm that keeps each update close to the previous policy via a clip on the importance-sampling ratio. The standard RL optimizer for RLHF — Schulman et al. 2017, OpenAI — and the algorithm GPT-3.5/4 and Llama-2 were aligned with.
- Process Reward Model
A process reward model (PRM) scores each intermediate step of a reasoning chain, not just the final answer. It's the supervision signal that powers post-o1 reasoning models — credit assignment along the trajectory, not only at the end.
- Reward Modeling
Training a model that predicts a scalar quality or preference score for an LLM's output. The backbone of RLHF — the reward model is what the LLM optimizes against.
- RLHF (Reinforcement Learning from Human Feedback)
RLHF is the classical alignment recipe: train a reward model from human pairwise preferences, then fine-tune the language model with PPO to maximize that reward.
- Supervised Fine-Tuning (SFT)
SFT is plain supervised learning applied to a pre-trained language model: given (input, target) pairs, train the model to produce the target. The umbrella term for any fine-tuning that's not preference-based — distinct from RLHF and DPO.
- Synthetic Data Generation
Using a frontier LLM to generate training data for a smaller specialized model. The dominant data-creation method in 2026 — every modern open-weight instruct model and most production-tuned rerankers train on synthetic data, including zerank-2.
- Thurstone Model
A statistical model from 1927 that converts pairwise comparisons into continuous quality scores. Foundational to chess Elo ratings, food preference studies, and modern reranker training via the zELO methodology.
- zELO
ZeroEntropy's training methodology for rerankers and embeddings. Frontier LLMs vote pairwise on document relevance; a Thurstone fit recovers continuous Elo-style scores; the scores become regression targets for a small specialized model.
- Arithmetic Intensity
Arithmetic intensity is FLOPs per byte read from memory. Combined with the hardware's compute-to-bandwidth ratio it determines whether a kernel is memory-bound (intensity below the ridge) or compute-bound (above).
- AWQ — Activation-Aware Weight Quantization
INT4 weight-only quantization that protects the salient weight channels — the ones multiplied by large activations — by absorbing a per-channel scale into the weights before rounding.
- CUDA Programming
NVIDIA's parallel-computing platform for GPUs. A C++ extension plus a runtime that exposes the GPU as a SIMT machine: thousands of threads grouped into warps, blocks, and grids, with explicit control over a tiered memory hierarchy.
- Data Parallelism
Replicate the model on every GPU, shard the batch across replicas, and synchronize gradients with an allreduce after each backward pass. The simplest distributed-training pattern, and the default for any model that fits on a single device.
- FP8
Two IEEE-style 8-bit float variants, E4M3 (1 sign, 4 exponent, 3 mantissa) and E5M2 (1 sign, 5 exponent, 2 mantissa). E4M3 has higher precision and narrower range, used for forward activations and weights.
- FSDP (Fully Sharded Data Parallel)
FSDP shards parameters, gradients, and optimizer state across data-parallel ranks instead of replicating them. Each rank holds only 1/N of the weights at rest and gathers full layers on the fly during forward and backward.
- GGUF and the K-Quant Family
The file format and quantization scheme that powers llama.cpp, the de-facto local-inference stack for LLMs on commodity hardware. GGUF embeds tokenizer, chat template, and quantized weights in a single mmap-able artifact.
- GPTQ — Hessian-Based Post-Training Quantization
Layer-by-layer 4-bit weight quantization that minimizes layer-output reconstruction error using a Hessian computed from a small calibration set.
- GPU Kernel Authoring
Writing custom GPU kernels — in CUDA C++, Triton, CUTLASS, or ThunderKittens — when the off-the-shelf library is leaving performance on the table.
- GPU Memory Hierarchy
Modern GPUs have three relevant memory levels: HBM (slow, abundant), SRAM (fast, tiny), and registers (fastest, tiniest). HBM bandwidth is roughly 1-3 TB/s; on-chip SRAM bandwidth is closer to 20 TB/s.
- Gradient Accumulation
Run multiple micro-batches sequentially, summing their gradients into a buffer, before applying a single optimizer step. Lets you simulate a large effective batch on memory-constrained hardware. The standard trick for hitting target batch sizes when a single batch won't fit on the GPU.
- Gradient Checkpointing
Trade compute for memory by recomputing forward activations during the backward pass instead of storing them. Roughly 5x memory savings on activations at a cost of ~30% slower training.
- Inference Graph Compilation
Capture a model's computation as a static graph, optimize it (operator fusion, constant folding, attention specialization, kernel selection), and emit a compiled artifact that runs without Python overhead. torch.compile, TensorRT-LLM, ONNX Runtime.
- Kernel Fusion
Combining multiple GPU operations into a single CUDA kernel call so that intermediate tensors live in registers or shared memory instead of round-tripping through HBM.
- Mixed-Precision Training
Train with bf16 or fp16 activations and weights instead of fp32, while keeping master weights and optimizer accumulations in fp32 for numerical stability.
- Model FLOPs Utilization (MFU)
MFU is achieved FLOPs divided by theoretical peak FLOPs — the headline efficiency metric for whether you're actually using the GPU. Realistic targets in 2026: 40-60 percent during pretraining is good, 50 percent-plus is excellent.
- Model Quantization
Compressing the weights of a trained model from fp16 / bf16 down to int8, int4, fp8, or fp4 representations to fit larger models on smaller hardware and increase inference throughput.
- MXFP4
The Open Compute Project's microscaling 4-bit float. Blocks of 32 elements share a single 8-bit power-of-2 scale (E8M0); each element is a 4-bit micro-float (E2M1). Effective storage: 4.25 bits per weight.
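The 4.25-bit figure follows from amortizing the shared scale over the block. A small worked sketch (the helper name is illustrative):

```python
def bits_per_weight(block_size, element_bits, scale_bits):
    """Element bits plus the shared scale's bits spread across the block."""
    return element_bits + scale_bits / block_size

# MXFP4: 32 elements of 4 bits each share one 8-bit E8M0 scale.
mxfp4 = bits_per_weight(block_size=32, element_bits=4, scale_bits=8)   # 4 + 8/32
```

Smaller blocks track local weight magnitudes more closely but pay more overhead per weight: halving the block size to 16 with the same 8-bit scale gives 4.5 bits.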
- NF4 — NormalFloat 4-Bit Quantization
A 4-bit weight format with 16 levels placed at the equiquantiles of the standard normal distribution rather than uniformly. Trained-network weights are approximately normally distributed, so this spends bits where the mass actually lives.
- NVFP4
NVIDIA's variant of microscaling FP4, introduced with Blackwell. Blocks of 16 elements (vs MXFP4's 32) with an E4M3 FP8 scale (vs MXFP4's power-of-2 E8M0), plus an optional outer FP32 scale across multiple blocks.
- PagedAttention
PagedAttention is the KV-cache memory manager behind vLLM. It treats the KV cache like an OS treats process memory — fixed-size blocks of 16 tokens, mapped through a per-sequence block table.
- Pipeline Parallelism
Pipeline parallelism splits a model's layers across GPU stages and feeds micro-batches through them in an assembly-line schedule. GPipe (2018) introduced the basic idea; 1F1B (PipeDream, 2019) reduced its memory footprint.
- PyTorch Internals
How PyTorch actually executes a forward pass: a torch.Tensor is a thin Python object wrapping a storage, view, dtype, and device; every op routes through a C++ dispatcher that picks the right backend kernel for the (dtype, device, layout) tuple.
- Quantization-Aware Training (QAT)
Training a model with quantization simulated in the forward pass so the weights co-adapt to low-precision rounding. Recovers quality that post-training quantization loses, at the cost of a fine-tuning run. The standard recipe for sub-4-bit deployment.
- Tensor Parallelism
Tensor parallelism splits each individual matrix multiplication across multiple GPUs — column-split or row-split with an all-reduce — so weights too large for one GPU's memory still produce one combined result.
- Caching Strategies
Three layers of caching for LLM-driven systems: exact-match (request → response), prompt-prefix (KV cache reuse for shared prefixes), and semantic (similar-query reuse via embeddings). Each helps different production workloads in different ways.
- Context Compression
Context compression shrinks a retrieval result set or agent trace down to just the spans the LLM actually needs, before sending it to the model. Crucial for long-running agentic systems where context blows past the model's effective attention window.
- Continuous Batching
The vLLM-style scheduling trick where requests join and leave a batch in-flight, dynamically. Massively improves GPU utilization for variable-length generation compared to naive static batching, and is the default in every modern LLM serving stack.
- Cost per Token
The economics primitive of LLM-driven systems. Per-token pricing — input and output, with output usually 3-5× input — is what makes a feature financially viable or not. Production decisions are dominated by this number more than any other.
- Drift Detection
Monitoring distributional shift in inputs, outputs, or intermediate signals of a retrieval or LLM pipeline. The discipline that catches 'the metric is silently moving' before users notice.
- LanceDB
LanceDB is an open-source vector database built on the Lance columnar format: append-only, Rust core, columnar-on-disk. It is designed to handle incremental indexing without full rebuilds.
- Latency Tail (P95, P99)
P50 is the median; P95 and P99 are the 95th and 99th percentile latencies. The tail is what wakes oncall, not the median: a 200ms median with a 5s P99 means 1% of requests see your system as broken.
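To make the median-versus-tail gap concrete, here is a small sketch using the nearest-rank percentile definition (one of several definitions in use; monitoring systems often interpolate instead). The sample latencies are made up for illustration.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# 100 hypothetical requests: 98 fast, 2 very slow.
latencies_ms = [200] * 98 + [5000] * 2
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
mean = sum(latencies_ms) / len(latencies_ms)
```

Here the median stays at 200 ms and the mean moves only to 296 ms, while P99 jumps to 5000 ms, so a dashboard showing only averages would hide the slow requests.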
- LLM Observability
The operational discipline of monitoring LLM-driven systems: tracing per-call inputs/outputs, eval-in-prod against held-out sets, drift detection on inputs and outputs, latency and cost percentiles.
- MDX
MDX is Markdown extended with JSX: write prose that imports components and renders them inline. The format behind most modern docs sites, with support in Next.js, Astro, and Docusaurus.
- PII Redaction
Detecting and removing personally-identifiable information from LLM inputs and outputs — names, emails, phone numbers, addresses, IDs. A classic small-model task: high-volume, narrow, latency-sensitive, with structured target output.
- Pydantic
Pydantic is the runtime type-validation library that has quietly become a hard dependency of the Python ML ecosystem. You declare a BaseModel and get validation, JSON-schema export, and a v2 Rust core for free.
- Semantic Cache
A semantic cache returns a cached LLM response when an incoming query is similar enough — by embedding cosine similarity — to a previous query, rather than requiring exact-string match.
- Speculative Decoding
Use a small 'draft' model to predict the next several tokens, then have the big 'target' model verify them in a single forward pass. The standard latency-reduction trick for LLM inference — typically 2-4× faster generation at the same output quality.
- Throughput (Tokens per Second)
Tokens per second per GPU is the production planning metric for LLM serving. Throughput scales with batch size up to a memory-bound ceiling, then plateaus. The key number for capacity planning, autoscaling, and unit-economic analysis.
- Vector Database
A vector database is a database whose primary index is an approximate-nearest-neighbor structure over high-dimensional vectors. The system substrate for production dense retrieval — it wraps an ANN algorithm (HNSW, IVF, PQ) with persistence, replication, metadata filtering, and incremental updates.
- vLLM Serving
vLLM is the dominant open-source LLM serving framework. Its core innovations are PagedAttention for KV-cache memory management, continuous batching for throughput, and prefix caching for prompt reuse.
