Embedding

Also known as: text embedding, dense embedding, neural embedding, vector embedding

TL;DR

An embedding is a fixed-size vector representation of a piece of text (or image, audio, etc.) that places semantically similar inputs near each other in a high-dimensional space. It is the basis of dense retrieval, semantic search, and most modern RAG.

An embedding maps a piece of text (or any input) to a fixed-size vector — typically 768, 1024, 2560, or 4096 dimensions of 32-bit floats. The defining property: inputs with similar meaning land near each other in vector space, even if they share no surface tokens. “How do I reset my password?” and “I forgot my login credentials” sit close together; “How do I reset my password?” and “What’s the capital of Belgium?” sit far apart.

This proximity is what makes dense retrieval possible: embed the query, embed every document offline, return the document vectors closest to the query vector. With HNSW, IVF, and other ANN index structures, this scales to indexes of billions of documents at sub-100ms query latency.
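The retrieval loop described above can be sketched with exact (brute-force) search; an ANN index such as HNSW or IVF replaces the full scan at scale. The 4-dim vectors and the helper names `l2_normalize` and `top_k` are made up for illustration and stand in for real model output.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each vector to unit length so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k documents most similar to the query, best first."""
    scores = l2_normalize(doc_vecs) @ l2_normalize(query_vec)
    return np.argsort(-scores)[:k]

# Toy "embeddings" invented for this sketch; a real embedder would produce them.
doc_vecs = np.array([
    [0.9, 0.1, 0.0, 0.1],  # doc 0: "I forgot my login credentials"
    [0.0, 0.1, 0.9, 0.2],  # doc 1: "Brussels is the capital of Belgium"
    [0.7, 0.3, 0.1, 0.0],  # doc 2: "Steps to change your account password"
])
query_vec = np.array([1.0, 0.2, 0.0, 0.0])  # "How do I reset my password?"

ranked = top_k(query_vec, doc_vecs, k=2)
print(ranked)
```

The document vectors would normally be normalized once at indexing time, so that only the query is normalized per request.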

How embeddings get trained

Modern embedding models are trained on (query, relevant document) pairs from search logs, Q&A sites, click data, and LLM-synthesized examples. The objective is contrastive: pull each query and its relevant documents close together, and push random or hard-negative documents away. The result generalizes: the model embeds text it has never seen and still places it sensibly. Recent models go further by distilling graded relevance from a cross-encoder instead of training on binary “related / not” labels, which lifts NDCG-style metrics disproportionately.
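One common form of the contrastive objective is an InfoNCE loss with in-batch negatives; the text does not say which loss any particular model uses, so treat this as an illustrative assumption. The function name `info_nce_loss` and the temperature value are likewise chosen for the sketch, and the random arrays stand in for encoder outputs.

```python
import numpy as np

def info_nce_loss(q: np.ndarray, d: np.ndarray, temperature: float = 0.05) -> float:
    """q[i] and d[i] form a relevant pair; every d[j] with j != i is a negative for q[i]."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature                      # cosine similarities, sharpened
    logits = logits - logits.max(axis=1, keepdims=True)   # subtract row max for stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # cross-entropy on the matching pair

rng = np.random.default_rng(0)
docs = rng.normal(size=(8, 16))
aligned_queries = docs + 0.01 * rng.normal(size=docs.shape)  # queries close to their docs
random_queries = rng.normal(size=docs.shape)                 # queries unrelated to any doc

loss_aligned = info_nce_loss(aligned_queries, docs)
loss_random = info_nce_loss(random_queries, docs)
```

Training pushes the encoder toward the low-loss regime: matching pairs score far above every other document in the batch.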

Embeddings turn “is this text similar?” into a single dot product. Everything in the dense-retrieval stack — ANN indexes, hybrid fusion, reranker distillation — is a downstream consequence of that compression.

Trained embeddings concentrate most of their signal in the direction of the vector, not its magnitude. Two paraphrases of the same sentence will point the same way but may have slightly different norms (depending on length, punctuation, the encoder’s pooling); cosine similarity ignores that magnitude entirely. Euclidean distance on raw vectors, by contrast, conflates direction and magnitude, so a document whose vector has a larger norm can end up “farther” from a query than a document with the same meaning and a smaller norm. Most embedding models are trained with a cosine objective, and many L2-normalize their outputs, which makes cosine the natural metric. On L2-normalized vectors, Euclidean distance and cosine similarity produce the same ranking (squared distance equals 2 − 2 × cosine), so the choice only matters for unnormalized vectors, where Euclidean distance mixes in a magnitude component the model was never trained to make meaningful.
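A quick numerical check of the direction-versus-magnitude point, using made-up 3-dim vectors. Scaling a vector changes its Euclidean distance to the query but not its cosine similarity; once both vectors are unit length, squared Euclidean distance is a fixed function of cosine.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a - b))

def unit(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

query = np.array([1.0, 2.0, 2.0])
small_norm_doc = np.array([1.0, 2.0, 2.1])
large_norm_doc = 3.0 * small_norm_doc  # same direction, three times the norm

cos_small = cosine(query, small_norm_doc)
cos_large = cosine(query, large_norm_doc)       # identical to cos_small
dist_small = euclidean(query, small_norm_doc)
dist_large = euclidean(query, large_norm_doc)   # much larger than dist_small

# On unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
sq_dist_unit = euclidean(unit(query), unit(large_norm_doc)) ** 2
```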

What you tune at inference time

A modern embedding model gives you knobs at query time without retraining:

Dimensions (truncation): output the full 2560-dim vector, or truncate to 1024 / 512 / 128. Tradeoff: smaller dims = cheaper index + faster search, but lower accuracy. See MRL / Matryoshka.
Quantization: store each dimension as an 8-bit integer (4× smaller) or 1 bit (32× smaller). Same tradeoff curve.
input_type: “query” vs “document” — small asymmetry that reflects search shape (queries are short and interrogative; documents are long and declarative).
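The first two knobs can be sketched in NumPy. This assumes symmetric scalar quantization for int8 and sign-based packing for 1-bit, which are common choices but not necessarily what any given provider implements; the helper names are hypothetical, and the random matrix only stands in for model output. Truncation preserves quality only for models trained for it, such as Matryoshka-style models.

```python
import numpy as np

def truncate(vecs: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` coordinates, then re-normalize to unit length."""
    cut = vecs[:, :dims]
    return cut / np.linalg.norm(cut, axis=1, keepdims=True)

def quantize_int8(vecs: np.ndarray):
    """Symmetric scalar quantization: map [-max_abs, max_abs] onto [-127, 127]."""
    scale = np.abs(vecs).max() / 127.0
    return np.round(vecs / scale).astype(np.int8), scale

def quantize_binary(vecs: np.ndarray) -> np.ndarray:
    """Keep only the sign of each dimension, packed 8 dimensions per byte."""
    return np.packbits(vecs > 0, axis=1)

rng = np.random.default_rng(0)
full = rng.normal(size=(1000, 2560)).astype(np.float32)  # random stand-in vectors

small = truncate(full, 512)          # float32, 512 dims
codes, scale = quantize_int8(small)  # 4x smaller than `small`
bits = quantize_binary(small)        # 32x smaller than `small`
```

Multiplying `codes` by `scale` gives an approximate reconstruction of `small`; binary codes are typically compared with Hamming distance instead.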

Storage and dimensions

A 2048-dim float32 vector is 8 KB. A million-document index is 8 GB before any index structure on top. Quantization and dimension truncation matter at scale: the same index in int8 is 2 KB per vector, and 256-dim int8 is 256 bytes — manageable across billions.
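The arithmetic above reduces to vectors × dimensions × bits per dimension. A small helper (the name `index_size_bytes` is invented here) reproduces the figures for raw vector storage, excluding any ANN index overhead:

```python
def index_size_bytes(num_vectors: int, dims: int, bits_per_dim: int) -> int:
    """Raw vector storage only; graph or inverted-list overhead comes on top."""
    return num_vectors * dims * bits_per_dim // 8

per_vector_f32 = index_size_bytes(1, 2048, 32)           # 8192 bytes = 8 KB
million_f32 = index_size_bytes(1_000_000, 2048, 32)      # 8_192_000_000 bytes, about 8 GB
per_vector_int8 = index_size_bytes(1, 2048, 8)           # 2048 bytes = 2 KB
per_vector_small = index_size_bytes(1, 256, 8)           # 256 bytes
billion_small = index_size_bytes(1_000_000_000, 256, 8)  # 256_000_000_000 bytes, about 256 GB
```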

Common embedding dimensions are hardware-friendly multiples of 64 or 128, the natural granularity for SIMD instructions and tensor cores. A dot product over a 1024-dim vector decomposes cleanly into 16 chunks of 64 floats. The other constraint is the underlying transformer’s hidden size: BERT-base is 768, BERT-large is 1024, some LLM-derived embedders are 4096. Pooling and projection layers can shift this, but most production embedders expose the hidden dim directly.

Where embeddings break

Where bi-encoder embeddings struggle

Rare-token queries — model numbers, drug brand names, customer IDs the embedder never saw in training
Long-document recall where the supporting span is buried at position 3000 of a 4000-token chunk
Aggregation or counting queries (“how many models support Matryoshka?”) that require structured reasoning, not similarity
Cross-lingual queries against a monolingual encoder
Negation — “not a dependency” and “is a dependency” tend to embed close together

For domain-jargon-heavy corpora the fix is rarely “fine-tune the embeddings.” It’s usually “add BM25 to catch the rare exact tokens” — hybrid search recovers most of that gap with no training. A reranker on top catches what the bi-encoder still leaves on the table.
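One widely used way to combine a BM25 ranking with a dense ranking is reciprocal rank fusion (RRF). The text above does not specify a fusion method, so this is an assumption for illustration; the document IDs are invented, and `k=60` is the conventional default rather than a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs: each list contributes 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

dense_ranking = ["doc_a", "doc_b", "doc_c"]  # from the embedding index
bm25_ranking = ["doc_x", "doc_a", "doc_b"]   # doc_x matches a rare exact token

fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
print(fused)
```

Here `doc_x`, which the dense retriever missed entirely, enters the fused list ahead of `doc_c`, while documents both retrievers agree on stay at the top.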

Go further

Why are reranker-distilled embeddings disproportionately better?

Standard contrastive training only teaches 'related vs unrelated' — a binary signal. Distilling from a cross-encoder transfers continuous, graded relevance into the embedding space, so the bi-encoder learns 'how relevant', not just 'whether relevant'. That graded signal is exactly what NDCG-style metrics reward.

Related: Knowledge distillation, zELO training methodology, Cross-encoder

How small can I shrink the vectors before quality cracks?

With Matryoshka-trained models you can usually go from 2560→512 dims with single-digit-percent loss, and quantize the rest to int8 on top. Below ~256 dims or 1-bit quantization, recall starts to fall off a cliff for most production corpora.

Related: MRL / Matryoshka, Embedding quantization, Johnson-Lindenstrauss lemma

Where do embeddings stop being enough?

Whenever you need ranking accuracy at the top — single-vector similarity is too coarse to separate the top-3 candidates. Production stacks layer a cross-encoder reranker over the embedding first-pass to recover the precision a bi-encoder can't deliver alone.

Related: Reranker, Bi-encoder, Hybrid search end-to-end (playbook)
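The two-stage pattern can be sketched as follows. The function `retrieve_then_rerank` and the `token_overlap` scorer are hypothetical: in practice the pair scorer would be a cross-encoder that reads the query and the document together, and the token-overlap function here is only a stand-in so the example runs without a model.

```python
import numpy as np

def retrieve_then_rerank(query, query_vec, docs, doc_vecs, pair_scorer,
                         first_pass_k=50, final_k=3):
    """Stage 1: cheap vector similarity over the corpus. Stage 2: pairwise rescoring."""
    sims = doc_vecs @ query_vec
    candidates = np.argsort(-sims)[:first_pass_k]
    rescored = sorted(candidates, key=lambda i: -pair_scorer(query, docs[i]))
    return [int(i) for i in rescored[:final_k]]

def token_overlap(query: str, doc: str) -> int:
    """Stand-in for a cross-encoder score: number of shared lowercase tokens."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

docs = [
    "reset your password in settings",
    "password policy overview",
    "capital of Belgium",
]
doc_vecs = np.array([[0.8, 0.2], [0.9, 0.1], [0.0, 1.0]])  # toy vectors
query, query_vec = "how to reset password", np.array([1.0, 0.0])

result = retrieve_then_rerank(query, query_vec, docs, doc_vecs, token_overlap,
                              first_pass_k=2, final_k=2)
```

The first pass ranks document 1 above document 0; the pairwise scorer reverses that order among the two candidates, which is the precision gain the reranker is there to provide.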


