Embedding

Also known as: text embedding, dense embedding, neural embedding, vector embedding

TL;DR

An embedding is a fixed-size vector representation of a piece of text (or an image, audio clip, etc.) that places semantically similar inputs near each other in a high-dimensional space. It is the basis of dense retrieval, semantic search, and most modern RAG.

[Figure: Embedding, lookup table → point in space. Step 1, build a lookup table: a vocabulary of tokens (cat, dog, kitten, puppy, horse, king, queen, prince, ...), where each row stores a learned vector of the same shape (d = 8 shown of 768; |V| ≈ 50 000 rows, dim 768 in production). In a 2-D projection of the embedding space, similar tokens land near each other in clusters: animals, royalty / people, colors, verbs of motion, food.]

An embedding maps a piece of text (or any input) to a fixed-size vector — typically 768, 1024, 2560, or 4096 dimensions of 32-bit floats. The defining property: inputs with similar meaning land near each other in vector space, even if they share no surface tokens. “How do I reset my password?” and “I forgot my login credentials” sit close together; “How do I reset my password?” and “What’s the capital of Belgium?” sit far apart.

[Figure: Cosine neighborhood, why near = related. A slice of embedding space around the query q ("king"): tokens with similar meaning (queen, monarch, prince, throne, knight, palace, royal) land inside the same ring, while unrelated tokens (apple, highway, violin) land far away. Cosine similarity: cos(θ) = (q · d) / (‖q‖ ‖d‖). Near: cos ≥ 0.90. Mid: 0.70 – 0.90. Far: cos ≲ 0.40. Projection to 2D for visualization; real embedding spaces are 768–4096 dimensions.]
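The cosine similarity used above is small enough to write out directly. The sketch below uses hand-picked 4-dimensional vectors as hypothetical stand-ins for real embeddings (they are not model output), only to show how "near" and "far" turn into numbers.

```python
import numpy as np

def cosine_similarity(q: np.ndarray, d: np.ndarray) -> float:
    """cos(theta) = (q . d) / (||q|| * ||d||)."""
    return float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))

# Hypothetical toy vectors, chosen by hand for illustration only.
king = np.array([0.9, 0.8, 0.1, 0.0])
queen = np.array([0.8, 0.9, 0.2, 0.0])
apple = np.array([0.0, 0.1, 0.9, 0.8])

sim_near = cosine_similarity(king, queen)  # related pair: close to 1
sim_far = cosine_similarity(king, apple)   # unrelated pair: close to 0

print(f"king vs queen: {sim_near:.3f}")
print(f"king vs apple: {sim_far:.3f}")
```

With real embeddings the vectors come from a model, but the comparison step is the same.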

This proximity is what makes dense retrieval possible: embed the query, embed every document offline, and return the document vectors closest to the query vector. With approximate nearest neighbor (ANN) index structures, this scales to indexes of billions of documents at sub-100 ms latency per query.
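The retrieval loop described here can be sketched as an exact, brute-force search. This is a minimal illustration, assuming random vectors as stand-ins for embeddings that a real model would produce; `normalize` and `top_k` are hypothetical helper names, and a production system would swap the exhaustive scan for an approximate index.

```python
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    """L2-normalize along the last axis so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 3):
    """Exact search: score every document, return the k best ids and scores."""
    scores = doc_matrix @ query_vec   # one dot product per document
    order = np.argsort(-scores)[:k]   # indices sorted by descending score
    return order, scores[order]

rng = np.random.default_rng(0)
# Stand-in for document embeddings computed offline (assumed: 1000 docs, 64 dims).
doc_vectors = normalize(rng.normal(size=(1000, 64)))
# Stand-in for a query embedding: a slightly perturbed copy of document 42.
query_vector = normalize(doc_vectors[42] + 0.05 * rng.normal(size=64))

ids, scores = top_k(query_vector, doc_vectors, k=3)
print(ids, scores)
```

The exhaustive scan costs one dot product per document, which is the cost an index structure is there to avoid at scale.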

How embeddings get trained

Modern embedding models are trained on (query, relevant document) pairs from search logs, Q&A sites, click data, and LLM-synthesized examples. The objective is contrastive: pull each query and its relevant documents close together, and push random or hard-negative documents away. The result generalizes: the model embeds text it has never seen and still places it sensibly. Recent models go further by distilling graded relevance from a cross-encoder reranker instead of training on binary “related / not” labels, which lifts NDCG-style metrics disproportionately.
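To make the "pull close, push away" objective concrete, here is a minimal NumPy sketch of one common contrastive loss, InfoNCE with in-batch negatives. The choice of InfoNCE, the temperature value, and the random stand-in vectors are all assumptions for illustration; the text does not say which contrastive loss any particular model uses.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def info_nce_loss(q: np.ndarray, d: np.ndarray, temperature: float = 0.05) -> float:
    """q[i] and d[i] form a matching pair; every d[j], j != i, acts as a negative.

    Both inputs are assumed to be L2-normalized, shape (batch, dim).
    """
    logits = (q @ d.T) / temperature                     # (batch, batch) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))           # cross-entropy on the diagonal

rng = np.random.default_rng(0)
docs = l2_normalize(rng.normal(size=(8, 32)))
# Queries that sit close to their own document vs. queries placed at random.
aligned_queries = l2_normalize(docs + 0.05 * rng.normal(size=docs.shape))
random_queries = l2_normalize(rng.normal(size=docs.shape))

loss_aligned = info_nce_loss(aligned_queries, docs)
loss_random = info_nce_loss(random_queries, docs)
print(loss_aligned, loss_random)
```

The loss is low when each query already sits next to its own document and high when the pairing is random, which is the signal gradient descent would use to move the vectors.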

Embeddings turn “is this text similar?” into a single dot product. Everything in the dense-retrieval stack — ANN indexes, hybrid fusion, reranker distillation — is a downstream consequence of that compression.

Trained embeddings concentrate most of their signal in the direction of the vector, not its magnitude. Two paraphrases of the same sentence will point the same way but may have slightly different norms (depending on length, punctuation, and the encoder’s pooling); cosine similarity ignores that magnitude entirely. Euclidean distance, by contrast, conflates direction and magnitude, so on unnormalized vectors longer documents with the same meaning can end up “farther” from a short query than shorter ones. Most embedding models are explicitly trained with a cosine objective, or L2-normalize their outputs, which makes cosine the correct metric. On L2-normalized vectors Euclidean distance yields the same ranking as cosine; on unnormalized vectors the extra term it measures is mostly norm variation, which is essentially noise.
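A small numeric check of the direction-versus-magnitude point, using hand-picked 3-dimensional vectors as hypothetical stand-ins for embeddings. It also checks the identity ‖a − b‖² = 2 − 2·cos(a, b), which holds for unit-length vectors.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a - b))

def unit(a: np.ndarray) -> np.ndarray:
    return a / np.linalg.norm(a)

query = np.array([1.0, 2.0, 3.0])
same_direction_longer = 3.0 * query              # same meaning, larger norm
different_direction = np.array([3.0, 1.0, 2.0])  # different meaning, similar norm

# Cosine prefers the vector pointing the same way.
cos_same = cosine(query, same_direction_longer)
cos_diff = cosine(query, different_direction)

# Raw Euclidean distance prefers the wrong one, because of the norm gap.
dist_same = euclidean(query, same_direction_longer)
dist_diff = euclidean(query, different_direction)

# After L2-normalization the two metrics agree: ||a - b||^2 = 2 - 2 cos(a, b).
squared_dist_unit = euclidean(unit(query), unit(different_direction)) ** 2
identity_rhs = 2.0 - 2.0 * cos_diff

print(cos_same, cos_diff, dist_same, dist_diff)
```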

What you tune at inference time

A modern embedding model gives you knobs at query time without retraining:

  • Dimensions (truncation): output the full 2560-dim vector, or truncate to 1024 / 512 / 128. Tradeoff: smaller dims = cheaper index + faster search, but lower accuracy. See Matryoshka embeddings.
  • Quantization: store each dimension as an 8-bit integer (4× smaller) or as 1 bit (32× smaller). Same tradeoff curve.
  • input_type: “query” vs “document” — small asymmetry that reflects search shape (queries are short and interrogative; documents are long and declarative).
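The first two knobs in the list above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions: a random float32 vector stands in for model output, the int8 scheme is simple symmetric per-vector scaling, and the 1-bit scheme keeps only the sign. Hosted embedding APIs may implement either step differently, and `truncate`, `quantize_int8`, and `quantize_binary` are hypothetical helper names.

```python
import numpy as np

def truncate(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` components and re-normalize to unit length."""
    head = vec[:dims]
    return head / np.linalg.norm(head)

def quantize_int8(vec: np.ndarray):
    """Symmetric per-vector int8 quantization (assumed scheme): returns codes and scale."""
    scale = float(np.abs(vec).max()) / 127.0
    codes = np.round(vec / scale).astype(np.int8)
    return codes, scale

def quantize_binary(vec: np.ndarray) -> np.ndarray:
    """1-bit quantization: keep the sign of each dimension, packed 8 per byte."""
    return np.packbits(vec > 0)

rng = np.random.default_rng(0)
full = rng.normal(size=2560).astype(np.float32)  # stand-in for a 2560-dim embedding

small = truncate(full, 512)          # 512 float32 values
codes, scale = quantize_int8(small)  # 512 int8 values
bits = quantize_binary(small)        # 512 bits packed into bytes

print(small.nbytes, codes.nbytes, bits.nbytes)
```

Comparing `nbytes` across the three arrays reproduces the 4× and 32× ratios quoted in the list. Truncation is only expected to preserve quality when the model was trained for it.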

Storage and dimensions

A 2048-dim float32 vector is 8 KB. A million-document index is 8 GB before any index structure on top. Quantization and dimension truncation matter at scale: the same index in int8 is 2 KB per vector, and 256-dim int8 is 256 bytes per vector, which stays manageable across billions of documents.
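The arithmetic behind these figures, as a small helper. `index_bytes` is a hypothetical name, and the result covers raw vector storage only, with no index overhead.

```python
def index_bytes(num_vectors: int, dims: int, bits_per_dim: int) -> int:
    """Raw vector storage in bytes, excluding any index structure on top."""
    return num_vectors * dims * bits_per_dim // 8

per_vector_f32 = index_bytes(1, 2048, 32)          # 2048 dims x 4 bytes
million_docs_f32 = index_bytes(1_000_000, 2048, 32)
per_vector_int8 = index_bytes(1, 2048, 8)
per_vector_small = index_bytes(1, 256, 8)          # truncated to 256 dims, int8
billion_docs_small = index_bytes(1_000_000_000, 256, 8)

print(per_vector_f32, "bytes per float32 vector")
print(million_docs_f32 / 1e9, "GB for one million float32 vectors")
print(per_vector_int8, "bytes per int8 vector")
print(per_vector_small, "bytes per 256-dim int8 vector")
print(billion_docs_small / 1e9, "GB for one billion 256-dim int8 vectors")
```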

Common embedding dimensions are hardware-friendly multiples of 64 or 128, the natural granularity for SIMD instructions and tensor cores. A dot product over a 1024-dim vector decomposes cleanly into 16 chunks of 64 floats. The other constraint is the underlying transformer’s hidden size: BERT-base is 768, BERT-large is 1024, and some LLM-derived embedders are 4096. Pooling and projection layers can shift this, but most production embedders expose the hidden dim directly.

Where embeddings break

Where bi-encoder embeddings struggle
  • Rare-token queries — model numbers, drug brand names, customer IDs the embedder never saw in training
  • Long-document recall where the supporting span is buried at position 3000 of a 4000-token chunk
  • Aggregation or counting queries (“how many models support Matryoshka?”) that require structured reasoning, not similarity
  • Cross-lingual queries against a monolingual encoder
  • Negation — “not a dependency” and “is a dependency” tend to embed close together

For domain-jargon-heavy corpora the fix is rarely “fine-tune the embeddings.” It’s usually “add keyword search to catch the rare exact tokens”: a hybrid of lexical and dense retrieval recovers most of that gap with no training. A cross-encoder reranker on top catches what the bi-encoder still leaves on the table.
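One common way to merge a dense ranking with an exact-token (keyword) ranking is reciprocal rank fusion (RRF). The sketch below is an assumption about how such a hybrid could be wired, not a method the text prescribes; the document ids are made up, and k = 60 is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k: int = 60):
    """Fuse several ranked lists of doc ids; each hit contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results for a query containing a rare part number.
dense_ranking = ["doc_a", "doc_b", "doc_c"]      # from the embedding index
keyword_ranking = ["doc_x17", "doc_a", "doc_d"]  # from an exact-token match

fused = reciprocal_rank_fusion([dense_ranking, keyword_ranking])
print(fused)
```

A document that only the keyword side finds (here `doc_x17`) still reaches the fused list, and a document both sides agree on rises to the top.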

Go further

Why are reranker-distilled embeddings disproportionately better?

Standard contrastive training only teaches 'related vs unrelated' — a binary signal. Distilling from a cross-encoder transfers continuous, graded relevance into the embedding space, so the bi-encoder learns 'how relevant', not just 'whether relevant'. That graded signal is exactly what NDCG-style metrics reward.
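A minimal sketch of what "graded, not binary" can mean as a training signal: the student (bi-encoder) score distribution over a candidate list is pulled toward the teacher (cross-encoder) distribution with a KL divergence. The loss choice and all scores below are assumptions for illustration; the text does not specify the distillation recipe.

```python
import numpy as np

def softmax(x: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    z = (x - x.max()) / temperature  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_scores, teacher_scores, temperature: float = 1.0) -> float:
    """KL(teacher || student) over one query's candidate documents."""
    p = softmax(np.asarray(teacher_scores, dtype=float), temperature)
    q = softmax(np.asarray(student_scores, dtype=float), temperature)
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Hypothetical cross-encoder scores for four candidates: graded relevance.
teacher = [4.0, 2.5, 0.5, -1.0]
# A student that reproduces the grading vs. one that only knows "related / not".
student_graded = [3.8, 2.4, 0.7, -0.9]
student_binary = [1.0, 0.0, 0.0, 0.0]

loss_graded = distillation_loss(student_graded, teacher)
loss_binary = distillation_loss(student_binary, teacher)
print(loss_graded, loss_binary)
```

The binary student gets the top document right but is still penalized, because it cannot tell the second candidate from the fourth.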

How small can I shrink the vectors before quality cracks?

With Matryoshka-trained models you can usually go from 2560→512 dims with single-digit-percent loss, and quantize the rest to int8 on top. Below ~256 dims or 1-bit quantization, recall starts to fall off a cliff for most production corpora.

Where do embeddings stop being enough?

Whenever you need ranking accuracy at the top — single-vector similarity is too coarse to separate the top-3 candidates. Production stacks layer a cross-encoder reranker over the embedding first-pass to recover the precision a bi-encoder can't deliver alone.
