Dense Retrieval

Also known as: embedding-based retrieval, vector retrieval, neural retrieval

TL;DR

Dense retrieval finds documents by comparing their embeddings to a query embedding via cosine or dot product, served from an approximate-nearest-neighbor index.

Dense retrieval is the family of retrieval methods that score (query, document) similarity by embedding both into the same vector space and comparing them with cosine or dot product. Concretely:

At index time, run every document through an embedding model and store the resulting vectors in an ANN index (HNSW, IVF, ScaNN).
At query time, embed the query through the same (or a query-side) encoder.
Search the ANN index for the K nearest vectors. Those are your candidates.
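The three steps above can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions: `toy_embed` is a hypothetical stand-in for a real embedding model (it hashes words into buckets, so it is lexical rather than semantic), and the exhaustive scan in `search` takes the place of an ANN index such as HNSW.

```python
import heapq
import math
import zlib


def toy_embed(text, dim=4096):
    """Hypothetical embedder: hashed bag-of-words, L2-normalized.

    A real system would call an embedding model here instead.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def build_index(docs):
    """Index time: embed every document once and keep the vectors."""
    return [(doc_id, toy_embed(text)) for doc_id, text in docs.items()]


def search(index, query, k=3):
    """Query time: embed the query, then return the k highest-scoring docs.

    Vectors are unit length, so the dot product equals cosine similarity.
    This brute-force scan is what an ANN index approximates at scale.
    """
    q = toy_embed(query)
    scored = ((sum(a * b for a, b in zip(q, vec)), doc_id) for doc_id, vec in index)
    return heapq.nlargest(k, scored)


docs = {
    "d1": "how to reset a forgotten password",
    "d2": "quarterly revenue report for the sales team",
    "d3": "password policy and account lockout rules",
}
index = build_index(docs)
results = search(index, "reset my password", k=2)  # list of (score, doc_id)
```

Swapping `toy_embed` for a trained bi-encoder and the scan for an ANN lookup changes the quality and the latency, but not the shape of the pipeline.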

The encoder pattern is almost always a bi-encoder: query and document are embedded independently, so document vectors are computed once at index time and reused until the embedding model changes.

What dense retrieval gets right

Lexical retrieval (BM25, TF-IDF) only matches surface tokens. “Car insurance” and “auto coverage” share no words; lexically they’re unrelated. Dense retrieval, trained on (query, relevant document) pairs, learns that they’re paraphrases and embeds them near each other. This is what lets you ask a question in natural language and get back documents that don’t echo your phrasing. RAG pipelines depend on it.

Semantic matching beyond surface tokens is the entire reason dense retrieval exists. Everything else is implementation detail.

What it gets wrong

A single fixed-size vector is a lossy compression of an entire document. Things that compress poorly:

Where dense retrieval struggles

Rare exact tokens — model numbers, identifiers, citations. The embedding model never saw them in training, so they get lumped into a generic “stuff” region of the space.
Long documents — multi-topic pages forced into one vector lose per-section signal.
Negation and scoped quantifiers — “documents that mention X but not Y” is hard to express in a vector dot product.
Aggregation queries — “the fastest reranker” requires ordering, which similarity can’t do.

The fix isn’t usually to abandon dense retrieval. It’s hybrid search (combine with sparse retrieval for rare-token coverage) and reranking (a cross-encoder over the top-K to recover fine-grained ordering).
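One common way to combine a dense ranking with a sparse one is reciprocal rank fusion (RRF). The text above does not say which fusion method to use, so treat this as one possible sketch rather than the prescribed approach. The document IDs are made up, and `k=60` is a commonly used default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so documents ranked well by both retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending), breaking ties by doc ID.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_ranking = ["d2", "d7", "d4"]   # hypothetical dense top-3
sparse_ranking = ["d9", "d2", "d4"]  # hypothetical BM25 top-3
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

RRF uses only ranks, which sidesteps the problem that cosine scores and BM25 scores live on different scales. The fused list would then typically be passed to a reranker.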

The BEIR benchmark (Thakur et al., 2021) revealed an uncomfortable result: out-of-domain, BM25 frequently beats single-vector dense retrievers. The reason is generalization. A dense embedder is trained on a specific distribution (MS MARCO web passages, typically); when deployed on a domain it never saw — biomedical text, legal documents, code — its embedding geometry doesn’t transfer. Bi-encoder embeddings encode a model’s training prior, not universal semantic similarity.

BM25, in contrast, is statistical — it reweights based on the corpus you give it, not the corpus it was trained on. So when you point dense retrieval at SciFact or NFCorpus without fine-tuning, it underperforms BM25 on raw NDCG. The fix is hybrid retrieval plus a strong cross-encoder reranker, which has been the production answer since 2022.
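To make the "reweights based on the corpus you give it" point concrete, here is a minimal BM25 sketch. It assumes whitespace tokenization, the common defaults `k1=1.2` and `b=0.75`, and the smoothed IDF variant used by Lucene-style implementations; the tiny corpus and the identifier `XJ-4410` are invented for illustration.

```python
import math
from collections import Counter


class BM25:
    def __init__(self, corpus, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [Counter(text.lower().split()) for text in corpus]
        self.lengths = [sum(tf.values()) for tf in self.docs]
        self.avgdl = sum(self.lengths) / len(self.docs)
        # Document frequency comes from *this* corpus, not from any training set.
        df = Counter(term for tf in self.docs for term in tf)
        n = len(self.docs)
        self.idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5)) for t, c in df.items()}

    def score(self, query, doc_index):
        tf, dl = self.docs[doc_index], self.lengths[doc_index]
        total = 0.0
        for term in query.lower().split():
            f = tf.get(term, 0)
            if f == 0:
                continue
            denom = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * f * (self.k1 + 1) / denom
        return total


corpus = [
    "protein binding assay",
    "protein folding",
    "protein kinase inhibitor XJ-4410",
]
bm25 = BM25(corpus)
scores = [bm25.score("XJ-4410", i) for i in range(len(corpus))]
```

In this corpus "protein" appears in every document and gets a low IDF, while the rare identifier gets a high one. No training is involved, so the weighting adapts automatically to whatever domain is indexed.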

Modern instruction-tuned embedders (voyage-3, openai text-embedding-3, zembed-1) and ColBERT-style late-interaction methods close most of the gap, but the lesson stands: pure dense retrieval was oversold in 2022, and BM25 keeps earning its place as a cheap robustness layer.

Implementation shape

A typical production dense retrieval system has three components:

Embedder — the bi-encoder model. zembed-1, OpenAI text-embedding-3, Voyage-4, Cohere, etc.
Vector index — HNSW (default), IVF, or ScaNN. Often inside a vector DB (Pinecone, Weaviate, Qdrant, pgvector) or a self-hosted FAISS.
Ingest pipeline — chunking, embedding, upserting. Usually the operational complexity nobody warns you about.

At scale, dense retrieval is expensive at ingest (an embedder forward pass per document, plus the ANN index build) and comparatively cheap at query time (one embedder call plus a graph walk). BM25, by contrast, is cheap on both axes.

Where dense retrieval sits in the stack

Dense retrieval is the first stage of two-stage retrieval. It surfaces a few hundred semantic candidates fast; a reranker reorders them precisely. Both halves are needed: a relevant document that the first stage misses never reaches the reranker, and the ordering inside the candidate set is what users actually consume.
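The two-stage shape can be sketched as a small function that takes both stages as callables. Everything in the demo is hypothetical: `first_stage` just returns a prefix of a toy corpus, and `overlap_score` counts shared words where a real system would call a cross-encoder.

```python
from typing import Callable, List


def two_stage_search(
    query: str,
    first_stage: Callable[[str, int], List[str]],
    rerank_score: Callable[[str, str], float],
    candidates_k: int = 100,
    final_k: int = 10,
) -> List[str]:
    """Fetch candidates cheaply, then reorder only those candidates."""
    candidates = first_stage(query, candidates_k)
    reranked = sorted(candidates, key=lambda doc: rerank_score(query, doc), reverse=True)
    return reranked[:final_k]


corpus = ["alpha beta", "beta gamma", "gamma delta", "alpha gamma delta"]


def first_stage(query: str, k: int) -> List[str]:
    # Hypothetical retriever: pretends the corpus is already in first-stage order.
    return corpus[:k]


def overlap_score(query: str, doc: str) -> float:
    # Hypothetical reranker: number of words shared by query and document.
    return float(len(set(query.split()) & set(doc.split())))


narrow = two_stage_search("gamma delta", first_stage, overlap_score, candidates_k=2, final_k=1)
wide = two_stage_search("gamma delta", first_stage, overlap_score, candidates_k=4, final_k=1)
```

With `candidates_k=2` the best document never reaches the reranker, so the result is the best of a poor candidate set. Widening the candidate pool lets the reranker find it.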

The vector geometry produced by an embedder is specific to the trained weights. The “meaning” of a particular direction in 1024-dim space is whatever that model learned during training; a different model — even one trained on overlapping data — produces a different, incompatible geometry.

Concretely: index a corpus with text-embedding-3-large, then query with voyage-3-large. Cosine similarity between the query and corpus vectors is essentially uncorrelated with relevance, because you’re comparing points in two unrelated coordinate systems. The retrieval result is effectively random.

The exception is asymmetric embedders, where the same model is run with different prompt prefixes for queries vs documents (query: ... vs passage: ...). The two encoder passes share weights but produce slightly different geometries, calibrated by training to align. As long as you use the matched query/document mode of the same model, this is fine; it’s only swapping models entirely that breaks things.
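A cheap defensive measure is to record which model produced an index and refuse vectors from any other model. The sketch below is one possible design, not a feature of any particular vector database; `VectorIndex`, `with_prefix`, and the model names are hypothetical, and the prefixes are the `query: ` / `passage: ` pair mentioned above.

```python
from dataclasses import dataclass, field


def with_prefix(text, mode):
    """Apply the matched prompt prefix for an asymmetric embedder."""
    prefixes = {"query": "query: ", "document": "passage: "}
    return prefixes[mode] + text


@dataclass
class VectorIndex:
    model_name: str
    dim: int
    vectors: dict = field(default_factory=dict)

    def validate(self, vector, model_name):
        """Reject vectors that come from a different model or dimension."""
        if model_name != self.model_name:
            raise ValueError(
                f"index was built with {self.model_name!r}, got {model_name!r}"
            )
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")

    def add(self, doc_id, vector, model_name):
        self.validate(vector, model_name)
        self.vectors[doc_id] = vector


index = VectorIndex(model_name="embedder-a", dim=3)
index.add("d1", [0.1, 0.2, 0.3], model_name="embedder-a")
```

Calling `index.validate(query_vector, model_name)` before every search turns a silent relevance failure into a loud error.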

Re-embedding 100M documents with a modern embedder is a non-trivial undertaking. At ~250 tokens/document and a 7B-class embedder running 5K tokens/sec/GPU on H100, peak throughput is 20 docs/sec, or 72K docs/hour/GPU. Budgeting a more conservative ~50K docs/hour/GPU for batching and I/O overhead, full re-embedding takes about 2,000 GPU-hours, or about $4-6K at spot rates.
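The estimate is easy to rerun for other corpus sizes with a small calculator. The `efficiency` factor and the $2.50 per GPU-hour price are assumptions introduced here: 5K tokens/sec at 250 tokens/document is 72K docs/hour at peak, so the ~50K docs/hour figure corresponds to roughly 70% of peak, and $4-6K over about 2,000 GPU-hours implies $2-3 per GPU-hour.

```python
def reembedding_cost(
    num_docs,
    tokens_per_doc,
    tokens_per_sec_per_gpu,
    usd_per_gpu_hour,
    efficiency=1.0,
):
    """Return (docs_per_hour_per_gpu, gpu_hours, usd) for a full re-embed.

    `efficiency` scales peak throughput down to account for batching,
    padding, and I/O overhead (an assumed knob, not a measured value).
    """
    docs_per_sec = tokens_per_sec_per_gpu * efficiency / tokens_per_doc
    docs_per_hour = docs_per_sec * 3600
    gpu_hours = num_docs / docs_per_hour
    return docs_per_hour, gpu_hours, gpu_hours * usd_per_gpu_hour


# Peak throughput, no overhead.
peak = reembedding_cost(100_000_000, 250, 5_000, usd_per_gpu_hour=2.50)

# Roughly the scenario in the text: ~70% of peak, assumed $2.50 per GPU-hour.
realistic = reembedding_cost(
    100_000_000, 250, 5_000, usd_per_gpu_hour=2.50, efficiency=0.7
)
```

Under these assumptions the realistic case lands near 2,000 GPU-hours and about $5K, in line with the range quoted above.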

Building the ANN index on top adds a second substantial cost. HNSW construction is roughly O(N log N) graph operations with high constants; for 100M vectors at 1024 dimensions, the index build runs into the tens of CPU-hours per replica. Sharding helps but adds complexity.

The implication is that embedding-model upgrades are expensive infrastructure events, not casual upgrades. Production teams often defer them by 6-12 months unless the new model offers a step-function quality improvement on workloads they can measure. Matryoshka embeddings reduce the pain — you can index at full dimension and serve at truncated dimension, allowing dimension-side experimentation without re-embedding.
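Serving a Matryoshka embedding at a smaller dimension amounts to keeping a prefix of the vector and renormalizing it. The sketch below assumes the embedder was trained with a Matryoshka-style objective; truncating an ordinary embedding this way generally degrades quality much more. The function name and the example vector are illustrative.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and rescale to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError(f"dim must be in 1..{len(vec)}, got {dim}")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale
    return [x / norm for x in head]


full = [3.0, 4.0, 12.0]              # toy stand-in for a stored full-dimension vector
short = truncate_embedding(full, 2)  # serve at 2 dimensions
```

Renormalizing matters because cosine and dot-product scores are only comparable across documents when all vectors are rescaled the same way. Queries need the same truncation as documents.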

Go further

When should I use dense retrieval over BM25?

When your queries paraphrase rather than quote. 'How do I reset my password?' doesn't share words with 'I forgot my login credentials' but they're the same intent — dense handles that, BM25 doesn't. The flip side: BM25 wins on rare exact tokens (model numbers, codes), so production stacks usually run hybrid.

Related: Hybrid search, Sparse retrieval, BM25

Why is dense retrieval still 'first-pass' rather than the whole pipeline?

A bi-encoder squeezes each document into one fixed vector: enough for similarity, not enough for fine-grained ranking. The top-3 from a vector index are usually plausible but rarely in the right order. A cross-encoder reranker over the dense top-100 fixes the ordering for orders of magnitude less compute than reranking the whole corpus.

Related: Bi-encoder, Reranker, Cross-encoder

What's the cost structure compared to BM25?

Dense retrieval moves cost from query time to index time. BM25 is index-cheap and query-cheap; dense is index-expensive (every document needs a forward pass through the embedder, plus an ANN graph build) but query-cheap thereafter. For static corpora dense pays for itself; for fast-moving corpora the ingest pipeline matters.

Related: Embedding, ANN, First-pass retrieval
