Hybrid Search

Also known as: hybrid retrieval, lexical + dense, BM25 + embeddings

TL;DR

Hybrid search combines lexical retrieval (BM25) with dense retrieval (embeddings) into one ranked candidate set. Each method catches what the other misses, so the union is more recall-complete than either alone.

Hybrid search runs two retrieval methods over the same corpus and merges their results. The classical pairing is BM25 (lexical, keyword-based) plus dense embedding similarity. The two methods are good at different things:

  • BM25 is great when the query contains rare, exact tokens (proper nouns, model numbers, jargon). It misses paraphrases.
  • Dense embeddings capture semantic similarity even when no words overlap (“car insurance” matches “auto coverage”). They can fail on rare exact terms because the embedding model never saw them in training.

Combine them and you cover both failure modes.

[Figure: Hybrid search, BM25 + dense → reciprocal-rank fusion. A user query ("what is RAG?") goes to a lexical/sparse leg (BM25: tf · idf · query overlap) and a semantic/dense leg (embedding cosine: q · d / |q||d|). Each leg returns its own top-5 ranking, and reciprocal-rank fusion (score = Σ 1 / (k + rankᵢ), k = 60, ranking-only, scale-free) merges them into one fused ranking, with each document marked as found by both legs, BM25 only, or dense only.]

How the merge works

Three combination strategies, in order of adoption:

  1. Reciprocal rank fusion (RRF): each method produces a ranked list; each document gets a score of 1 / (k + rank) per list (with k typically 60), and these are summed across methods. Simple, robust, no learned parameters.
  2. Linear weighted sum: score = α · BM25 + (1 − α) · dense. Requires picking α, which you tune on a held-out eval set. Sensitive to score scales (BM25 scores aren’t bounded; dense cosines are in [−1, 1]).
  3. Learned fusion: train a tiny model (or just a linear or logistic combiner) on top of features from both methods. More accurate; more infra.
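The scale problem with the linear weighted sum is easier to see in code. The sketch below is one possible way to do it, not a canonical recipe: it assumes min-max normalization per list (other normalizations exist), treats a document missing from one list as scoring 0 there, and uses made-up scores. The names `min_max` and `weighted_fusion` are illustrative.

```python
def min_max(scores):
    """Rescale a {doc_id: score} map to [0, 1]; a constant map becomes all zeros."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def weighted_fusion(bm25_scores, dense_scores, alpha=0.5):
    """Fused score = alpha * bm25 + (1 - alpha) * dense, after normalization.

    Assumption: a document absent from one list contributes 0 for that list.
    """
    bm25_n, dense_n = min_max(bm25_scores), min_max(dense_scores)
    docs = set(bm25_n) | set(dense_n)
    fused = {
        doc: alpha * bm25_n.get(doc, 0.0) + (1 - alpha) * dense_n.get(doc, 0.0)
        for doc in docs
    }
    # Sort by descending score; break ties by doc id for determinism.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy scores: BM25 is unbounded, cosine similarity is not.
bm25 = {"d17": 12.4, "d42": 9.8, "d81": 7.2}
dense = {"d42": 0.92, "d23": 0.88, "d17": 0.85}

ranking = weighted_fusion(bm25, dense, alpha=0.5)
for doc, score in ranking:
    print(f"{doc}: {score:.3f}")
```

Without the normalization step, the raw BM25 values would swamp the cosine values at any moderate α, which is the sensitivity the list item warns about.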

In practice teams ship RRF as the baseline or skip straight to a reranker on the union.

The k=60 is from the original Cormack-Clarke-Buettcher paper (2009) and is mostly a smoothing constant: it dampens the contribution of top-1 hits relative to top-10 hits, so a document ranked first in BM25 doesn’t dominate the fusion if it’s ranked 50th in dense. Larger k flattens the contribution curve (every rank matters about equally); smaller k sharpens it (top-1 dominates). In practice the choice between k=10 and k=100 moves NDCG@10 by less than half a point on most retrieval evals, which is why few teams bother to tune it. The constant is robust because RRF is rank-only: the actual score magnitudes never enter the fusion, just the ordering, and that scale-invariance is the point.
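A minimal sketch of RRF and of the effect of k, assuming ranks start at 1 and that a document absent from a list contributes nothing for that list. The document ids and rankings are toy values, and `rrf` and `rrf_score` are illustrative names rather than any library's API.

```python
from collections import defaultdict


def rrf(rankings, k=60):
    """Fuse ranked lists of doc ids with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) per document (rank starts at 1).
    Only the ordering is used; raw retrieval scores never enter.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by doc id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def rrf_score(ranks, k=60):
    """RRF score of one document given its rank in each list."""
    return sum(1.0 / (k + r) for r in ranks)


bm25_ranked = ["d17", "d42", "d81", "d53", "d60"]
dense_ranked = ["d42", "d23", "d17", "d97", "d35"]
fused = rrf([bm25_ranked, dense_ranked], k=60)

# Effect of k: a "spiky" document (1st in one list, 50th in the other)
# versus a "steady" document (5th in both lists).
spiky, steady = [1, 50], [5, 5]
steady_wins_at_60 = rrf_score(steady, k=60) > rrf_score(spiky, k=60)
spiky_wins_at_1 = rrf_score(spiky, k=1) > rrf_score(steady, k=1)
```

With k=60 the steady document outranks the spiky one; with k=1 the single top-1 hit dominates. In the fused list, documents found by both legs (`d42`, `d17`) rise above documents found by only one.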

BM25 and dense embeddings fail in different directions, which is exactly what makes hybrid work. The union recovers more than either alone because the failure modes are uncorrelated, not because the scores are individually better.

Queries where one leg saves the other
  • “GPT-4o-mini-2024-07-18 pricing” — dense embeddings squash the model identifier; BM25 hits it exactly
  • “auto coverage premium” matching docs about “car insurance rates” — BM25 misses; dense catches the synonym
  • “EC2 t3.micro vs t4g.micro” — BM25 anchors on the instance types; dense disambiguates the comparison intent
  • “how do I cancel” against a docs corpus full of “cancellation policy” pages — dense beats BM25 because of the verb-noun shift
  • A French query against an English corpus: the multilingual encoder bridges the two languages; BM25 returns nothing

Why a reranker downstream is still important

Hybrid search improves recall (the relevant document is in the top-N) but doesn’t necessarily improve precision at the top (the most relevant document is at #1). Hybrid surfaces a better candidate set; a reranker reorders that set to put the actually-relevant document first. Production stacks almost always layer a reranker on top of a hybrid first pass.
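The two-stage shape can be sketched as follows. Everything here is a stand-in: the retrievers are hard-coded functions, the reranker is a lookup table of made-up relevance scores, and `hybrid_then_rerank` is a hypothetical name. A real stack would call a search index, a vector store, and a reranking model at the marked points.

```python
def hybrid_then_rerank(query, retrievers, rerank_score,
                       first_pass_n=50, top_k=5, k=60):
    """First pass: RRF over each retriever's ranked list. Second pass: rerank."""
    # 1. Build the candidate set by fusing the ranked lists (recall stage).
    fused = {}
    for retrieve in retrievers:
        for rank, doc_id in enumerate(retrieve(query, first_pass_n), start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    candidates = sorted(fused, key=lambda d: (-fused[d], d))[:first_pass_n]

    # 2. Reorder only those candidates with the reranker (precision stage).
    reranked = sorted(candidates, key=lambda d: (-rerank_score(query, d), d))
    return reranked[:top_k]


# --- Toy stand-ins (assumptions, not real components) ---
def toy_bm25(query, n):
    return ["d17", "d42", "d81"][:n]


def toy_dense(query, n):
    return ["d42", "d23", "d17"][:n]


TOY_RELEVANCE = {"d23": 0.9, "d42": 0.7, "d17": 0.4, "d81": 0.1}


def toy_reranker(query, doc_id):
    # A real reranker would score the (query, document text) pair.
    return TOY_RELEVANCE.get(doc_id, 0.0)


top = hybrid_then_rerank("what is RAG?", [toy_bm25, toy_dense],
                         toy_reranker, top_k=3)
print(top)
```

In this toy run, `d23` is found only by the dense leg and sits third after fusion, but the reranker moves it to first. A document that neither leg retrieved could not be promoted at all.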

Go further

How do I pick the BM25/dense weighting?

Start with reciprocal rank fusion — no weighting needed. Move to a learned linear or logistic combiner only after you have an eval set with at least a few hundred labeled queries; tuning a single α on a small set tends to overfit.

If I add a reranker, do I still need hybrid first-pass?

Yes — the reranker can only reorder what first-pass surfaces. Lexical and dense fail in different ways, so dropping one usually shows up as a recall regression on the rare-token or paraphrased slice of your eval.
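One way to check for that regression is to compute recall per query slice rather than only in aggregate. The sketch below assumes each eval query carries a hand-assigned slice label. The queries, labels, and retrieval results are toy data invented for illustration, not measurements.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant doc ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def recall_by_slice(results, labels, slices, k=10):
    """Mean recall@k per slice.

    results: {query_id: ranked list of doc ids}
    labels:  {query_id: set of relevant doc ids}
    slices:  {query_id: slice name, e.g. "rare_token" or "paraphrase"}
    """
    totals, counts = {}, {}
    for qid, relevant in labels.items():
        name = slices[qid]
        totals[name] = totals.get(name, 0.0) + recall_at_k(
            results.get(qid, []), relevant, k
        )
        counts[name] = counts.get(name, 0) + 1
    return {name: totals[name] / counts[name] for name in totals}


# Toy eval set: one rare-token query, one paraphrased query.
labels = {"q1": {"d17"}, "q2": {"d42"}}
slices = {"q1": "rare_token", "q2": "paraphrase"}

dense_only = {"q1": ["d23", "d97"], "q2": ["d42", "d23"]}
hybrid = {"q1": ["d17", "d23"], "q2": ["d42", "d17"]}

dense_recall = recall_by_slice(dense_only, labels, slices, k=2)
hybrid_recall = recall_by_slice(hybrid, labels, slices, k=2)
print(dense_recall, hybrid_recall)
```

An aggregate recall number can hide this kind of gap when one slice is small, which is why the per-slice breakdown is worth keeping in the eval report.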

How does this generalize across languages?

BM25 needs language-specific tokenization and stemming to work well, while a multilingual embedding model handles cross-lingual queries natively. In multilingual stacks, the dense leg often does most of the work and BM25 catches code, identifiers, and proper nouns.
