BM25

Also known as: Okapi BM25, lexical search, keyword retrieval

TL;DR

BM25 is a classical lexical retrieval algorithm that scores documents by how well their term frequencies match a query, with corrections for document length and rare-term importance.

BM25 (Best Matching 25) is a probabilistic relevance ranking function from the early 1990s. Given a query and a document, it produces a score based on three signals:

Term frequency — does the document contain the query’s words a lot? More is better, but with diminishing returns (saturating).
Inverse document frequency — are the query’s words rare across the corpus? Rare words are more discriminative; common words (“the”, “and”) get downweighted.
Document length normalization — longer documents naturally contain more words, so BM25 normalizes for that.

The full formula is a few lines of code, and reference implementations exist in every search library (Elasticsearch, OpenSearch, Lucene, Tantivy, MeiliSearch, etc).

Why BM25 still matters

Dense embeddings get the headlines, but BM25 has stubborn advantages:

Exact match wins — for queries with rare proper nouns, codes, model numbers, or domain jargon, BM25 finds the document with the literal token, where embeddings often miss.
Out-of-vocabulary robustness — BM25 doesn’t care if it’s never seen a word before; embeddings have to extrapolate from training distribution.
Cheap to compute — inverted index lookups are sub-millisecond even at billions of documents.
No training required — you just index and query.

The hybrid play

Modern production retrieval almost always combines BM25 with dense retrieval — see hybrid search . The two methods make different mistakes (BM25 over-rewards lexical match; embeddings over-reward semantic gist), so combining them with a learned weighting yields better recall than either alone. Then a reranker cleans up the merged candidate set.

Common gotchas

Stopword handling — old BM25 setups strip stopwords aggressively. For some queries that destroys signal (“to be or not to be” becomes empty). Modern setups keep stopwords and rely on IDF to downweight them.
Stemming — too aggressive and you collapse distinct words; too little and “running” doesn’t match “run”. Pick per-language.
Tokenization — your tokenizer choices (CJK, code, identifiers) make a much bigger difference than tuning the BM25 hyperparameters k1 and b.

A neural retriever that can’t find the document containing the literal model number is worse than the dumb decades-old algorithm that can.

The TF saturation term in BM25 is . As tf grows large, this approaches rather than growing linearly. The saturation point is controlled by k1, conventionally set around 1.2-2.0.

The intuition: a document that mentions “Paris” 100 times isn’t 100× as relevant as one that mentions it once. There’s a quick rise in confidence on the first few mentions, then diminishing returns — and beyond a point, more mentions are noise (a glossary, a navigation menu, a long page that happens to contain the term repeatedly). Linear TF would over-reward keyword stuffing; saturated TF damps it.

This is also where BM25 quietly differs from its predecessor TF-IDF, which used raw or log-normalized TF. Robertson’s BM25 reformulation came out of the 2-Poisson model of term occurrence in relevant vs non-relevant documents — the saturation curve falls out of the math when you treat term frequency as a Poisson-like signal. The fact that the resulting formula also matches human intuition about “diminishing returns from repeated mentions” is part of why BM25 has aged well across decades and corpus types.

Both are empirical knobs that emerged from TREC tournaments in the 1990s, when Robertson and colleagues iteratively tuned the formula on retrieval benchmarks. k1 controls the steepness of the TF saturation; b controls how aggressively to penalize long documents.

k1 typical range: 1.2-2.0. Lower values (k1=0.5) saturate fast — the second mention of a term barely adds anything. Higher values (k1=3.0) saturate slowly — TF behaves more linearly. The default 1.2 is a compromise tuned on long-form prose.

b typical range: 0.0-1.0. b=0 disables length normalization entirely, treating long documents the same as short ones. b=1 fully normalizes, dividing by document length. The default 0.75 is the empirical sweet spot — long documents get a discount but not a 100% one.

The honest truth about both knobs: their default values are fine for almost every workload. Tuning them on a held-out set sometimes lifts NDCG by 1-2 points, but rebuilding tokenization, stemming, or the analysis chain typically lifts it by 5-15 points. Hyperparameter tuning is usually procrastination from the harder upstream work.

Three structural reasons. First, exact-match. Embedding models compress documents into fixed-size vectors. Tokens that appeared rarely in training (model numbers, citation identifiers, proprietary code names, internal jargon) get under-represented in the embedding space. BM25 has no such bias — it cares only about the inverted index, not about whether the term appeared in pretraining.

Second, latency at scale. A well-built inverted index serves billion-document corpora at sub-millisecond latency. Dense ANN can match this only with non-trivial engineering — sharding, caching, quantization. For many production systems the operational simplicity of BM25 is a feature, not a limitation.

Third, out-of-domain robustness. Embedding quality degrades when the deployment domain shifts away from the training distribution. BM25’s quality is bounded by the tokenizer, the corpus, and the IDF estimates — none of which suffer from training-distribution drift. A new product launches, a new technical term is coined, BM25 just works the moment you re-index.

The right framing isn’t “BM25 vs neural” — it’s “BM25 plus neural”. Production retrieval is almost always hybrid: BM25 for lexical robustness, dense retrieval for semantic recall, fused with reciprocal rank fusion or a learned weighting, then a reranker on the top-K to clean up.

Go further

Why hasn't a neural model just replaced BM25 outright?

Embedding models still trip over rare exact tokens (model numbers, citations, proper nouns) because those tokens were under-represented in training. BM25's inverted index also runs at sub-millisecond latency over billions of docs, which dense ANN can match only with non-trivial engineering.

Hybrid search Embedding First-pass retrieval

How do BM25 and dense scores get fused into one ranking?

Reciprocal rank fusion is the dominant default — each engine's rank is mapped to 1/(k+rank) and summed. Linear weighting is more accurate but sensitive to score scales since BM25 is unbounded while cosine is in [-1, 1].

Hybrid search end-to-end (playbook) Hybrid search Score calibration

Does a reranker still help if my BM25 is well-tuned?

Yes — BM25 optimizes for lexical overlap, not semantic relevance, so its top-10 ordering is rarely the same as the user's ground truth. A cross-encoder reranker over the BM25 top-100 typically lifts NDCG@10 noticeably, even when first-pass recall is already high.

Reranker Cross-encoder Evaluating a reranker on your own data (playbook)

← All concepts

The best AI teams build with ZeroEntropy models

Book Demo View docs