Three structural reasons. First, exact match. Embedding models compress documents into fixed-size vectors. Tokens that appeared rarely in training (model numbers, citation identifiers, proprietary code names, internal jargon) get under-represented in the embedding space. BM25 has no such bias: it scores on term statistics in the inverted index, so a term matches whether or not it ever appeared in pretraining.
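To make the exact-match point concrete, here is a minimal sketch of BM25 scoring over a toy inverted index. The class name `BM25Index`, the whitespace tokenizer, the sample documents, and the made-up part number `xr-7741b` are all assumptions for illustration; `k1=1.2` and `b=0.75` are common defaults rather than tuned values, and a production engine would use a real analyzer and compressed postings.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Naive lowercase/whitespace tokenizer (an assumption; real systems use an analyzer).
    return text.lower().split()


class BM25Index:
    """Toy inverted index with BM25 scoring (illustrative, not production code)."""

    def __init__(self, docs, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.doc_len = []
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        for doc_id, text in enumerate(docs):
            tokens = tokenize(text)
            self.doc_len.append(len(tokens))
            for term, tf in Counter(tokens).items():
                self.postings[term][doc_id] = tf
        self.n_docs = len(docs)
        self.avg_len = sum(self.doc_len) / self.n_docs

    def idf(self, term):
        # Rare terms get a high weight; the weight depends only on the indexed corpus.
        df = len(self.postings.get(term, {}))
        return math.log(1 + (self.n_docs - df + 0.5) / (df + 0.5))

    def search(self, query, top_k=3):
        scores = defaultdict(float)
        for term in tokenize(query):
            idf = self.idf(term)
            # Only documents in this term's posting list are touched.
            for doc_id, tf in self.postings.get(term, {}).items():
                norm = 1 - self.b + self.b * self.doc_len[doc_id] / self.avg_len
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


docs = [
    "replacement filter for the xr-7741b pump",
    "how to replace a pump filter",
    "pump maintenance schedule and filter guide",
]
index = BM25Index(docs)
results = index.search("xr-7741b filter")  # list of (doc_id, score), best first
```

The identifier occurs in only one document, so its IDF is the highest in the query and that document ranks first. Nothing here depends on whether the token was ever seen by a model during training.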
Second, latency at scale. A well-built inverted index serves billion-document corpora at millisecond-scale latency, because a query only touches the posting lists of its own terms. Dense ANN can match this only with non-trivial engineering: sharding, caching, quantization. For many production systems the operational simplicity of BM25 is a feature, not a limitation.
Third, out-of-domain robustness. Embedding quality degrades when the deployment domain shifts away from the training distribution. BM25’s quality is bounded by the tokenizer, the corpus, and the IDF estimates, none of which suffers from training-distribution drift. When a new product launches or a new technical term is coined, BM25 just works the moment you re-index.
The right framing isn’t “BM25 vs neural” — it’s “BM25 plus neural”. Production retrieval is almost always hybrid: BM25 for lexical robustness, dense retrieval for semantic recall, fused with reciprocal rank fusion or a learned weighting, then a reranker on the top-K to clean up.
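The fusion step can be sketched in a few lines. Below is a minimal reciprocal rank fusion, assuming each retriever returns an ordered list of document IDs; the function name, the sample IDs, and the two input rankings are hypothetical, and `k=60` is the constant commonly used with RRF rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs; each list is ordered best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns 1 / (k + rank) from every list it appears in.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


# Hypothetical outputs of a lexical and a dense retriever for the same query.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
# fused == ["doc_b", "doc_a", "doc_d", "doc_c"]
```

RRF uses only ranks, so BM25 scores and cosine similarities never need to be put on a common scale. A learned weighting would replace the fixed `1 / (k + rank)` contribution, and the reranker would then rescore the head of `fused`.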