Sparse Retrieval

Also known as: lexical retrieval, learned sparse retrieval, SPLADE, uniCOIL

TL;DR

Sparse retrieval is the family of methods that represent queries and documents as high-dimensional sparse vectors over a vocabulary — including BM25 and modern learned-sparse models like SPLADE and uniCOIL.

Sparse retrieval is the umbrella for retrieval methods whose query and document representations are sparse vectors over a vocabulary — most entries zero, a few non-zero entries weighted by some scoring function. The dimensionality is the vocabulary size (tens to hundreds of thousands), but only the few terms actually present have non-zero weight.

SPARSE RETRIEVALTerms, postings, scores, rank.1 · QUERYsplit into terms2 · POSTING LISTSterm → docs that contain it3 · ACCUMULATEsum tf · idf per doc4 · TOP-Ksort by scorepasswordidf0.82resetidf0.55D14tf×3D27tf×1D55tf×1|posting| = 3D14tf×2D27tf×2|posting| = 2D14Resetting a forgotten pa…pwdrstΣ1.86D27Account recovery FAQpwdrstΣ1.43D55Password complexity policypwdrstΣ0.75Inverted index: most documents contribute zero, so the work scales with posting-list length, not corpus size.TOKENIZE · LOOKUP · SCORE · RANK

This is the natural representation for an : each non-zero dimension corresponds to a posting in some term’s list, and the score is a dot product between the sparse query vector and the sparse document vector — which the inverted index computes by intersecting postings lists.

The classical members

and are sparse retrieval. The non-zero entries are the document’s terms, weighted by some function of term frequency, document frequency, and length. The weights are derived from corpus statistics — no learning involved.

This is the engine behind Elasticsearch, OpenSearch, Lucene, Tantivy, MeiliSearch, and every “keyword search” you’ve ever used.

Learned-sparse models

The interesting recent development is learned sparse retrieval: put a transformer encoder in front, and use it to produce sparse vectors that don’t directly mirror the document’s surface tokens. The leading examples:

Learned-sparse families
  • SPLADE — sparse vector over BERT vocabulary, max-pooled and sparsified. Strongest learned-sparse model on BEIR.
  • uniCOIL — single weight per term; index-friendly, fast at serve time.
  • DeepImpact — predicts impact scores directly; trained for top-k retrieval.
  • TILDE / TILDEv2 — ultra-fast inference via document-side expansion only.
  • BGE-M3 sparse head — multilingual learned-sparse alongside dense and ColBERT-style heads in a single model.

The killer feature: term expansion. A document about authentication that never says the word “password” can still get a non-zero “password” entry from the model, so a “password reset” query reaches it via the inverted index. This closes a chunk of the gap with dense retrieval while keeping the speed and exact-match strength of sparse.

Learned sparse retrieval keeps the inverted index unchanged. Only the postings get smarter — and that’s enough to close most of the gap with dense retrieval on technical corpora.

Why sparse hasn’t been replaced

is the headline technology, but sparse keeps earning its place for three reasons:

  1. Rare-token recall. Embedding models systematically underweight tokens they saw rarely in training. Sparse retrieval has no such bias — a SKU is just a token like any other.
  2. Latency and memory. Inverted indexes serve sub-millisecond queries with mature compression; vector ANN is faster than it used to be but still typically more expensive per query.
  3. Interpretability. A sparse score decomposes into “this term contributed X” — easy to debug. A dense cosine is a black box.

Where sparse loses

  • Paraphrases without overlap. “Auto insurance” vs “car coverage” share zero terms, and pre-learned-sparse retrieval can’t bridge them. SPLADE’s term expansion mostly fixes this; classical BM25 doesn’t.
  • Cross-lingual. A French query against an English corpus is hopeless without translation; multilingual dense models handle it natively. See .
  • Long-tail meaning. Negation, scope, and conjunction live in semantic space, not term space.

The standard recipe

Run sparse and dense in parallel, fuse with reciprocal rank fusion or a learned weighting, then rerank the union. That’s — and the reason production retrieval has settled there is that sparse and dense fail on disjoint slices of queries.

The arithmetic of the inverted index is built around the assumption that most document-query pairs have zero contribution. For each query term, you walk the posting list of documents that contain that term, accumulate a score, and prune aggressively (WAND, BMW, MaxScore algorithms) using the fact that documents not in any query term’s posting list contribute nothing.

Dense vectors break this. Every dimension is non-zero for every document, so there is no “skip this document because it has no overlap” pruning. Every query touches every document — which is why dense retrieval requires ANN data structures (HNSW, IVF, product quantization) instead of inverted indexes. The data structures solve different problems: inverted indexes exploit sparsity for sub-millisecond exact retrieval; ANN graphs exploit geometric locality for sub-millisecond approximate retrieval. There is no unified data structure that does both well — which is why production hybrid pipelines run two indexes side by side.

Three regimes where BM25 alone wins:

  1. Corpora with strong exact-match priors. Legal citation lookup, package-name search in code, SKU lookup in e-commerce. The model adds noise; BM25’s term-exact scoring is exactly what you want.

  2. Tiny corpora (under 10K documents). The cost of training a dense or learned-sparse model exceeds the gain, and BM25’s zero-training-cost makes it the dominant choice.

  3. Cold-start systems. No labeled training data available; BM25 ships day-one with reasonable quality. By the time you have enough usage data to train an embedder, you have enough usage data to evaluate whether the upgrade is worth it.

For everything else — and that is almost everything else in production RAG — hybrid sparse plus dense plus reranker is the default, and arguing about which sparse method is “best” matters less than getting a strong cross-encoder reranker on top.

Go further

How is learned-sparse different from BM25?

BM25 weights are derived from corpus statistics. Learned-sparse models (SPLADE, uniCOIL) put a transformer in front and produce learned term weights — and crucially, term expansions: a query for 'reset password' gets expanded with 'forgot', 'login', 'credentials' before hitting the index. The inverted-index machinery is unchanged; only the postings get smarter.

When does sparse beat dense?

Whenever the query depends on a token the embedder underrepresents — model numbers, SKUs, gene names, legal citations, package names, anything rare and exact. Embedding models squash rare tokens into a generic region of vector space; sparse retrieval treats them as first-class. The gap shows up most on technical and code corpora.

Should I just run hybrid and stop thinking about it?

Yes for most production stacks. The two methods fail in different ways and the union is robust. The only case where you'd run sparse-only is when latency or memory budget rules out a vector index — and even then, learned-sparse with expansions narrows the gap to dense considerably.

ZeroEntropy
The best AI teams build with ZeroEntropy models
Follow us on
GitHubTwitterSlackLinkedInDiscord