Reciprocal Rank Fusion

Also known as: RRF, rank fusion

TL;DR

Reciprocal rank fusion (RRF) is the boring, parameter-free way to merge multiple ranked lists into one. Sum $1/(k + \text{rank})$ across lists with $k = 60$, and you have the default fusion method in production hybrid-search stacks.

Reciprocal rank fusion (RRF) merges several ranked lists into one. For a document $d$ that appears in lists indexed by $i$ at rank $r_i(d)$:

$$\mathrm{RRF}(d) = \sum_i \frac{1}{k + r_i(d)}$$

The canonical constant is $k = 60$, from Cormack, Clarke, and Buettcher (2009). Sort all candidates by their RRF score, descending. That’s the entire algorithm.
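The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `rrf_fuse`, the document ids, and the tie-breaking rule (sorting tied documents by their string id) are assumptions made for the example. It also assumes ranks are 1-based and that each input list contains a document at most once.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Sequence


def rrf_fuse(
    ranked_lists: Iterable[Sequence[Hashable]], k: int = 60
) -> list[tuple[Hashable, float]]:
    """Fuse ranked lists of document ids with reciprocal rank fusion."""
    scores: dict[Hashable, float] = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks are 1-based: the first document in a list has rank 1.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first. Ties are broken on the string form of the id
    # (an assumption of this sketch, so that the output is deterministic).
    return sorted(scores.items(), key=lambda item: (-item[1], str(item[0])))


# Hypothetical results from a lexical leg and a dense leg.
bm25 = ["d1", "d2", "d3"]
dense = ["d3", "d1", "d4"]

fused = rrf_fuse([bm25, dense])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

A document missing from a list simply contributes nothing for that list, which is why no padding or default rank is needed.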

The intuition

Two properties fall out of the shape:

High-rank documents in any single list dominate. With $k = 60$, a document at rank 1 in BM25 contributes $1/61 \approx 0.0164$, a document at rank 10 contributes $1/70 \approx 0.0143$, and a document at rank 100 contributes $1/160 \approx 0.0063$. The decay is slow but real: top-3 hits matter most, and hits far down a list add little on their own.
Ties between lists fall to the document with the best worst-rank. If document $A$ is rank 1 in list 1 and rank 100 in list 2, while document $B$ is rank 5 in both, then $B$ wins. The fusion rewards consistency across legs more than excellence in any single leg.
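To make the second property concrete, here is the arithmetic for one such case, assuming $k = 60$, a document $A$ at rank 1 in one list and rank 100 in the other, and a document $B$ at rank 5 in both. The labels $A$ and $B$ are just names for this worked example.

```latex
\begin{align*}
\mathrm{RRF}(A) &= \frac{1}{60 + 1} + \frac{1}{60 + 100}
                 = \frac{1}{61} + \frac{1}{160} \approx 0.0164 + 0.0063 = 0.0226 \\
\mathrm{RRF}(B) &= \frac{1}{60 + 5} + \frac{1}{60 + 5}
                 = \frac{2}{65} \approx 0.0308
\end{align*}
```

$B$ outscores $A$ by roughly a third, even though $A$ holds the single best rank.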

That second property is the one that earns RRF its place: it’s exactly what you want from hybrid search, where a document showing up in both the lexical and the dense leg is stronger evidence of relevance than a document that’s #1 in one and absent from the other.

Why it’s the production default

No score normalization. BM25 scores aren’t bounded; dense cosines are in $[-1, 1]$; SPLADE scores have their own scale. RRF ignores all of this: the order is the only signal.
No tunable parameters. $k = 60$ is the answer for almost every stack. Nothing to fit on a held-out set, nothing to overfit, nothing to drift when your retrievers change.
No fragility to score-distribution shift. Re-train your dense retriever and the score scale moves; under linear fusion you’d need to re-tune $\alpha$. Under RRF, nothing changes as long as the ordering is the same.
Reproducible and auditable. Two people implementing RRF independently get the same ranking, up to how they break ties. Linear combination implementations diverge on normalization details.

What RRF gives up, and why that's a feature

RRF wins because it gives up the right things. Rank-only fusion throws away score magnitudes, and score magnitudes are exactly the noisy, leg-specific, hard-to-calibrate signal you don’t want fusion to depend on.
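The scale-invariance claim is easy to check with a small experiment. The sketch below uses made-up scores and hypothetical helper names (`rank_by_score`, `rrf`); the "retrained" dense leg is simulated by an arbitrary increasing affine transform of the original scores, which changes every magnitude but not the ordering.

```python
def rank_by_score(scores: dict[str, float]) -> list[str]:
    """Turn a {doc: score} map into a ranking, best first (ties broken by id)."""
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Return document ids ordered by reciprocal rank fusion score."""
    fused: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda doc: (-fused[doc], doc))


# Made-up scores: BM25 is unbounded, cosine similarity is not.
bm25_scores = {"a": 12.4, "b": 9.1, "c": 3.3}
dense_scores = {"b": 0.82, "c": 0.80, "a": 0.10}

before = rrf([rank_by_score(bm25_scores), rank_by_score(dense_scores)])

# Simulate a retrained dense leg whose scores live on a different scale.
rescaled = {doc: 50.0 * s + 7.0 for doc, s in dense_scores.items()}
after = rrf([rank_by_score(bm25_scores), rank_by_score(rescaled)])

print(before, after, before == after)
```

Any strictly increasing transform of a leg's scores leaves its ranking, and therefore the fused output, unchanged. A retrain that reorders documents will of course change the result.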

Compared to the alternatives

Linear combination ($\alpha \cdot s_{\text{lexical}} + (1 - \alpha) \cdot s_{\text{dense}}$). Requires score normalization (min-max, z-score, or sigmoid) and a tuned $\alpha$. Sensitive to outliers in either leg’s score distribution. Fragile across deployments.
CombSUM and CombMNZ. Sum normalized scores (CombSUM), or multiply that sum by the number of legs that gave the document a non-zero score (CombMNZ). Both predate RRF, share the same fragility to score scale, and are rarely used today.
Learned fusion. Train a small model on (query, doc, leg-features) → relevance. More accurate when you have enough labels, but it needs more infrastructure and can overfit. The next step up from RRF, not the next-next.
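For comparison, the score-based alternatives can be sketched as follows. This is an illustrative sketch under stated assumptions: min-max normalization is used for every method, a leg with constant scores maps to 1.0, a document missing from a leg counts as 0 in the linear combination, and every leg is non-empty. The function names and the scores are made up for the example.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale one leg's scores to [0, 1]. Assumes a non-empty leg."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}  # assumption: constant leg maps to 1.0
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def comb_sum(legs: list[dict[str, float]]) -> dict[str, float]:
    """CombSUM: sum of normalized scores across legs."""
    fused: dict[str, float] = {}
    for leg in legs:
        for doc, s in min_max(leg).items():
            fused[doc] = fused.get(doc, 0.0) + s
    return fused


def comb_mnz(legs: list[dict[str, float]]) -> dict[str, float]:
    """CombMNZ: CombSUM times the number of legs with a non-zero normalized score."""
    sums = comb_sum(legs)
    hits = {doc: 0 for doc in sums}
    for leg in legs:
        for doc, s in min_max(leg).items():
            if s > 0:
                hits[doc] += 1
    return {doc: sums[doc] * hits[doc] for doc in sums}


def linear_combination(
    lexical: dict[str, float], dense: dict[str, float], alpha: float
) -> dict[str, float]:
    """alpha * lexical + (1 - alpha) * dense on normalized scores."""
    lex_n, dense_n = min_max(lexical), min_max(dense)
    docs = lex_n.keys() | dense_n.keys()
    return {
        doc: alpha * lex_n.get(doc, 0.0) + (1 - alpha) * dense_n.get(doc, 0.0)
        for doc in docs
    }


lexical = {"a": 10.0, "b": 5.0, "c": 0.0}
dense = {"b": 0.9, "c": 0.8, "d": 0.1}

print(comb_sum([lexical, dense]))
print(comb_mnz([lexical, dense]))
print(linear_combination(lexical, dense, alpha=0.5))
```

Every one of these depends on a normalization choice and, for the linear version, on `alpha`. Those are exactly the knobs RRF removes.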

Where RRF doesn’t help

When all your legs are noisy in the same direction (for example, two dense retrievers fine-tuned on the same upstream data), fusion barely improves recall. RRF assumes the failure modes are uncorrelated; if they’re not, you’re paying double the cost for marginal gain. The diagnostic is the per-query overlap between the legs’ top results: if both legs surface the same top-10 for most queries, fusion is structurally redundant.
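One way to run that diagnostic is to measure how much two legs' top-K sets overlap, averaged over queries. The sketch below uses Jaccard overlap, which is a choice made for this example rather than a metric prescribed by the text; the function names, query ids, and document ids are hypothetical.

```python
def top_k_overlap(leg_a: list[str], leg_b: list[str], k: int = 10) -> float:
    """Jaccard overlap between the top-k document sets of two rankings."""
    a, b = set(leg_a[:k]), set(leg_b[:k])
    if not a and not b:
        return 1.0  # assumption: two empty result lists count as identical
    return len(a & b) / len(a | b)


def mean_top_k_overlap(
    runs_a: dict[str, list[str]], runs_b: dict[str, list[str]], k: int = 10
) -> float:
    """Average top-k overlap over the queries present in both runs."""
    queries = sorted(runs_a.keys() & runs_b.keys())
    if not queries:
        raise ValueError("the two runs share no queries")
    return sum(top_k_overlap(runs_a[q], runs_b[q], k) for q in queries) / len(queries)


# Hypothetical per-query rankings from two dense retrievers.
dense_v1 = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5"]}
dense_v2 = {"q1": ["d3", "d1", "d2"], "q2": ["d5", "d6"]}

overlap = mean_top_k_overlap(dense_v1, dense_v2, k=10)
print(round(overlap, 3))
```

A mean overlap close to 1.0 suggests the second leg is mostly re-surfacing what the first already found. Where to set the cutoff is a judgment call that depends on the corpus and the cost of running the extra leg.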

Rewrite the contribution of a document at rank $r$ as $\frac{1}{k + r} = \frac{1}{k} \cdot \frac{1}{1 + r/k}$. As $k \to \infty$, the curve flattens: every rank contributes roughly $1/k$, so a document at rank 1 and a document at rank 1000 are barely distinguishable. As $k \to 0$, the curve sharpens to $1/r$: pure reciprocal rank, where rank 1 absolutely dominates rank 2 (a 2× ratio).

$k = 60$ sits in the middle. The ratio between rank 1 and rank 10 is $70/61 \approx 1.15$ (a 15% premium for being best); the ratio between rank 1 and rank 100 is $160/61 \approx 2.62$. That’s a soft but persistent gradient: enough to reward top hits, gentle enough that a single rogue list with the right document at rank 50 can still get that document into the fused top-K.
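The effect of the constant is easy to tabulate. The short script below prints the rank-1-to-rank-10 and rank-1-to-rank-100 contribution ratios for a few values of $k$. Only $k = 60$ comes from the text; the other values are arbitrary points chosen to show the trend, and the function names are made up for the example.

```python
def contribution(rank: int, k: float) -> float:
    """Contribution of a single list to a document's RRF score."""
    return 1.0 / (k + rank)


def ratio(rank_a: int, rank_b: int, k: float) -> float:
    """How many times more rank_a contributes than rank_b."""
    return contribution(rank_a, k) / contribution(rank_b, k)


# k = 60 is the canonical value; the others are illustrative only.
for k in (1, 10, 60, 1000):
    print(
        f"k={k:>4}  rank1/rank10={ratio(1, 10, k):6.2f}  "
        f"rank1/rank100={ratio(1, 100, k):6.2f}"
    )
```

Small $k$ makes the top rank worth many times a mid-list hit, while large $k$ pushes both ratios toward 1.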

The choice is not principled. It’s empirical, robust, and parameter-free. That trio is why nobody has displaced it in fifteen years.

RRF is not limited to two lists: the sum is over arbitrarily many ranked lists. Production stacks routinely fuse three or more: BM25 + dense + a learned sparse leg like SPLADE, or hybrid + a domain-specific structured-search leg. Each leg contributes independently, and because the score is a plain sum over legs, the order in which the legs are added does not matter.

The diminishing-returns intuition: each new leg helps if its top-K mostly contains documents the existing legs ranked below their own top-K. A fourth dense leg trained on the same upstream data adds nearly nothing. A leg over a structured field (titles, code identifiers) typically adds the most because its failure mode is the most distinct.

Go further

Why $k = 60$, and does the constant matter?

The value 60 comes from Cormack, Clarke, and Buettcher (2009) and acts as a smoothing constant. Larger $k$ flattens the contribution curve so top-1 doesn't dominate when it's an outlier in the other list; smaller $k$ sharpens it. NDCG@10 typically moves less than half a point when $k$ is varied around that default, which is why nobody tunes it. The fusion is rank-only, so magnitudes never enter, and that scale-invariance is the whole reason RRF travels.

Related: Hybrid search, MRR

Why not just normalize scores and add them?

Linear combination requires picking a weight $\alpha$ and a normalization scheme, both of which interact with the score distributions of every leg. BM25 scores are unbounded; cosine similarities live in $[-1, 1]$; learned rerankers can output anything. Tune $\alpha$ on a small eval and you overfit; ship with min-max normalization and one outlier query collapses the leg. RRF sidesteps the entire fragility surface by fusing on ranks alone.
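The min-max failure mode can be seen with a handful of made-up numbers. In the sketch below, the scores and the helper name `min_max` are hypothetical, and the helper assumes the scores are not all equal. The second list is the first with one outlier score prepended.

```python
def min_max(scores: list[float]) -> list[float]:
    """Rescale scores to [0, 1]. Assumes max(scores) != min(scores)."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]


typical = [12.0, 11.0, 10.0, 9.0]
with_outlier = [95.0, 12.0, 11.0, 10.0, 9.0]  # one runaway BM25 score

normal = min_max(typical)
squashed = min_max(with_outlier)

print([round(x, 3) for x in normal])
print([round(x, 3) for x in squashed])
```

After normalization, every document except the outlier lands close to zero, so this leg contributes almost nothing to a weighted sum for those documents. Their relative order within the leg has not changed, so under rank-based fusion each one only moves down by a single position.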

Related: Score calibration, BM25

When should I move past RRF?

When you have a held-out eval with a few hundred labeled queries and a real signal that fusion (not first-pass recall, not the reranker) is the bottleneck. Then a learned listwise fusion, a cascade reranker, or just running a strong cross-encoder on the union starts to pay. Until you hit that bar, RRF is the boring answer that's also the right one.

Related: Cascade rerankers, Listwise reranking
