Reciprocal Rank Fusion

Also known as: RRF, rank fusion

TL;DR

Reciprocal rank fusion (RRF) is the boring, parameter-free way to merge multiple ranked lists into one. Sum $1/(k + \text{rank})$ across lists with $k = 60$, and you have the default fusion method in production hybrid-search stacks.

[Figure: Reciprocal rank fusion. "Consistency across lists beats #1 in any single one." The BM25 (lexical, sparse) retriever returns d_A, d_C, d_E, d_B, d_D; the dense (semantic, embedding) retriever returns d_F, d_C, d_G, d_E, d_H. A document at rank r contributes 1/(60 + r) per list. Fused ranking by summed RRF score: d_C 0.0323, d_E 0.0315, d_A 0.0164, d_F 0.0164, d_G 0.0159, d_B 0.0156, d_D 0.0154, d_H 0.0154. d_C wins: never #1 in either list, but consistent in both.]

Two parallel retrievers each return a ranked list of five documents.

Reciprocal rank fusion (RRF) merges several ranked lists into one. For a document $d$ that appears in lists indexed by $i$ at rank $\text{rank}_i(d)$:

$$\text{RRF}(d) = \sum_i \frac{1}{k + \text{rank}_i(d)}$$

The canonical constant is $k = 60$, from Cormack, Clarke, and Buettcher (2009). Sort all candidates by their RRF score, descending. That’s the entire algorithm.
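A minimal sketch of that algorithm in Python, run on the two five-document lists from the figure. The function name `rrf_fuse` and the tie-break on document id are choices made here for illustration; the source does not specify how ties are broken.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists of doc ids into [(doc_id, score)], best first."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks are 1-based: the first document in a list has rank 1.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by doc id (an assumption)
    # so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25 = ["d_A", "d_C", "d_E", "d_B", "d_D"]
dense = ["d_F", "d_C", "d_G", "d_E", "d_H"]

fused = rrf_fuse([bm25, dense])
for doc_id, score in fused:
    print(f"{doc_id}  {score:.4f}")
```

With these inputs d_C comes out on top at 2/62 ≈ 0.0323, followed by d_E, matching the fused ranking in the figure.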

The intuition

Two properties fall out of the shape:

  1. High-rank documents in any single list dominate. A document at rank 1 in BM25 contributes $1/61 \approx 0.0164$, a document at rank 10 contributes $1/70 \approx 0.0143$, and a document at rank 100 contributes $1/160 = 0.00625$. The decay is slow but real: top-3 hits matter, top-50 hits are mostly noise.
  2. Ties between lists fall to the document with the best worst-rank. If document $A$ is rank 1 in list $L_1$ and rank 100 in list $L_2$, while document $B$ is rank 5 in both, then $B$ wins. The fusion rewards consistency across legs more than excellence in any single leg.

That second property is the one that earns RRF its place: it’s exactly what you want from hybrid search, where a document showing up in both the lexical and the dense leg is stronger evidence of relevance than a document that’s #1 in one and absent from the other.
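As a worked check of the consistency property, take $k = 60$ and two lists. One document sits at rank 1 in the first list and rank 100 in the second; another sits at rank 5 in both. The labels "spiky" and "steady" are introduced here only for illustration.

```latex
\begin{aligned}
\text{RRF}(\text{spiky})  &= \frac{1}{60 + 1} + \frac{1}{60 + 100}
                           \approx 0.01639 + 0.00625 = 0.02264 \\
\text{RRF}(\text{steady}) &= \frac{1}{60 + 5} + \frac{1}{60 + 5}
                           = \frac{2}{65} \approx 0.03077
\end{aligned}
```

The steady document scores about 36% higher, even though it is never ranked first anywhere.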

Why it’s the production default

What RRF gives up — and why that's a feature
  • No score normalization. BM25 scores aren’t bounded; cosines are in $[-1, 1]$; SPLADE scores have their own scale. RRF ignores all of this: the order is the only signal.
  • No tunable parameters. $k = 60$ is the answer for almost every stack. Nothing to fit on a held-out set, nothing to overfit, nothing to drift when your retrievers change.
  • No fragility to score-distribution shift. Re-train your dense retriever and the score scale moves; under linear fusion you’d need to re-tune the fusion weight $\alpha$. Under RRF, nothing changes as long as the ordering is the same.
  • Reproducible and auditable. Two people implementing RRF independently get the same ranking, given the same tie-breaking rule. Linear combination implementations diverge on normalization details.

RRF wins because it gives up the right things. Rank-only fusion throws away score magnitudes — and score magnitudes are exactly the noisy, leg-specific, hard-to-calibrate signal you don’t want fusion to depend on.
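The scale-invariance claim can be checked directly. The sketch below uses made-up scores and a hypothetical rescaling (`10 * s - 3`) to stand in for a retrained dense leg whose score scale moved while its ordering stayed the same; the helper names are illustrative, not from any library.

```python
def ranks_from_scores(scores):
    """Map doc id -> 1-based rank, highest score first."""
    ordered = sorted(scores, key=lambda doc: -scores[doc])
    return {doc: rank for rank, doc in enumerate(ordered, start=1)}


def rrf(score_dicts, k=60):
    """RRF over legs given as {doc_id: raw_score}; only the order is used."""
    fused = {}
    for scores in score_dicts:
        for doc, rank in ranks_from_scores(scores).items():
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (k + rank)
    return fused


# Made-up scores: an unbounded lexical leg and a cosine-like dense leg.
bm25 = {"d1": 12.4, "d2": 9.1, "d3": 3.3}
dense = {"d2": 0.83, "d3": 0.80, "d1": 0.41}

before = rrf([bm25, dense])

# Hypothetical retrain: the scale shifts, the ordering does not.
dense_rescaled = {doc: 10.0 * s - 3.0 for doc, s in dense.items()}
after = rrf([bm25, dense_rescaled])

print(before == after)
```

Any strictly increasing transform of a leg's scores leaves the fused result untouched, because the raw scores are discarded as soon as ranks are computed.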

Compared to the alternatives

  • Linear combination ($\alpha \cdot s_{\text{lexical}} + (1 - \alpha) \cdot s_{\text{dense}}$). Requires score normalization (min-max, z-score, or sigmoid) and a tuned weight $\alpha$. Sensitive to outliers in either leg’s score distribution. Fragile across deployments.
  • CombSUM and CombMNZ. Sum normalized scores (CombSUM) or sum-times-count-of-non-zero (CombMNZ). Predate RRF; same fragility to score scale; rarely used today.
  • Learned fusion. Train a small model on (query, doc, leg-features) → relevance. More accurate when you have enough labels; more infra; can over-fit. The next step up from RRF, not the next-next.
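To make the outlier sensitivity of linear combination concrete, here is a small sketch using min-max normalization and an equal weight. The scores are invented, the weight name `alpha` is an assumption, and `linear_fuse` is a hypothetical helper rather than any library's API. The only change between the two runs is the magnitude of the top lexical score; the lexical ordering is identical, so rank-only fusion would see no difference.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant leg maps to all zeros."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}


def linear_fuse(lexical, dense, alpha=0.5):
    """alpha * lexical + (1 - alpha) * dense on min-max normalized scores."""
    lex_n, den_n = min_max(lexical), min_max(dense)
    docs = set(lex_n) | set(den_n)
    return {
        d: alpha * lex_n.get(d, 0.0) + (1.0 - alpha) * den_n.get(d, 0.0)
        for d in docs
    }


lexical = {"d1": 10.0, "d2": 9.0, "d3": 2.0}
dense = {"d1": 0.50, "d2": 0.70, "d3": 0.72}

base = linear_fuse(lexical, dense)

# Same lexical order (d1 > d2 > d3), but d1 is now an outlier in magnitude.
lexical_outlier = {"d1": 100.0, "d2": 9.0, "d3": 2.0}
shifted = linear_fuse(lexical_outlier, dense)

print(max(base, key=base.get))
print(sorted(shifted, key=shifted.get))
```

In the first run d2 is the fused winner; in the second it drops to last, because min-max squeezes every non-outlier lexical score toward zero.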

Where RRF doesn’t help

When all your legs are noisy in the same direction — e.g., two dense retrievers fine-tuned on the same upstream data — fusion barely improves recall. RRF assumes the failure modes are uncorrelated; if they’re not, you’re paying double-cost for marginal gain. The diagnostic is the per-leg overlap: if both legs surface the same top-10 for most queries, fusion is structurally redundant.
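One way to run the per-leg overlap diagnostic is sketched below. It assumes each leg's results are stored as a mapping from query id to a ranked list of document ids, and it uses Jaccard overlap of the top-k sets as the measure; both the data layout and the choice of Jaccard are assumptions made here, not something the text prescribes.

```python
def top_k_overlap(leg_a, leg_b, k=10):
    """Jaccard overlap of two legs' top-k doc ids for a single query."""
    a, b = set(leg_a[:k]), set(leg_b[:k])
    union = a | b
    # Two empty result lists share nothing to fuse; report 0.0.
    return len(a & b) / len(union) if union else 0.0


def mean_overlap(runs_a, runs_b, k=10):
    """Average top-k overlap over the queries both legs answered."""
    shared = sorted(runs_a.keys() & runs_b.keys())
    if not shared:
        raise ValueError("the two runs have no queries in common")
    return sum(top_k_overlap(runs_a[q], runs_b[q], k) for q in shared) / len(shared)


# Toy runs: query id -> ranked doc ids.
runs_a = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}
runs_b = {"q1": ["a", "b", "c"], "q2": ["x", "w", "v"]}

print(mean_overlap(runs_a, runs_b, k=3))
```

A mean overlap close to 1.0 across a representative query set suggests the second leg is mostly redundant; what counts as "close" is a judgment call for your own data.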

To see what the constant $k$ does, rewrite the per-list contribution as $1/(k + r)$ for a document at rank $r$. As $k \to \infty$, the curve flattens: every rank contributes roughly $1/k$, so a document at rank 1 and a document at rank 1000 are barely distinguishable. As $k \to 0$, the curve sharpens to $1/r$: pure reciprocal rank, where rank 1 absolutely dominates rank 2 (a 2× ratio).

$k = 60$ sits in the middle. The ratio between rank 1 and rank 10 is $70/61 \approx 1.15$ (a 15% premium for being best); the ratio between rank 1 and rank 100 is $160/61 \approx 2.62$. That’s a soft but persistent gradient: enough to reward top hits, gentle enough that a single rogue list with the right document at rank 50 still gets that document into the fused top-K.

The choice is not principled. It’s empirical, robust, and parameter-free. That trio is why nobody has displaced it in fifteen years.
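The effect of the constant is easy to tabulate. This sketch prints how much more a rank-1 hit contributes than hits at ranks 2, 10, and 100, for a handful of values of the constant; the particular values swept are arbitrary choices for illustration.

```python
def contribution(rank, k):
    """Per-list RRF contribution 1 / (k + rank) for a 1-based rank."""
    return 1.0 / (k + rank)


def ratio(rank_a, rank_b, k):
    """How many times more rank_a contributes than rank_b."""
    return contribution(rank_a, k) / contribution(rank_b, k)


for k in (0, 10, 60, 1000):
    print(
        f"k={k:<5}"
        f" r1/r2={ratio(1, 2, k):.3f}"
        f" r1/r10={ratio(1, 10, k):.3f}"
        f" r1/r100={ratio(1, 100, k):.3f}"
    )
```

At 60 the three ratios come out to roughly 1.016, 1.148, and 2.623; at 0 they are exactly 2, 10, and 100, which is plain reciprocal rank.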

RRF also extends beyond two lists: the sum is over arbitrarily many ranked lists. Production stacks routinely fuse three or more: BM25 + dense + a learned sparse leg like SPLADE, or hybrid + a domain-specific structured-search leg. Each leg contributes independently, and because the score is a plain sum, the order in which legs are added does not matter.

The diminishing-returns intuition: each new leg helps if its top-K mostly contains documents the existing legs ranked below their own top-K. A fourth dense leg trained on the same upstream data adds nearly nothing. A leg over a structured field (titles, code identifiers) typically adds the most because its failure mode is the most distinct.
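A rough way to apply that intuition before adding a leg is to count how many documents in its top-K are absent from every existing leg's top-K. The sketch below does this for one query; the function name, the toy lists, and the leg names are all hypothetical.

```python
def novel_in_top_k(new_leg, existing_legs, k=10):
    """Doc ids in new_leg's top-k that no existing leg has in its own top-k."""
    seen = set()
    for leg in existing_legs:
        seen.update(leg[:k])
    return [doc for doc in new_leg[:k] if doc not in seen]


# Toy ranked lists for a single query.
bm25 = ["a", "b", "c"]
dense = ["b", "d", "a"]
second_dense = ["b", "a", "d"]   # stands in for a leg with correlated failures
titles = ["e", "a", "f"]         # stands in for a structured-field leg

print(novel_in_top_k(second_dense, [bm25, dense], k=3))
print(novel_in_top_k(titles, [bm25, dense], k=3))
```

Novel is not the same as relevant: this count only says whether a leg can change the candidate pool, so it is a screening step before measuring recall on labeled queries.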

Go further

Why $k = 60$, and does the constant matter?

The 60 is from Cormack-Clarke-Buettcher (2009) and is a smoothing constant. Larger $k$ flattens the contribution curve so top-1 doesn't dominate when it's an outlier in the other list; smaller $k$ sharpens it. NDCG@10 typically moves less than half a point as $k$ varies, which is why nobody tunes it. The fusion is rank-only (magnitudes never enter), and that scale-invariance is the whole reason RRF travels.

Why not just normalize scores and add them?

Linear combination requires picking a weight and a normalization scheme, both of which interact with the score distributions of every leg. BM25 scores are unbounded; cosine similarities live in $[-1, 1]$; learned rerankers can output anything. Tune on a small eval and you overfit; ship with min-max normalization and one outlier query collapses the leg. RRF sidesteps the entire fragility surface by fusing on ranks alone.

When should I move past RRF?

When you have a held-out eval with a few hundred labeled queries and a real signal that fusion (not first-pass recall, not the reranker) is the bottleneck. Then a learned listwise fusion, a cascade reranker, or just running a strong cross-encoder on the union starts to pay. Until you hit that bar, RRF is the boring answer that's also the right one.
