BEIR Benchmark

Also known as: BEIR, Benchmarking Information Retrieval

TL;DR

BEIR is a heterogeneous benchmark of 18 retrieval datasets across domains — biomedical, news, finance, scientific QA, fact-checking — designed to test zero-shot retrieval. The standard reference for whether a retriever generalizes beyond MS MARCO.

BEIR (Benchmarking-IR) is the de facto standard for zero-shot retrieval evaluation. Released by Thakur et al. in 2021, it bundles 18 heterogeneous public datasets (Natural Questions, FiQA, SciFact, TREC-COVID, NFCorpus, ArguAna, and others) into a single evaluation harness. Models are trained on one source (usually MS MARCO) and tested zero-shot on the rest. The headline number is mean NDCG@10 across the suite.
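A minimal sketch of how a suite-level headline number can be assembled, assuming per-dataset NDCG@10 scores are already computed and that the mean is unweighted, so every dataset counts equally regardless of how many queries it has. The function name `beir_headline` and the scores are illustrative placeholders, not real results.

```python
from statistics import mean

def beir_headline(per_dataset_ndcg10: dict[str, float]) -> float:
    """Unweighted (macro) mean of per-dataset NDCG@10 scores."""
    if not per_dataset_ndcg10:
        raise ValueError("need at least one dataset score")
    return mean(per_dataset_ndcg10.values())

# Placeholder values for illustration only; these are not measured results.
scores = {"trec-covid": 0.70, "fiqa": 0.40, "scifact": 0.70, "arguana": 0.60}
headline = beir_headline(scores)
print(f"mean NDCG@10 = {headline:.3f}")
```

Because the mean is taken over datasets rather than over queries, a small dataset moves the headline as much as a large one.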

[Figure: BEIR as a zero-shot transfer benchmark. One source, eighteen target domains, one average score. A dense or hybrid retriever trained on the source (MS MARCO, web-search QA) is evaluated on TREC-COVID (vocabulary mismatch), FiQA-2018 (informal-language tolerance), NFCorpus (long-tail biomedical relevance), ArguAna (semantic opposition), SciFact (fine-grained claim verification), HotpotQA (multi-hop composition), CQADupStack (near-paraphrase detection), Touché-2020 (stance-aware ranking), plus 10 more (NQ, Quora, DBPedia, Climate-FEVER, …). Headline metric: mean NDCG@10 across all 18 target domains.]

What BEIR actually tests

The point of BEIR is transfer across heterogeneous domains. Train one model on MS MARCO web search; test it cold on biomedical literature, scientific claim verification, financial opinion, counter-argument retrieval, multi-hop QA, near-paraphrase detection. Each of the 18 datasets stresses a different retrieval failure mode — vocabulary mismatch, fine-grained relevance distinctions, semantic opposition, compositional relevance — and the average NDCG@10 across the suite is meant to summarize how broadly a retriever generalizes.

That breadth is the entire reason BEIR exists. The single MS MARCO leaderboard rewarded models that overfit to web-search question form; BEIR rewards models that generalize.

Datasets and what they probe

Examples
  • TREC-COVID — biomedical literature search; tests vocabulary mismatch (queries in lay language, docs in clinical).
  • FiQA-2018 — financial opinion questions; tests informal-language tolerance.
  • SciFact — scientific claim verification; tests fine-grained relevance distinctions.
  • ArguAna — counter-argument retrieval; tests semantic opposition rather than similarity.
  • HotpotQA — multi-hop QA; tests retrieval where relevance requires composing two documents.
  • CQADupStack — Stack Exchange duplicate questions; tests near-paraphrase detection across 12 sub-domains.
  • Touché-2020 — argument retrieval on controversial topics; tests stance-aware ranking.

The diversity is the point — a model that wins TREC-COVID and loses ArguAna has uneven generalization, and the BEIR headline number captures that.

Why NDCG@10 is the headline

BEIR datasets vary wildly in relevance shape. Some have one relevant doc per query (NQ); some have hundreds (TREC-COVID). Recall@k is uninterpretable across them: you cannot meaningfully average “found 8 of 10 relevant” with “found 8 of 200.” NDCG normalizes against an ideal ranking per query, so per-query scores live in [0, 1] and average meaningfully across datasets.
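The contrast can be sketched in a few lines. This is an illustrative implementation, not BEIR's own evaluation code: it assumes linear gain with a log2 rank discount (some implementations use an exponential gain instead), and the queries and doc ids are toy data.

```python
import math

def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k gains (linear gain, log2 discount)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_ids, qrels, k=10):
    """NDCG@k for one query; qrels maps doc id -> label (0 or missing = not relevant)."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_ids]
    ideal_dcg = dcg_at_k(sorted(qrels.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

def recall_at_k(ranked_ids, qrels, k=10):
    """Fraction of all relevant docs that appear in the top k."""
    relevant = {d for d, rel in qrels.items() if rel > 0}
    if not relevant:
        return 0.0
    return sum(d in relevant for d in ranked_ids[:k]) / len(relevant)

# Query A: a single relevant doc, ranked first.
qrels_a = {"a1": 1}
run_a = ["a1", "x1", "x2"]

# Query B: 200 relevant docs, and the top 10 results are all relevant.
qrels_b = {f"b{i}": 1 for i in range(200)}
run_b = [f"b{i}" for i in range(10)]

ndcg_a, recall_a = ndcg_at_k(run_a, qrels_a), recall_at_k(run_a, qrels_a)
ndcg_b, recall_b = ndcg_at_k(run_b, qrels_b), recall_at_k(run_b, qrels_b)
print("A:", ndcg_a, recall_a)
print("B:", ndcg_b, recall_b)
```

Both top-10 lists are as good as a top-10 list can be, and NDCG@10 scores each at 1.0. Recall@10 gives 1.0 for query A and 0.05 for query B, which is why averaging it across datasets with different relevance counts says little.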

The binary-relevance limitation

BEIR’s headline number hides a sharper problem than which datasets it covers: its relevance labels are mostly binary. In most of the datasets, every (query, document) pair is either relevant or not, full stop; a few carry coarse 3-level labels. NDCG, designed around graded relevance, loses resolution when every relevant doc carries the same weight: a paper that comprehensively answers the query and one that mentions a related term in passing score identically. When frontier models are separated by fractions of a percent, this matters. The metric cannot tell two strong systems apart on a query where they retrieve the same relevant documents and differ only in how they order those documents among themselves.
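A toy illustration of that loss of resolution, assuming linear-gain NDCG and a hypothetical 0 to 3 graded scale (the doc ids and labels are invented for the example): two systems return the same three relevant documents in opposite orders.

```python
import math

def ndcg_at_k(ranked_ids, qrels, k=10):
    """NDCG@k with linear gain; qrels maps doc id -> relevance label."""
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))
    ideal = dcg(sorted(qrels.values(), reverse=True))
    return dcg([qrels.get(d, 0) for d in ranked_ids]) / ideal if ideal > 0 else 0.0

# Hypothetical graded judgments: one thorough answer, two weak ones.
graded = {"thorough": 3, "partial": 1, "passing_mention": 1}
# The same judgments collapsed to binary relevant / not relevant.
binary = {doc: int(label > 0) for doc, label in graded.items()}

system_a = ["thorough", "partial", "passing_mention"]   # best doc first
system_b = ["passing_mention", "partial", "thorough"]   # best doc last

binary_a, binary_b = ndcg_at_k(system_a, binary), ndcg_at_k(system_b, binary)
graded_a, graded_b = ndcg_at_k(system_a, graded), ndcg_at_k(system_b, graded)
print("binary:", round(binary_a, 3), round(binary_b, 3))
print("graded:", round(graded_a, 3), round(graded_b, 3))
```

Under binary labels the two systems tie; under the graded labels, the system that ranks the thorough document first scores higher.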

The fix is graded judgments. ZeroEntropy re-annotated 28 MTEB retrieval datasets with three independent LLM judges on a 0–10 scale, and the resulting leaderboard reorders meaningfully — sibling models in the same family that were indistinguishable under binary labels open up a clear gap under graded scoring. The dataset stayed the same. The measurement sharpened.

What BEIR misses

BEIR’s corpora are mostly:

  • Public web text (NQ, MS MARCO, Quora)
  • Academic abstracts (TREC-COVID, SciFact, NFCorpus)
  • Curated forum content (CQADupStack, FiQA)

Enterprise data looks fundamentally different: contracts with section numbering, support tickets with abbreviations and product names, internal wikis with implicit context, log files, code documentation. The query distribution shifts from “natural-language information needs” to “find the clause matching this fact pattern” or “find the ticket where this exact error appeared.”

Models that win BEIR can still lose on enterprise data because:

  • Vocabulary distribution is skewed (BEIR overweights medical/scientific; enterprise overweights jargon)
  • Document length and structure differ (BEIR docs are short paragraphs; enterprise docs are long and structured)
  • Relevance grading is implicit and context-dependent in enterprise

The empirical rule: a top-5 BEIR model is necessary but not sufficient for production enterprise retrieval. Always evaluate on your own data.

Variants and adjacent benchmarks

MTEB (Muennighoff et al., 2022) absorbs BEIR’s retrieval split and adds classification, clustering, STS, and other embedding-relevant tasks. MTEB-Retrieval and BEIR scores on overlapping datasets are essentially the same numbers. For multilingual evaluation, MIRACL and MIRACL-VISION extend the BEIR shape across languages.

For reranking-specific work, BEIR is run as a two-stage eval: BM25 retrieves top-100, the reranker reorders. The reranker’s lift over BM25 alone is the headline.
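The two-stage protocol can be sketched as follows. This is a hedged outline, not BEIR's harness: `first_stage` and `rerank_score` are hypothetical callables standing in for a BM25 retriever and a reranker model, and the data is a toy example.

```python
import math

def ndcg_at_10(ranked_ids, qrels):
    """NDCG@10 with linear gain; qrels maps doc id -> relevance label."""
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:10]))
    ideal = dcg(sorted(qrels.values(), reverse=True))
    return dcg([qrels.get(d, 0) for d in ranked_ids]) / ideal if ideal > 0 else 0.0

def rerank_lift(queries, qrels, first_stage, rerank_score, depth=100):
    """Return (mean first-stage NDCG@10, mean reranked NDCG@10) over all queries."""
    base, reranked = [], []
    for qid, text in queries.items():
        candidates = first_stage(text)[:depth]            # stage 1: top-`depth` ids
        reordered = sorted(candidates,                    # stage 2: reorder only
                           key=lambda d: rerank_score(text, d), reverse=True)
        base.append(ndcg_at_10(candidates, qrels[qid]))
        reranked.append(ndcg_at_10(reordered, qrels[qid]))
    return sum(base) / len(base), sum(reranked) / len(reranked)

# Toy stand-ins (assumptions): a fixed candidate list and a reranker that prefers d3.
queries = {"q1": "toy query"}
qrels = {"q1": {"d3": 1}}
first_stage = lambda text: ["d1", "d2", "d3", "d4"]
rerank_score = lambda text, doc_id: 1.0 if doc_id == "d3" else 0.0

base_ndcg, reranked_ndcg = rerank_lift(queries, qrels, first_stage, rerank_score)
print(f"first stage {base_ndcg:.3f} -> reranked {reranked_ndcg:.3f}, "
      f"lift {reranked_ndcg - base_ndcg:+.3f}")
```

The reranker only reorders what the first stage returned, so a relevant document missing from the candidate list caps the achievable score regardless of reranker quality.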

Practical use

Treat BEIR as a sanity check, not a release gate. A new embedding or reranker should land within 1-2 NDCG@10 points of the leaderboard top; if it doesn’t, something is broken in training. The difference between rank 1 and rank 5 on BEIR rarely predicts the difference on your production traffic. Build a small in-domain eval set and trust it more than the leaderboard.

Go further

Why is NDCG@10 the headline BEIR metric?

BEIR datasets have wildly varying numbers of relevant documents per query, so a metric normalized per query is the only fair way to compare. NDCG@10 captures both whether the relevant docs are retrieved and how high they rank, in a single number that is comparable across datasets with different relevance distributions.

How does BEIR differ from MTEB?

BEIR is purely retrieval (one query, many candidate docs, find the relevant ones). MTEB is broader — embedding tasks including classification, clustering, STS, summarization. MTEB includes the BEIR datasets as its retrieval split, so a strong BEIR model is a strong MTEB-retrieval model by construction.

Why is BEIR not enough on its own?

BEIR datasets are well-studied — leaderboards saturate via subtle overfitting, and the domains skew academic/Western/English. A strong BEIR score does not imply strong performance on enterprise corpora (legal contracts, support tickets, internal docs) which look nothing like CQADupStack or HotpotQA. Always evaluate on your own data.
