Also known as: BEIR, Benchmarking Information Retrieval
TL;DR
BEIR is a heterogeneous benchmark of 18 retrieval datasets across domains — biomedical, news, finance, scientific QA, fact-checking — designed to test zero-shot retrieval. The standard reference for whether a retriever generalizes beyond MS MARCO.
BEIR (Benchmarking-IR) is the de facto standard for zero-shot retrieval evaluation. Released by Thakur et al. in 2021, it bundles 18 heterogeneous public datasets (Natural Questions, FiQA, SciFact, TREC-COVID, NFCorpus, ArguAna, and others) into a single evaluation harness. Models are trained on one source (usually MS MARCO) and tested zero-shot on the rest. The headline number is mean NDCG@10 across the suite.
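As a rough illustration of how a suite-level headline can be assembled, the sketch below averages per-query NDCG@10 within each dataset and then takes an unweighted mean across datasets. The dataset names are real, but the scores are made up, and the unweighted (macro) averaging is an assumption about the aggregation rather than something stated above.

```python
from statistics import mean

# Hypothetical per-query NDCG@10 values grouped by dataset (illustrative only).
per_query_scores = {
    "trec-covid": [0.8, 0.6],            # few queries
    "nfcorpus": [0.2, 0.4, 0.3, 0.3],    # more queries
}

def dataset_scores(per_query):
    """Mean NDCG@10 within each dataset."""
    return {name: mean(vals) for name, vals in per_query.items()}

def macro_average(per_query):
    """Assumed headline: every dataset counts once, whatever its query count."""
    return mean(dataset_scores(per_query).values())

def pooled_average(per_query):
    """Alternative shown for contrast: every query counts once."""
    return mean(v for vals in per_query.values() for v in vals)

headline = macro_average(per_query_scores)
pooled = pooled_average(per_query_scores)
print(f"macro={headline:.3f} pooled={pooled:.3f}")
```

The two averages differ whenever datasets have different query counts, which is why it is worth checking which one a reported number uses before comparing it with another.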
What BEIR actually tests
The point of BEIR is transfer across heterogeneous domains. Train one model on MS MARCO web search; test it cold on biomedical literature, scientific claim verification, financial opinion, counter-argument retrieval, multi-hop QA, near-paraphrase detection. Each of the 18 datasets stresses a different retrieval failure mode — vocabulary mismatch, fine-grained relevance distinctions, semantic opposition, compositional relevance — and the average NDCG@10 across the suite is meant to summarize how broadly a retriever generalizes.
That breadth is the entire reason BEIR exists. The single MS MARCO leaderboard rewarded models that overfit to web-search question form; BEIR rewards models that generalize.
Datasets and what they probe
Examples
TREC-COVID — biomedical literature search; tests vocabulary mismatch (queries in lay language, docs in clinical).
Touché-2020 — argument retrieval on controversial topics; tests stance-aware ranking.
The diversity is the point — a model that wins TREC-COVID and loses ArguAna has uneven generalization, and the BEIR headline number captures that.
Why NDCG@10 is the headline
BEIR datasets vary wildly in relevance shape. Some have one relevant doc per query (NQ); some have hundreds (TREC-COVID). Recall@10 is uninterpretable across them: you cannot meaningfully average “found 8 of 10 relevant” with “found 8 of 200.” NDCG@k normalizes against an ideal ranking per query, so per-query scores live in [0, 1] and average meaningfully across datasets.
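A minimal sketch of that normalization, assuming the common linear-gain form of DCG (gain divided by log2 of the 1-based rank plus one). The two queries are hypothetical: one has a single relevant document, the other has 200, and both receive a perfect top-10.

```python
import math

def dcg_at_k(gains, k=10):
    # Linear-gain DCG: the gain at 0-based rank i is discounted by log2(i + 2).
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_gains, judged_gains, k=10):
    # Normalize by the DCG of the best possible ordering of all judged documents.
    ideal = dcg_at_k(sorted(judged_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0

def recall_at_k(ranked_gains, num_relevant, k=10):
    hits = sum(1 for g in ranked_gains[:k] if g > 0)
    return hits / num_relevant if num_relevant else 0.0

# Query A: one relevant document, retrieved at rank 1.
a_ranked, a_judged = [1] + [0] * 9, [1]
# Query B: 200 relevant documents; all ten retrieved documents are relevant.
b_ranked, b_judged = [1] * 10, [1] * 200

ndcg_a = ndcg_at_k(a_ranked, a_judged)
ndcg_b = ndcg_at_k(b_ranked, b_judged)
recall_a = recall_at_k(a_ranked, len(a_judged))
recall_b = recall_at_k(b_ranked, len(b_judged))
print(f"NDCG@10   A={ndcg_a:.2f}  B={ndcg_b:.2f}")
print(f"Recall@10 A={recall_a:.2f}  B={recall_b:.2f}")
```

Both rankings are as good as a top-10 can be, and NDCG@10 gives each a 1.0. Recall@10 gives 1.0 to query A but only 0.05 to query B, because no top-10 can cover 200 relevant documents.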
The binary-relevance limitation
BEIR’s headline number hides a sharper problem than which datasets it covers: its relevance labels are largely binary. In most of its datasets, every (query, document) pair is either relevant or not, full stop. NDCG was designed around graded relevance and loses resolution when every relevant doc carries the same weight: a paper that comprehensively answers the query and one that mentions a related term in passing score identically. When frontier models are separated by fractions of a percent, this matters. The metric cannot tell two strong systems apart on a query where both place relevant documents at the same ranks but order those relevant documents differently.
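The effect can be reproduced with a toy query. In the sketch below, the document names and the grades (3 for a comprehensive answer, 1 for a passing mention) are hypothetical; the two systems retrieve the same two documents in opposite order.

```python
import math

def ndcg_at_k(ranked_gains, judged_gains, k=10):
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))
    ideal = dcg(sorted(judged_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0

# Hypothetical judgments for one query.
graded = {"comprehensive_paper": 3, "passing_mention": 1}
binary = {doc: 1 for doc in graded}  # the same documents collapsed to relevant / not

system_1 = ["comprehensive_paper", "passing_mention"]
system_2 = ["passing_mention", "comprehensive_paper"]

def score(ranking, labels):
    gains = [labels.get(doc, 0) for doc in ranking]
    return ndcg_at_k(gains, list(labels.values()))

binary_1, binary_2 = score(system_1, binary), score(system_2, binary)
graded_1, graded_2 = score(system_1, graded), score(system_2, graded)
print(f"binary: {binary_1:.3f} vs {binary_2:.3f}")
print(f"graded: {graded_1:.3f} vs {graded_2:.3f}")
```

Under binary labels both systems score 1.0. Under the graded labels the system that ranks the comprehensive paper first keeps 1.0, while the other drops to roughly 0.80.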
The fix is graded judgments. ZeroEntropy re-annotated 28 MTEB retrieval datasets with three independent LLM judges on a 0–10 scale, and the resulting leaderboard reorders meaningfully — sibling models in the same family that were indistinguishable under binary labels open up a clear gap under graded scoring. The dataset stayed the same. The measurement sharpened.
Enterprise data looks fundamentally different: contracts with section numbering, support tickets with abbreviations and product names, internal wikis with implicit context, log files, code documentation. The query distribution shifts from “natural-language information needs” to “find the clause matching this fact pattern” or “find the ticket where this exact error appeared.”
Models that win BEIR can still lose on enterprise data because:
Vocabulary distribution is skewed (BEIR overweights medical/scientific; enterprise overweights jargon)
Document length and structure differ (BEIR docs are short paragraphs; enterprise docs are long and structured)
Relevance grading is implicit and context-dependent in enterprise
The empirical rule: a top-5 BEIR model is necessary but not sufficient for production enterprise retrieval. Always evaluate on your own data.
Variants and adjacent benchmarks
MTEB (Muennighoff et al., 2022) absorbs BEIR’s retrieval split and adds classification, clustering, STS, and other embedding-relevant tasks. MTEB-Retrieval and BEIR scores on overlapping datasets are essentially the same numbers. For multilingual, MIRACL and MIRACL-VISION extend the BEIR shape across languages — see cross-lingual retrieval for the multilingual story.
For reranking-specific work, BEIR is run as a two-stage eval: BM25 retrieves top-100, the reranker reorders. The reranker’s lift over BM25 alone is the headline.
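The two-stage protocol can be sketched as follows. This is an illustration under stated assumptions, not a reference harness: `first_stage` stands in for a BM25 top-100 list, `toy_score` is a hypothetical lookup standing in for a real reranker, and the document ids and judgments are made up.

```python
import math

def ndcg_at_10(ranking, qrels):
    """ranking: doc ids in rank order; qrels: doc id -> relevance label."""
    dcg = sum(qrels.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranking[:10]))
    ideal_gains = sorted(qrels.values(), reverse=True)[:10]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0

def rerank(query, candidates, score):
    # Reorder the first-stage candidates by reranker score, highest first.
    return sorted(candidates, key=lambda doc: score(query, doc), reverse=True)

def reranker_lift(query, first_stage, qrels, score, depth=100):
    candidates = first_stage[:depth]  # the reranker only sees the top `depth` docs
    before = ndcg_at_10(candidates, qrels)
    after = ndcg_at_10(rerank(query, candidates, score), qrels)
    return before, after, after - before

# Hypothetical data: the first stage puts the one relevant doc at rank 2.
first_stage = ["d3", "d1", "d2"]
qrels = {"d1": 1}
toy_scores = {"d1": 0.9, "d2": 0.2, "d3": 0.1}

def toy_score(query, doc):
    return toy_scores.get(doc, 0.0)

before, after, lift = reranker_lift("example query", first_stage, qrels, toy_score)
print(f"before={before:.3f} after={after:.3f} lift={lift:.3f}")
```

Because the reranker only reorders what the first stage returned, a relevant document that falls outside the top 100 is lost to both scores; the lift measures ordering quality, not recall.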
Practical use
Treat BEIR as a sanity check, not a release gate. A new embedding or reranker should land within 1-2 NDCG@10 points of the leaderboard top; if it doesn’t, something is broken in training. The difference between rank 1 and rank 5 on BEIR rarely predicts the difference on your production traffic. Build a small in-domain eval set and trust it more than the leaderboard.
Go further
Why is NDCG@10 the headline BEIR metric?
BEIR datasets have wildly varying numbers of relevant documents per query, so a metric normalized per query is the only fair way to compare. NDCG@10 captures both whether the relevant docs are retrieved and how high they rank, in a single number that is comparable across datasets with different relevance distributions.
How does BEIR differ from MTEB?
BEIR is purely retrieval (one query, many candidate docs, find the relevant ones). MTEB is broader — embedding tasks including classification, clustering, STS, summarization. MTEB includes the BEIR datasets as its retrieval split, so a strong BEIR model is a strong MTEB-retrieval model by construction.
What are the limitations of BEIR?
BEIR datasets are well-studied — leaderboards saturate via subtle overfitting, and the domains skew academic/Western/English. A strong BEIR score does not imply strong performance on enterprise corpora (legal contracts, support tickets, internal docs) which look nothing like CQADupStack or HotpotQA. Always evaluate on your own data.