MS MARCO

Also known as: MS MARCO, Microsoft MAchine Reading COmprehension, MS MARCO Passage Ranking

TL;DR

MS MARCO is Microsoft's web-search dataset of ~1M Bing queries paired with passages and human relevance judgments. It is the standard training corpus for retrievers and rerankers, and the training source for most modern dense retrievers.

MS MARCO (Microsoft MAchine Reading COmprehension, 2016) is the dataset that made modern neural retrieval possible. ~1M real Bing queries are paired with answers extracted from web passages, and a downstream “passage ranking” track ships ~503K training queries with one labeled relevant passage each over a corpus of 8.8M passages. Almost every modern dense retriever and reranker is trained on this set.

What’s in the dataset

The “passage ranking” task — by far the most-used split — has:

8.8M passages (web-scraped, ~50-100 words each)
~503K training queries with relevance labels
6,980 dev queries (used as the leaderboard eval)
43 test queries in TREC-DL 2019 / 54 in TREC-DL 2020 with graded relevance judgments (the harder eval)

Labels are sparse: dev queries have ~1.1 relevant passages on average. Training queries are typically labeled with a single positive (the passage a human annotator marked as supporting the answer), though some have multiple.
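To make the sparsity figure concrete, here is a minimal sketch that counts labeled positives per query. It assumes TREC-style qrels rows (query ID, an unused iteration column, passage ID, relevance), and the rows below are made-up toy data, not real MS MARCO entries.

```python
from collections import defaultdict


def positives_per_query(qrels_lines):
    """Average number of labeled relevant passages per query.

    Assumes each row has four whitespace-separated fields:
    query_id, iteration, passage_id, relevance.
    """
    positives = defaultdict(set)
    for line in qrels_lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed rows
        qid, _iteration, pid, rel = parts
        if int(rel) > 0:
            positives[qid].add(pid)
    if not positives:
        return 0.0
    return sum(len(p) for p in positives.values()) / len(positives)


# Toy rows with hypothetical IDs: three queries, four positive labels.
toy_qrels = [
    "q1\t0\tp10\t1",
    "q2\t0\tp20\t1",
    "q2\t0\tp21\t1",
    "q3\t0\tp30\t1",
]
print(positives_per_query(toy_qrels))
```

Running the same count over a real qrels file is a quick sanity check before training, since a value far from ~1 suggests the wrong file or split.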

Why MS MARCO became canonical

It was the first retrieval dataset large enough to train neural rankers from scratch. Pre-MS MARCO, IR research ran on TREC datasets (~50 queries each), far too few to train a transformer. MS MARCO’s ~503K training queries cleared the bar. By 2019, neural rankers fine-tuned on it beat BM25 on dev by wide margins; by 2020, it was the default training set for published dense retrievers such as ANCE and ColBERTv1.

Where it’s load-bearing

Examples

Training corpus for nearly every general-purpose dense retriever (E5, BGE, GTE, Contriever variants).
Foundation for reranker training — most cross-encoders fine-tune on MS MARCO query-passage triples before any domain adaptation.
The standard evaluation for first-pass retrieval improvements — papers report MS MARCO MRR@10 the way LLM papers report MMLU.
Source of TREC-DL 2019/2020/2021 — the more rigorous graded-relevance evals built atop MS MARCO documents.
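Since MRR@10 is the number most papers quote, a minimal sketch of the metric may help. The function name, the dict-based input shapes, and the toy IDs are assumptions for illustration, not the official evaluation script.

```python
def mrr_at_k(rankings, relevant, k=10):
    """Mean reciprocal rank of the first relevant passage within the top k.

    rankings: query_id -> list of passage IDs, best first.
    relevant: query_id -> set of labeled relevant passage IDs.
    Queries with no relevant passage in the top k contribute 0.
    """
    if not rankings:
        return 0.0
    total = 0.0
    for qid, ranked in rankings.items():
        labeled = relevant.get(qid, set())
        for rank, pid in enumerate(ranked[:k], start=1):
            if pid in labeled:
                total += 1.0 / rank
                break  # only the first hit counts
    return total / len(rankings)


# Toy data: hit at rank 1, hit at rank 4, and a miss.
rankings = {
    "q1": ["p1", "p2"],
    "q2": ["p9", "p8", "p7", "p4"],
    "q3": ["p5", "p6"],
}
relevant = {"q1": {"p1"}, "q2": {"p4"}, "q3": {"p0"}}
print(mrr_at_k(rankings, relevant))  # (1 + 1/4 + 0) / 3
```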

The label-sparsity problem

The dev set has ~1.1 labeled relevant passages per query out of 8.8M. Every other passage is unlabeled: training code assumes it is irrelevant, but in practice many are relevant or near-relevant. During training, negative sampling will sometimes draw a relevant-but-unlabeled passage, which the loss then treats as a negative to push away.

This noise has two consequences:

Models trained with in-batch negatives alone learn from a noisy signal — random negatives are often easy and uninformative, and the rare false-negatives push the model away from genuinely relevant content.
Hard-negative mining (sourcing negatives from BM25 top-K or from a previous-iteration retriever) helps disambiguate, but introduces its own risk: a top-BM25 candidate that’s actually relevant becomes an even harder false-negative.
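The first consequence can be illustrated with a small sketch of an in-batch-negatives loss. This assumes the common softmax cross-entropy formulation, where passage i is the labeled positive for query i and every other passage in the batch is treated as a negative; the similarity scores are invented for the example.

```python
import math


def in_batch_loss(sim):
    """Mean softmax cross-entropy over a square similarity matrix.

    sim[i][j] is the score of query i against passage j. Passage i is the
    labeled positive for query i; all other columns act as negatives.
    """
    losses = []
    for i, row in enumerate(sim):
        top = max(row)  # subtract the max for numerical stability
        log_z = top + math.log(sum(math.exp(s - top) for s in row))
        losses.append(log_z - row[i])
    return sum(losses) / len(losses)


# Clean batch: each query scores only its own passage highly.
clean = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0]]

# Noisy batch: passage 1 is also relevant to query 0, and the model
# (correctly) scores it highly, but the loss counts it as a negative.
noisy = [[5.0, 4.5, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0]]

print(in_batch_loss(clean), in_batch_loss(noisy))
```

The noisy batch yields the larger loss, so the gradient pushes the model to score the unlabeled relevant passage lower, which is the false-negative effect described above.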

The standard fix: mine hard negatives, then re-label them with a stronger model (typically a cross-encoder) or filter out candidates that score too close to the positive. ANCE, RocketQA, and SimLM all implement versions of this loop.
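A minimal sketch of that mine-then-filter step, under stated assumptions: the candidate list stands in for a first-stage retriever's top-K, `teacher_scores` stands in for relevance scores in [0, 1] from a hypothetical stronger model, and the 0.9 threshold is arbitrary. None of these names come from ANCE, RocketQA, or SimLM.

```python
def filter_hard_negatives(candidates, positives, teacher_scores,
                          max_score=0.9, n=2):
    """Pick up to n hard negatives from a ranked candidate list.

    candidates: passage IDs ranked by a first-stage retriever, best first.
    positives: set of labeled relevant passage IDs for the query.
    teacher_scores: passage_id -> score from a stronger model (assumed).
    Candidates the teacher rates at or above max_score are dropped as
    likely unlabeled positives. Unscored candidates default to 0.0,
    meaning they are kept as negatives.
    """
    kept = []
    for pid in candidates:
        if pid in positives:
            continue  # never use a labeled positive as a negative
        if teacher_scores.get(pid, 0.0) >= max_score:
            continue  # probable false negative, skip it
        kept.append(pid)
        if len(kept) == n:
            break
    return kept


# Toy query: p1 is the labeled positive, p7 looks relevant to the teacher.
candidates = ["p1", "p7", "p3", "p9"]
positives = {"p1"}
teacher_scores = {"p7": 0.97, "p3": 0.40, "p9": 0.10}
print(filter_hard_negatives(candidates, positives, teacher_scores))
```

Dropping suspicious candidates is the simplest option; an alternative design keeps them and trains against the teacher's soft score instead of a hard 0/1 label.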

Where MS MARCO is overrated

The dataset’s other limits:

English only. MS MARCO is monolingual; multilingual extensions (mMARCO machine-translated, MIRACL) exist but are noisier.
Saturated. Top models on the leaderboard are within rounding error of each other. Differences of 0.5 points of MRR@10 are usually noise; see statistical significance.
Domain-narrow. Web passages are uniform in length and style. Real corpora — legal, medical, code — look nothing like MS MARCO docs.
Selection bias. Labels mark the passage an annotator picked from Bing’s top results, not the most relevant passage in the corpus. Models inherit the bias of that candidate pool.
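For the saturation point, one way to check whether a small MRR@10 gap is noise is a paired randomization (sign-flip) test on per-query reciprocal ranks. This is a sketch of one common choice of test, not something the leaderboard prescribes; the function name and defaults are assumptions.

```python
import random


def paired_randomization_test(scores_a, scores_b, trials=10_000, seed=0):
    """Two-sided p-value for the mean per-query difference between systems.

    scores_a, scores_b: per-query metric values (e.g. reciprocal rank) for
    the same queries in the same order. Under the null hypothesis the sign
    of each per-query difference is arbitrary, so signs are flipped at
    random and the resulting mean is compared to the observed one.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(trials):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / len(diffs)) >= observed:
            hits += 1
    return hits / trials


# Toy per-query reciprocal ranks for two hypothetical systems.
system_a = [1.0, 0.5, 0.0, 1.0, 0.25, 0.5, 1.0, 0.0]
system_b = [1.0, 0.33, 0.0, 0.5, 0.25, 1.0, 1.0, 0.0]
print(paired_randomization_test(system_a, system_b))
```

A large p-value means the observed gap is the kind that random sign flips produce routinely, so it should not be read as a real improvement.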

This is the gap BEIR was built to expose: train on MS MARCO, test zero-shot elsewhere, watch performance drop.

Train vs eval split nuances

Most papers report on the dev split (6,980 queries, sparse labels, MRR@10). The harder eval is TREC-DL 2019/2020, which takes a subset of MS MARCO queries and applies dense graded-relevance judgments; the NDCG@10 numbers there are more trustworthy than dev MRR. When comparing models, ask for both. A model that wins on dev MRR but loses on TREC-DL NDCG is likely overfitting to the sparse dev labels.
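A small sketch shows why graded judgments can separate systems that sparse labels cannot. It assumes linear gain with a log2 rank discount (some implementations use 2^grade - 1 instead), and the grades and rankings are invented.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with linear gain and log2 rank discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked, grades, k=10):
    """NDCG@k for one query.

    ranked: passage IDs, best first.
    grades: passage_id -> graded relevance (0 if absent).
    """
    gains = [grades.get(pid, 0) for pid in ranked[:k]]
    ideal = sorted(grades.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical graded judgments; under sparse labels only "a" is positive.
grades = {"a": 3, "b": 2, "c": 1}
system_x = ["a", "x", "y"]  # finds the labeled passage, then irrelevant ones
system_y = ["a", "b", "c"]  # also surfaces the unlabeled relevant passages

# Both systems put "a" at rank 1, so their reciprocal rank is identical,
# but graded NDCG@10 tells them apart.
print(ndcg_at_k(system_x, grades), ndcg_at_k(system_y, grades))
```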

Go further

Why are MS MARCO labels so sparse?

Each query has ~1 relevant passage labeled out of 8.8M total. Annotators marked only the passages that answered the query from a short candidate list rather than judging the full corpus with graded relevance, so relevant-but-unlabeled passages count as negatives during training. This drives the importance of [hard-negative mining](/concepts/hard-negative-mining/): random negatives are mostly too easy, and the hardest mined candidates are sometimes actually relevant.

MRR@10 vs NDCG@10 — which one matters for MS MARCO?

The leaderboard uses MRR@10 because most dev queries have only one labeled relevant doc; NDCG with a single binary-relevant doc reduces to a function of that doc’s rank, just as reciprocal rank does, so it adds little. For multi-relevance datasets like TREC-DL (which uses MS MARCO docs but TREC-style graded judgments), NDCG@10 is the metric. Most papers report both.
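The single-positive case can be worked out directly. The sketch below assumes binary gain, the standard log2 discount, and one relevant passage at rank r within the top 10; both metrics are 0 if it falls outside the top 10.

```latex
% One relevant passage at rank r (1 <= r <= 10), binary gain 1.
\mathrm{RR} = \frac{1}{r},
\qquad
\mathrm{DCG@10} = \frac{1}{\log_2(1 + r)},
\qquad
\mathrm{IDCG@10} = \frac{1}{\log_2 2} = 1

\mathrm{NDCG@10} = \frac{\mathrm{DCG@10}}{\mathrm{IDCG@10}} = \frac{1}{\log_2(1 + r)}

% Example: r = 3 gives RR = 1/3 \approx 0.33 and NDCG@10 = 1/\log_2 4 = 0.5.
```

Both quantities decrease monotonically in r, so on a single query they order systems identically. The averages over queries can still differ, because NDCG penalizes lower ranks less steeply than reciprocal rank does.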

Should I train my reranker on MS MARCO?

Probably yes, for general-purpose models. MS MARCO is the largest publicly available query-passage relevance dataset, and most production rerankers use it as the foundation. But add domain-specific data on top — MS MARCO alone produces a model good at web search, not at your enterprise corpus.
