Massive Text Embedding Benchmark — a public benchmark covering 50+ datasets across retrieval, classification, clustering, and more. The de facto leaderboard for embedding models, despite some well-documented limitations in its retrieval portion.
MTEB (Massive Text Embedding Benchmark) is a public benchmark suite for embedding models, originally introduced by HuggingFace and now community-maintained. It runs an embedding model across 50+ datasets and ~8 task categories: retrieval, classification, clustering, pair classification, reranking, semantic textual similarity (STS), summarization scoring, and bitext mining.
The MTEB leaderboard on HuggingFace is the most-cited single ranking of embedding models. Most notable model releases report scores on it.
What it gets right
Breadth — many task types and domains, so a single composite is a reasonable proxy for general-purpose embedding strength.
Reproducibility — anyone can run it end-to-end on a public stack.
Leaderboard pressure — has materially pulled the field forward.
What's actually inside MTEB's 50+ datasets
Retrieval — MS MARCO, NQ, HotpotQA, FiQA, SciFact, NFCorpus, TREC-COVID
MTEB’s retrieval portion largely uses binary relevance labels: each (query, document) pair is either relevant or not. But real retrieval has grades: highly relevant > marginally relevant > tangentially related > unrelated. Binary labels collapse all of those into a single bit, so a model gets full credit for surfacing relevant documents even if it orders them poorly.
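A minimal sketch of the effect, assuming NDCG with linear gains as the metric. The document IDs, the 0-3 grades, and the two orderings are hypothetical, chosen only to show how two rankings can tie under binary labels and separate under graded ones.

```python
import math

def dcg(gains):
    # Discounted cumulative gain with linear gains: gain / log2(rank + 1), ranks start at 1.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(ranked_gains):
    # Normalize by the DCG of the ideal (descending) ordering of the same gains.
    ideal = dcg(sorted(ranked_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0

# Hypothetical graded labels: 0 = unrelated ... 3 = highly relevant.
graded = {"d1": 3, "d2": 1, "d3": 1, "d4": 0}
# Binary view of the same labels: any grade above 0 counts as relevant.
binary = {doc: int(grade > 0) for doc, grade in graded.items()}

good_order = ["d1", "d2", "d3", "d4"]  # most relevant document first
poor_order = ["d3", "d2", "d1", "d4"]  # most relevant document third

def score(order, labels):
    # NDCG of one ranking under one labeling scheme.
    return ndcg([labels[doc] for doc in order])

binary_good, binary_poor = score(good_order, binary), score(poor_order, binary)
graded_good, graded_poor = score(good_order, graded), score(poor_order, graded)

print(f"binary: good={binary_good:.3f} poor={binary_poor:.3f}")
print(f"graded: good={graded_good:.3f} poor={graded_poor:.3f}")
```

Under the binary labels both orderings place three "relevant" documents ahead of the irrelevant one, so they score identically. Under the graded labels the ordering that buries the best document scores lower.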
ZeroEntropy’s MTEB Evals work re-annotated 28 retrieval datasets from MTEB with graded relevance — three independent LLM judges each scoring on a 0-10 scale. The ranking of models under graded relevance is meaningfully different from the binary leaderboard. Models that look strong on binary labels but produce noisy orderings drop; models that produce well-calibrated orderings (zembed-1 in particular) jump.
Three observable signatures of leaderboard overfitting, all visible to a careful evaluator:
Within-dataset overfitting: models trained on MTEB’s training splits (or paraphrases of them) post large wins on those specific datasets and underperform on held-out corpora of similar shape. Compare the same task on a private benchmark; a large gap points to contamination.
Binary-vs-graded gap: leaderboard-tuned models often retrieve relevant docs (binary win) but order them poorly (graded loss). Re-running with graded labels reshuffles rankings.
Out-of-domain transfer: strong general-purpose embedders degrade gracefully on unseen domains; overfitted ones fall off a cliff. A drop of 15+ NDCG points when moving from MTEB to a private domain is a contamination tell.
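The out-of-domain check can be sketched as a simple comparison of scores. The model names and NDCG values below are hypothetical placeholders, and the 15-point threshold is the rule of thumb stated above, not a calibrated cutoff.

```python
# Hypothetical NDCG@10 scores on a 0-100 scale; replace with your own measurements.
scores = {
    "model_a": {"mteb": 62.0, "private": 58.5},
    "model_b": {"mteb": 66.0, "private": 47.0},
}

DROP_THRESHOLD = 15.0  # the "15+ NDCG points" rule of thumb

def transfer_drops(scores):
    # Points lost when moving from the public benchmark to the private corpus.
    return {name: s["mteb"] - s["private"] for name, s in scores.items()}

def flag_suspects(scores, threshold=DROP_THRESHOLD):
    # Models whose drop meets or exceeds the threshold, in alphabetical order.
    drops = transfer_drops(scores)
    return sorted(name for name, drop in drops.items() if drop >= threshold)

drops = transfer_drops(scores)
suspects = flag_suspects(scores)

for name, drop in drops.items():
    print(f"{name}: drop of {drop:.1f} points")
print("possible contamination:", suspects)
```

A large drop is a signal rather than proof: a private domain that is simply harder will also lower scores, so it helps to compare drops across several models on the same corpus.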
MTEB-as-marketing
A practical caveat: many embedding models are partially trained on data that overlaps with MTEB datasets (or close paraphrases). This contaminates the leaderboard — the model has seen the eval data and pattern-matches accordingly. When you evaluate on truly held-out, private corpora, MTEB rankings don’t always predict real-world quality.
The remedy: trust MTEB directionally (large gaps mean something) but evaluate on your data before committing.
ZeroEntropy publishes results on both standard MTEB and the graded-relevance reannotation; the graded version is the more honest predictor of production retrieval quality.
Go further
Why are graded labels such a big deal for MTEB?
Binary labels reward 'did you find a relevant doc' but hide whether the most relevant one sits at position 1 or 9 among other docs labeled relevant: both orderings score the same. Graded labels expose ordering quality, where well-calibrated models genuinely separate from leaderboard-tuned ones.
How do I tell if a model has been trained on MTEB data?
You can't, easily — but the best signal is to evaluate on your own private corpus and compare relative rankings. If the leaderboard order doesn't predict your eval order, contamination is a likely culprit.
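One way to make "compare relative rankings" concrete is a rank correlation between the leaderboard order and your own eval order. This is a minimal sketch using Kendall's tau without tie handling; the model names and both orderings are hypothetical.

```python
from itertools import combinations

def kendall_tau(order_a, order_b):
    # Kendall rank correlation between two orderings of the same items (no ties).
    # +1.0 means identical order, -1.0 means fully reversed, near 0 means unrelated.
    if set(order_a) != set(order_b) or len(order_a) != len(set(order_a)):
        raise ValueError("both orderings must contain the same unique items")
    if len(order_a) < 2:
        raise ValueError("need at least two items to compare")
    pos_a = {item: i for i, item in enumerate(order_a)}
    pos_b = {item: i for i, item in enumerate(order_b)}
    concordant = discordant = 0
    for x, y in combinations(order_a, 2):
        # A pair is concordant when both orderings agree on which item ranks higher.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical orderings, best model first.
leaderboard_order = ["m1", "m2", "m3", "m4"]
private_eval_order = ["m3", "m1", "m4", "m2"]

tau = kendall_tau(leaderboard_order, private_eval_order)
print(f"Kendall tau between leaderboard and private eval: {tau:.2f}")
```

A tau near 1 suggests the leaderboard predicts your results well. A tau near zero or below means the public ranking tells you little about your corpus, with contamination as one possible explanation among others such as domain mismatch.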
Does MTEB evaluate rerankers?
There's a small reranking subset, but MTEB is primarily an embedder benchmark: it scores models that produce vectors. Reranker evaluation typically uses bespoke benchmarks (BEIR, the 29-dataset reranker suite) where the model scores (query, doc) pairs end-to-end.