MTEB

Also known as: Massive Text Embedding Benchmark

TL;DR

A public benchmark covering 50+ datasets across retrieval, classification, clustering, and more. It is the de facto leaderboard for embedding models, despite some well-documented limitations in its retrieval portion.

MTEB (Massive Text Embedding Benchmark) is a public benchmark suite for embedding models, originally introduced by Hugging Face and now community-maintained. It runs an embedding model across 50+ datasets in 8 task categories: retrieval, classification, clustering, pair classification, reranking, semantic textual similarity (STS), summarization scoring, and bitext mining.

The MTEB leaderboard on Hugging Face is the most-cited single ranking of embedding models. Nearly every notable model release reports results on it.

What it gets right

  • Breadth: many task types and domains, so a single composite score is a reasonable proxy for general-purpose embedding strength.
  • Reproducibility: anyone can run it end-to-end on a public stack.
  • Leaderboard pressure: a shared public ranking gives model builders a common target, which has materially pulled the field forward.

What's actually inside MTEB's 50+ datasets

  • Retrieval: MS MARCO, NQ, HotpotQA, FiQA, SciFact, NFCorpus, TREC-COVID
  • Classification: Banking77, Emotion, Amazon Reviews, MTOP Intent
  • Clustering: arXiv, bioRxiv, Reddit, StackExchange
  • STS: STS Benchmark, BIOSSES, SICK-R (all scored on semantic similarity)
  • Reranking: AskUbuntu, SciDocs, StackOverflowDupQuestions

What it gets wrong (and why)

MTEB’s retrieval portion uses binary relevance labels — each (query, document) pair is either relevant or not. But real retrieval has grades: highly relevant > marginally relevant > tangentially related > unrelated. Binary labels collapse all of those into a single bit and reward models that get the binary right even if their ordering is poor.
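A small worked example may make the collapse concrete. The sketch below computes NDCG with linear gains for two hypothetical rankings of the same four documents. The grades (0 to 3), the binarization threshold, and the function names are illustrative assumptions, not values taken from MTEB.

```python
import math


def dcg(gains):
    """Discounted cumulative gain for gains listed in ranked order (linear gain)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg(gains):
    """NDCG: DCG of the ranking divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def binarize(gains, threshold=1):
    """Collapse graded labels to 0/1: anything at or above threshold is 'relevant'."""
    return [1 if g >= threshold else 0 for g in gains]


# Hypothetical graded labels for the top 4 results returned by two models.
# Both models retrieved the same documents; only the order differs.
good_order = [3, 2, 1, 0]  # most relevant document first
poor_order = [1, 2, 3, 0]  # most relevant document third

binary_good = ndcg(binarize(good_order))
binary_poor = ndcg(binarize(poor_order))
graded_good = ndcg(good_order)
graded_poor = ndcg(poor_order)

print(f"binary NDCG: {binary_good:.3f} vs {binary_poor:.3f}")
print(f"graded NDCG: {graded_good:.3f} vs {graded_poor:.3f}")
```

Under binary labels both rankings become [1, 1, 1, 0] and score identically. Under graded labels the second ranking scores roughly 0.79 against 1.0 for the first, because it buries the best document at position 3.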

ZeroEntropy’s MTEB Evals work re-annotated 28 retrieval datasets from MTEB with graded relevance — three independent LLM judges each scoring on a 0-10 scale. The ranking of models under graded relevance is meaningfully different from the binary leaderboard. Models that look strong on binary labels but produce noisy orderings drop; models that produce well-calibrated orderings (zembed-1 in particular) jump.
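The text above does not say how the three judge scores are combined into one label. As a minimal sketch, assuming a plain mean per (query, document) pair plus a spread figure to surface judge disagreement, the aggregation could look like this. The function name, the data, and the choice of mean are all hypothetical.

```python
from statistics import mean, pstdev


def aggregate_judgments(scores):
    """Combine per-judge 0-10 scores for one (query, document) pair.

    Returns the mean as the graded label and the population standard
    deviation as a rough disagreement signal. Assumption: a plain mean;
    the actual aggregation used in the re-annotation is not specified.
    """
    for s in scores:
        if not 0 <= s <= 10:
            raise ValueError(f"score {s} is outside the 0-10 scale")
    return {"label": mean(scores), "spread": pstdev(scores)}


# Hypothetical scores from three judges for three documents of one query.
judge_scores = {
    ("q1", "d1"): [9, 8, 10],
    ("q1", "d2"): [4, 5, 3],
    ("q1", "d3"): [0, 1, 0],
}

graded_labels = {pair: aggregate_judgments(s) for pair, s in judge_scores.items()}

for pair, result in graded_labels.items():
    print(pair, f"label={result['label']:.2f}", f"spread={result['spread']:.2f}")
```

A median would be more robust to a single outlier judge; the mean is used here only because it is the simplest choice.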

Leaderboard overfitting leaves three observable signatures, all visible to a careful evaluator:

Within-dataset overfitting: models that train on MTEB’s training splits (or paraphrases of them) post huge wins on those specific datasets and underperform on held-out corpora of similar shape. Run the same task on a private benchmark; the size of the gap is a rough measure of the contamination.

Binary-vs-graded gap: leaderboard-tuned models often retrieve relevant docs (binary win) but order them poorly (graded loss). Re-running with graded labels reshuffles the rankings.

Out-of-domain transfer: strong general-purpose embedders degrade gracefully on unseen domains; overfitted ones fall off a cliff. A drop of 15+ NDCG points when moving from MTEB to a private domain is a contamination tell.
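The 15-point rule of thumb is easy to turn into a check. The sketch below assumes NDCG is reported on a 0-100 scale and that you already have one score per model on MTEB and on your private corpus. The model names and numbers are made up for illustration.

```python
def contamination_flags(public_ndcg, private_ndcg, drop_threshold=15.0):
    """Flag models whose NDCG drops by at least drop_threshold points
    when moving from the public benchmark to a private corpus.

    Both inputs map model name -> NDCG on a 0-100 scale.
    Returns model name -> (drop, flagged).
    """
    report = {}
    for model, public_score in public_ndcg.items():
        drop = public_score - private_ndcg[model]
        report[model] = (drop, drop >= drop_threshold)
    return report


# Hypothetical scores, not real leaderboard numbers.
public_scores = {"model_a": 62.0, "model_b": 66.0}
private_scores = {"model_a": 58.5, "model_b": 47.0}

report = contamination_flags(public_scores, private_scores)
for model, (drop, flagged) in report.items():
    status = "possible contamination" if flagged else "degrades gracefully"
    print(f"{model}: drop of {drop:.1f} points, {status}")
```

A flagged model is a prompt for closer inspection, not proof: a large drop can also come from a private domain that is simply harder.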

MTEB-as-marketing

A practical caveat: many embedding models are partially trained on data that overlaps with MTEB datasets (or close paraphrases). This contaminates the leaderboard — the model has seen the eval data and pattern-matches accordingly. When you evaluate on truly held-out, private corpora, MTEB rankings don’t always predict real-world quality.

The remedy: trust MTEB directionally (large gaps mean something) but evaluate on your data before committing.

ZeroEntropy publishes results on both standard MTEB and the graded-relevance reannotation; the graded version is the more honest predictor of production retrieval quality.

Go further

Why are graded labels such a big deal for MTEB?

Binary labels reward 'did you find a relevant doc' but hide whether the most relevant is at position 1 or 9 — both look identical. Graded labels expose ordering quality, where well-calibrated models genuinely separate from leaderboard-tuned ones.

How do I tell if a model has been trained on MTEB data?

You can't, easily — but the best signal is to evaluate on your own private corpus and compare relative rankings. If the leaderboard order doesn't predict your eval order, contamination is a likely culprit.
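One way to quantify "does the leaderboard order predict my eval order" is a rank correlation. The sketch below computes Spearman's rho in plain Python, assuming no tied scores. The model names and scores are hypothetical.

```python
def ranks(scores):
    """Map each model to its rank (1 = best) by descending score. Assumes no ties."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: position + 1 for position, model in enumerate(ordered)}


def spearman(scores_a, scores_b):
    """Spearman rank correlation between two score dicts over the same models."""
    if set(scores_a) != set(scores_b):
        raise ValueError("both score sets must cover the same models")
    n = len(scores_a)
    if n < 2:
        raise ValueError("need at least two models to compare rankings")
    rank_a, rank_b = ranks(scores_a), ranks(scores_b)
    squared_diffs = sum((rank_a[m] - rank_b[m]) ** 2 for m in rank_a)
    return 1 - 6 * squared_diffs / (n * (n * n - 1))


# Hypothetical scores: public leaderboard vs. a private corpus.
leaderboard = {"m1": 70.0, "m2": 68.0, "m3": 65.0, "m4": 60.0}
private_eval = {"m1": 52.0, "m2": 61.0, "m3": 58.0, "m4": 55.0}

rho = spearman(leaderboard, private_eval)
print(f"Spearman rho between leaderboard and private eval: {rho:.2f}")
```

A rho near 1 means the leaderboard order carried over to your data. A value near zero or below, as in this made-up example where the leaderboard leader finishes last privately, means it did not, and contamination is one candidate explanation.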

Does MTEB cover rerankers?

There's a small reranking subset, but MTEB is primarily an embedder benchmark: it scores models that produce vectors. Reranker evaluation typically uses separate benchmarks (BEIR, the 29-dataset reranker suite) where the model scores (query, doc) pairs end-to-end.
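The difference between the two model types shows up in their interfaces. The toy sketch below stands in for both: `embed` maps a single text to a vector without seeing the query, while `rerank_score` takes the query and document together. Both functions are hypothetical word-overlap placeholders, not real models.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedder: bag-of-words counts standing in for a dense vector."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[term] * v[term] for term in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rerank_score(query, doc):
    """Toy reranker: scores the (query, doc) pair jointly.

    Here the score is just the fraction of query terms found in the doc.
    """
    query_terms = set(query.lower().split())
    doc_terms = set(doc.lower().split())
    return len(query_terms & doc_terms) / len(query_terms) if query_terms else 0.0


docs = [
    "mteb is an embedding benchmark",
    "rerankers score query document pairs",
]
query = "embedding benchmark"

# Embedder flow: document vectors are computed once, independent of any query.
doc_vectors = [embed(d) for d in docs]
embedder_scores = [cosine(embed(query), v) for v in doc_vectors]

# Reranker flow: one model call per (query, doc) pair, nothing precomputed.
reranker_scores = [rerank_score(query, d) for d in docs]

print("embedder:", [round(s, 3) for s in embedder_scores])
print("reranker:", reranker_scores)
```

A benchmark built around the first flow only needs vectors from the model, which is why reranker evaluation usually lives in a separate harness that feeds pairs to the model directly.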
