Eval Set Quality

Also known as: benchmark psychometrics, eval-set diagnostics, benchmark health diagnostic

TL;DR

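A practical diagnostic checklist for “is this benchmark actually any good?” Layer four measurement-theory tools: classical test theory (CTT) for per-item pathologies, Cronbach’s α for aggregate reliability, Cohen’s κ for label noise, and within-family Cohen’s $d$ for discriminative power.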

Eval set quality is the degree to which a benchmark can resolve real differences between models. A benchmark on which sibling architectures at several parameter counts (e.g., 0.6B, 1.7B, 4B, 32B) all score within noise of each other has failed to resolve the scaling curve for that family, irrespective of which metric is reported. The diagnostic tools for benchmark quality (item difficulty and discrimination, internal consistency, inter-rater agreement, paired effect size) were developed in classical psychometrics through the early-to-mid 20th century (Spearman 1904, Cronbach 1951, Cohen 1960) and apply without modification to ML benchmarks.

Six axes characterize where an eval set can fail:

  1. Item-level pathologies. Items at ceiling, floor, or with weak item-by-total correlation. Diagnosed by classical test theory (CTT) item statistics.
  2. Aggregate unreliability. Items do not internally agree on what they measure. Diagnosed by Cronbach’s α.
  3. Label noise. Gold labels are inconsistent across annotators. Diagnosed by Cohen’s κ.
  4. Insufficient discriminative power. The benchmark does not separate known-different models. Diagnosed by within-family Cohen’s $d$.
  5. Pool incompleteness. For retrieval evals, qrels omit relevant documents not retrieved by any system in the original pool. Mitigated by bpref or condensed metrics.
  6. Contamination. Test data has leaked into the pretraining corpus. Detected by n-gram or perplexity heuristics; not fully eliminable.

Item-level pathologies (CTT)

For each item in the benchmark, two statistics summarize its behavior:

  • Item difficulty $p$: the fraction (or mean grade) of test-takers who answer correctly.
  • Item discrimination $r_{pb}$: the point-biserial correlation between the item score and the test-taker’s total score.

The categories of pathological items, with conventional action thresholds:

CTT red flags
  • Ceiling items. Difficulty $p$ near 1. Every model answers correctly; the item carries no signal between top-tier systems.
  • Floor items. Difficulty $p$ near 0. No model answers correctly; either the item is beyond the frontier or it is mislabeled.
  • Noise items. Discrimination $r_{pb}$ near 0. The item correlates weakly with the rest of the test and likely measures a different construct.
  • Backwards items. Discrimination $r_{pb} < 0$. Stronger test-takers perform worse, which almost always indicates a labeling error or a contaminated answer key.

A benchmark with more than 30% of items in any of these categories requires triage: drop or re-annotate the affected items and recompute everything downstream (α, model rankings, leaderboard order). Rankings often change visibly after the bad items are removed.
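
The per-item statistics above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names and the default cutoffs in `flag_items` are hypothetical placeholders, and discrimination is computed against the rest-score (the total with the item itself left out), a common variant of the item-total correlation that avoids an item correlating with itself.

```python
import numpy as np

def ctt_item_stats(scores):
    """Per-item difficulty and discrimination.

    scores: array of shape (n_models, n_items), 0/1 or graded item scores.
    """
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]
    difficulty = scores.mean(axis=0)          # fraction (or mean grade) correct
    total = scores.sum(axis=1)                # each model's total score
    discrimination = np.full(n_items, np.nan)
    for j in range(n_items):
        item = scores[:, j]
        rest = total - item                   # total with this item left out
        # Correlation is undefined when either side is constant (e.g. ceiling items).
        if item.std() > 0 and rest.std() > 0:
            discrimination[j] = np.corrcoef(item, rest)[0, 1]
    return difficulty, discrimination

def flag_items(difficulty, discrimination, ceiling=0.95, floor=0.05, weak=0.1):
    """Boolean masks per category. The default cutoffs are placeholders, not
    values taken from the text; set them to your own action thresholds."""
    return {
        "ceiling": difficulty >= ceiling,
        "floor": difficulty <= floor,
        "backwards": discrimination < 0,
        "noise": (discrimination >= 0) & (discrimination < weak),
    }
```

Items whose discrimination comes back as NaN are constant across the panel, so they are already caught by the ceiling or floor masks.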

Aggregate reliability (Cronbach’s α)

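After CTT-based pruning, internal consistency of the trimmed test is measured by Cronbach’s α:

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma_i^2}{\sigma_X^2}\right)$$

where $k$ is the number of items, $\sigma_i^2$ is the variance of item $i$ across test-takers, and $\sigma_X^2$ is the variance of the total score.

Conventional thresholds: α ≥ 0.9 excellent, α ≥ 0.8 good, α ≥ 0.7 acceptable, α < 0.7 problematic. α should be reported alongside $k$ (the number of items), because by the Spearman-Brown relation α increases with test length even when each added item carries only a trace of signal; comparing α across benchmarks of different sizes without $k$ is unsound.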

α also assumes tau-equivalence: every item measures the same underlying construct. Multi-domain benchmarks (for example, suites whose subsets span math, biology, code, and general knowledge) violate this assumption directly. The corresponding correction is to compute α within each subdomain and report the per-subdomain distribution. A single pooled α on a heterogeneous suite is a length artifact rather than a reliability measure.
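
As a rough sketch of the pooled and per-subdomain computation, the standard Cronbach’s α formula can be written directly in NumPy. The function names and the `domains` label array are assumptions made for illustration; each subdomain needs at least two items and some variance in total score for α to be defined.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a (n_models, n_items) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                                # number of items
    item_var_sum = scores.var(axis=0, ddof=1).sum()    # sum of per-item variances
    total_var = scores.sum(axis=1).var(ddof=1)         # variance of total scores
    return k / (k - 1) * (1.0 - item_var_sum / total_var)

def alpha_by_subdomain(scores, domains):
    """Alpha within each subdomain, reported with its item count k.

    domains: one label per item (hypothetical input, same order as columns).
    """
    scores = np.asarray(scores, dtype=float)
    domains = np.asarray(domains)
    result = {}
    for name in np.unique(domains):
        block = scores[:, domains == name]
        result[str(name)] = (cronbach_alpha(block), block.shape[1])
    return result
```

Returning the item count next to each α keeps the length of every subdomain visible when the values are compared.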

The Spearman-Brown prediction formula states that lengthening a test by a factor $n$ raises α to $\frac{n\alpha}{1 + (n-1)\alpha}$. In practice, this means a test of 200 items with each item contributing tiny but non-zero true variance can reach a high pooled α even if any single item is nearly worthless. The corollary is that $k$ must always be reported, and the strict comparison across benchmarks is per-item α, typically estimated by the inverse of Spearman-Brown applied to the pooled α and $k$.

The other practical guard is to recompute α after removing the bottom quintile of items. If α stays roughly constant after removing the weakest items, the inflation hypothesis is supported and the original headline α was carrying noise.
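
The length effect is easy to check numerically. The sketch below implements the Spearman-Brown prediction and its inverse; the function names are illustrative. With a per-item reliability of only 0.05, a 200-item test is predicted to reach a pooled α of about 0.91.

```python
def spearman_brown(alpha, n):
    """Predicted alpha after lengthening a test by a factor n."""
    return n * alpha / (1.0 + (n - 1.0) * alpha)

def per_item_alpha(pooled_alpha, k):
    """Invert Spearman-Brown: implied single-item reliability for a k-item test."""
    return pooled_alpha / (k - (k - 1.0) * pooled_alpha)

# 200 nearly worthless items still produce a high pooled alpha.
pooled = spearman_brown(0.05, 200)
implied = per_item_alpha(pooled, 200)
```

Comparing `per_item_alpha` across benchmarks removes the advantage that a longer test gets purely from its length.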

Label noise (Cohen’s κ)

A test with internally-consistent items still fails if its gold labels are inconsistent across annotators. The standard inter-rater agreement statistic is Cohen’s κ:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where $p_o$ is observed agreement and $p_e$ is agreement expected by chance. For ordinal labels (e.g. graded relevance on a 0-3 scale), the appropriate variant is weighted κ with quadratic weights, which correctly penalizes large disagreements more than small ones.

The conventional thresholds: low κ indicates label noise dominates; κ > 0.6 is the floor for a usable benchmark; values near the top of the scale are considered gold-standard. LLM judges with chain-of-thought before grading, anchor examples, and forced output schema typically reach κ on retrieval relevance comparable to reported human-human ceilings in the same task.
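
A minimal sketch of unweighted and quadratic-weighted κ for two raters follows. It assumes labels are integers in `0..n_categories-1`; the function name and signature are illustrative rather than taken from any particular library.

```python
import numpy as np

def cohen_kappa(rater_a, rater_b, n_categories, weights=None):
    """Cohen's kappa for two raters; weights="quadratic" for ordinal labels."""
    a = np.asarray(rater_a, dtype=int)
    b = np.asarray(rater_b, dtype=int)
    observed = np.zeros((n_categories, n_categories))
    for i, j in zip(a, b):
        observed[i, j] += 1
    observed /= observed.sum()                      # joint label proportions
    # Chance agreement from the product of each rater's marginals.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    idx = np.arange(n_categories)
    if weights == "quadratic":
        penalty = (idx[:, None] - idx[None, :]) ** 2
    else:
        penalty = (idx[:, None] != idx[None, :]).astype(float)
    # 1 - (weighted observed disagreement) / (weighted chance disagreement);
    # with 0/1 penalties this equals (p_o - p_e) / (1 - p_e).
    return 1.0 - (observed * penalty).sum() / (expected * penalty).sum()
```

With quadratic weights, two raters who differ by one grade on a 0-3 scale are penalized far less than raters who differ by three grades, which is the behavior wanted for graded relevance.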

Discriminative power (within-family Cohen’s $d$)

Statistical significance becomes trivial at large $N$: with thousands of items, almost any non-zero metric difference clears the $p < 0.05$ significance threshold. The relevant question for benchmark quality is therefore whether the benchmark resolves models that are independently known to differ.

The standard experiment is the sibling-family discrimination test. Given a model family at multiple parameter counts (for example 0.6B / 1.7B / 4B / 32B of one architecture), compute paired Cohen’s $d$ between adjacent siblings on the benchmark:

$$d = \frac{\bar{\Delta}}{s_{\Delta}}$$

where $\bar{\Delta}$ is the mean per-query (or per-item) metric difference and $s_{\Delta}$ is its standard deviation. The conventional thresholds, following Cohen (1988):

Within-family thresholds
  • $d \geq 0.8$: the benchmark resolves the scaling curve clearly.
  • $0.5 \leq d < 0.8$: siblings are distinguished but not crisply.
  • $0.2 \leq d < 0.5$: the signal is barely above per-query noise.
  • $d < 0.2$: adjacent siblings are not distinguishable.

Within-family Cohen’s $d$ is the most diagnostic single statistic for benchmark quality. If sibling architectures at parameter counts spanning two orders of magnitude are separated only by negligible $d$, the benchmark does not resolve the scaling curve. A different benchmark on the same models that shows clear separation is, by this criterion, the better measurement.
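
The sibling-family test reduces to a paired effect size over per-item scores. The sketch below assumes each model’s per-item (or per-query) metric is available as an array aligned to the same items; the function names and the dictionary layout are hypothetical.

```python
import numpy as np

def paired_cohens_d(larger_model, smaller_model):
    """Mean per-item difference divided by the std of the differences."""
    diff = np.asarray(larger_model, dtype=float) - np.asarray(smaller_model, dtype=float)
    return diff.mean() / diff.std(ddof=1)

def adjacent_sibling_d(per_item_scores):
    """Paired d between each adjacent pair in a family.

    per_item_scores: dict mapping model name to its per-item metric array,
    inserted in order from smallest to largest sibling.
    """
    names = list(per_item_scores)
    return {
        (small, large): paired_cohens_d(per_item_scores[large], per_item_scores[small])
        for small, large in zip(names, names[1:])
    }
```

Because the differences are taken item by item, shared item difficulty cancels out and only the per-item gap between the two siblings contributes to $d$.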

Pool quality (retrieval-specific)

For retrieval benchmarks such as TREC, gold relevance labels are produced by pooling: a fixed set of contributor systems is run on each query, the union of their top-$k$ candidates is judged, and any document not retrieved by any contributor is implicitly labeled non-relevant. The procedure is sound when the pool is large and diverse enough that essentially all relevant documents appear in it.

Two failure modes arise. First, when a new system retrieves relevant documents not present in the original pool, those documents default to non-relevant and the new system is penalized for finding genuinely relevant material. Second, when the pool itself is small — older TREC tracks pooled hundreds of contributors, whereas modern benchmarks often pool five to ten — the share of unjudged-but-relevant documents grows with the number of systems being compared.

The standard mitigation is bpref (binary preference), which scores only judged documents (counting the fraction of judged-relevant documents ranked above judged-non-relevant ones) and ignores unjudged documents in both the numerator and denominator. NDCG computed on a condensed list — unjudged documents removed before scoring — is a closely related approach.
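
A minimal sketch of bpref for a single query, following the commonly used trec_eval-style definition as I understand it: each retrieved judged-relevant document is credited according to how many judged-non-relevant documents rank above it, and unjudged documents are skipped entirely. The function name and input layout are assumptions for illustration.

```python
def bpref(ranking, qrels):
    """bpref for one query.

    ranking: list of document ids, best first.
    qrels: dict mapping judged document ids to 0 (non-relevant) or >0 (relevant);
           documents missing from qrels are treated as unjudged.
    """
    num_rel = sum(1 for rel in qrels.values() if rel > 0)
    num_nonrel = sum(1 for rel in qrels.values() if rel == 0)
    if num_rel == 0:
        return 0.0
    nonrel_above = 0
    score = 0.0
    for doc in ranking:
        if doc not in qrels:
            continue                      # unjudged: neither rewarded nor penalized
        if qrels[doc] > 0:
            if nonrel_above == 0:
                score += 1.0
            else:
                score += 1.0 - min(nonrel_above, num_rel) / min(num_rel, num_nonrel)
        else:
            nonrel_above += 1
    return score / num_rel
```

Inserting an unjudged document anywhere in the ranking leaves the score unchanged, which is the property that protects new systems from being penalized for retrieving material outside the original pool.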

Empirical work on TREC ad-hoc tracks established that pools of 100+ contributors with judging depth of 100 documents per query produce qrels that are stable under removing any single contributor. Modern benchmarks rarely satisfy this. For BEIR and similar retrieval suites assembled in the embedding era, the practical mitigation is to re-judge the union of the top-$k$ from a fresh set of modern systems (typically by LLM judge with a graded relevance rubric), rather than relying on the original binary qrels. The cost is non-trivial (hundreds of thousands of judgments at retrieval depth), but it removes the dominant source of bias in absolute metric comparisons across model generations.

Contamination

If test data leaked into the pretraining corpus, evaluating that test measures the model’s memory rather than its capability. Contamination is the only axis on this list that admits no clean statistical test; it is mitigated by operational discipline.

Contamination heuristics
  • N-gram overlap. Verbatim 13-gram matches between test items and the pretraining corpus. Applicable only to models whose training data is public.
  • Perplexity gap. Contaminated items have anomalously low perplexity under the contaminated model relative to matched-domain held-out items.
  • Date-stratified evaluation. Scoring on items released after the model’s training cutoff vs. before. A large pre-cutoff advantage suggests contamination.
  • Private held-out sets. A benchmark the model demonstrably has not seen — the strongest signal, requiring deliberate withholding before any model release.
  • Canary strings. Distinctive strings deliberately placed in test data; if the model can complete them, the data leaked.

A clean evaluation on a privately-curated held-out corpus is more informative than any number of ablations on public benchmarks.
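
The n-gram overlap heuristic can be sketched as follows. Whitespace tokenization and lowercasing are simplifying assumptions, the function names are illustrative, and a real pretraining corpus would need a streaming or hashed index rather than an in-memory set.

```python
def ngrams(text, n=13):
    """Set of word n-grams after lowercasing and whitespace tokenization."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated_items(test_items, corpus_docs, n=13):
    """Indices of test items sharing at least one verbatim n-gram with the corpus."""
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return [i for i, item in enumerate(test_items) if ngrams(item, n) & corpus_grams]
```

Items shorter than `n` tokens produce no n-grams and are never flagged, so short-answer benchmarks need a smaller window or a different heuristic from the list above.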

Layering across MTEB and BEIR

The diagnostics above are not uniform across the constituents of a large benchmark suite. Within MTEB’s 50+ datasets, some subsets (e.g. ArguAna, TREC-COVID) satisfy every axis: well-spread $p$, healthy α, sufficient κ, and clear within-family $d$. Others fail several at once: certain FiQA and CQADupStack subsets are concentrated near ceiling for frontier embedders (negligible $d$ between adjacent sibling sizes), and several Touché-2020 splits have low original-annotation κ with metric differences that fail to survive a paired bootstrap.

The implication is that a single suite-level composite (mean across all subsets) loses signal. The corresponding correction is to weight by per-dataset reliability and to report the unweighted variant separately, with the per-axis diagnostic numbers attached. The same pattern applies to other large retrieval suites.

A reproducible procedure that produces all six axes’ statistics:

  1. Score a model panel. At least 20 models spanning the ability range — small to flagship, multiple providers, multiple architectures. CTT diagnostics need variance in test-taker ability to be informative.
  2. CTT per item. Compute difficulty $p$ and discrimination $r_{pb}$ for every item. Flag ceiling, floor, noise, and backwards categories.
  3. Trim and compute α. On the trimmed test, compute Cronbach’s α whole-suite and per subdomain. Report the item count $k$ alongside.
  4. Label-noise sample. On a sample of items, have at least two independent annotators (or judge passes) label. Compute weighted κ.
  5. Within-family $d$. Pick a sibling lineup (e.g. 0.6B / 1.7B / 4B / 32B of one architecture). Compute paired Cohen’s $d$ between every adjacent pair.
  6. Pool and contamination spot-check. For retrieval, re-judge the top-$k$ of two or three modern systems not in the original pool. For all benchmarks, hold out a private validation set.

The benchmark passes the bar if every axis clears: α > 0.7 (whole-suite and 80%+ of subdomains), fewer than 20% pathological items per CTT category, κ > 0.6 (weighted κ > 0.7 on graded labels), a within-family $d$ that clearly separates adjacent siblings, bpref reported for retrieval suites, and a private held-out set certifying no contamination.

Re-annotating a binary-relevance retrieval suite with graded labels (typically 0-3 via a multi-judge LLM pipeline) and then re-pruning by CTT typically produces the following changes across the six axes:

  • Item-level (CTT). Ceiling-item fraction decreases — graded labels distinguish items that binary labels conflated as “all correct.”
  • Cronbach’s α. Increases on most subsets, since graded labels carry more variance per item than binary labels.
  • Inter-rater κ. Requires CoT-before-grade in the judge prompt to clear 0.7; otherwise judges agree on the easy cases and disagree at the boundaries.
  • Within-family $d$. Widens substantially. On retrieval suites that have been regraded, sibling-model $d$ values widen by roughly 3-5x across most subsets; previously-indistinguishable models become clearly ordered.
  • Pool quality. Improves to the extent that the regrading pipeline re-pools across modern systems rather than re-using the original qrels.
  • Contamination. Orthogonal — not addressed by labeling, requires held-out data.

The widening of within-family $d$ is the formal statistical signature of “the test got better”: same items, same models, sharper measurement, larger effect sizes.

Go further

What's the single most diagnostic stat for “is this benchmark dead?”

Within-family Cohen's $d$. Take a sibling lineup, the same model architecture at multiple parameter counts (0.6B, 1.7B, 4B, 32B), and compute $d$ between adjacent siblings on the benchmark. If $d$ is negligible between known-different model sizes, the benchmark cannot resolve the scaling curve. That's the signature of a saturated, noisy, or off-target eval. The dataset where $d$ is largest at this comparison is the most discriminating dataset in your suite.

How do I run this checklist on an existing public benchmark?

Five steps. (1) Score models spanning the ability range. (2) Compute per-item difficulty $p$ and discrimination $r_{pb}$; flag items at the extremes. (3) Compute aggregate Cronbach's α, and per-subdomain α if multi-domain. (4) If you have multi-judge labels, compute κ between judges. (5) Take the closest sibling pairs in your model set and compute paired Cohen's $d$. The benchmark passes the bar if α > 0.7, < 20% of items at ceiling/floor, < 20% with weak or negative $r_{pb}$, κ > 0.6, and a within-family $d$ that clearly separates adjacent siblings.

What does fixing an eval actually look like?

Re-annotate with graded labels via a multi-judge LLM pipeline, drop items with extreme $p$ or weak $r_{pb}$, re-score the same model lineup, and re-measure within-family $d$. The diagnostic that the fix worked is $d$ widening: sibling models that previously scored within noise now show clear separation. The ZE MTEB redo did exactly this, and the within-family $d$ values widened by ~3-5x across most subsets. That's the formal version of “the test got better.”
