Graded Relevance LLM Judge

Also known as: graded relevance, graded relevance evaluation, LLM-graded relevance

TL;DR

An LLM-as-judge configured to emit graded relevance — typically a 0-3 scale (irrelevant / marginal / relevant / highly relevant) rather than a binary yes/no.

A graded relevance LLM judge is an LLM-as-judge configured to emit a graded score per (query, document) pair instead of a binary relevant/not-relevant verdict. The standard scale is 0-3:

0 — Irrelevant. Document does not address the query in any meaningful way.
1 — Marginal. Document is on-topic but does not answer the query.
2 — Relevant. Document partially answers the query or provides directly useful context.
3 — Highly relevant. Document fully and directly answers the query.

This rubric is borrowed from the TREC graded-relevance tradition and operationalized via a frontier LLM (Claude, GPT, Gemini) prompted with the rubric, anchor examples, and a forced-format output. Run it across a benchmark and you get graded labels at human-comparable agreement, in hours instead of months.

The binary-relevance ceiling

Most retrieval benchmarks ship with binary labels: a (query, document) pair is either in the qrels file or it is not. MTEB’s retrieval portion is binary. Most BEIR subsets are binary. Even MS MARCO’s training data is overwhelmingly binary.

Binary labels reward “did you find a relevant doc” but cannot see the difference between “the most relevant doc is at position 1” and “the most relevant doc is at position 9, with eight unrelated docs above it.” When that doc is the only relevant one for the query, both rankings score 1.0 on recall@10 and 1.0 on hit@10. Two models with very different production behavior produce the same number.
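To make the tie concrete, here is a minimal sketch that assumes a single relevant document per query. The document IDs and the function names `recall_at_k` and `hit_at_k` are illustrative, not taken from any particular library.

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top k."""
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def hit_at_k(ranked_ids, relevant_ids, k=10):
    """1.0 if any relevant document appears in the top k, else 0.0."""
    return 1.0 if set(ranked_ids[:k]) & set(relevant_ids) else 0.0


relevant = {"doc_rel"}  # assumption: exactly one relevant doc for this query

# Model A puts the relevant doc at position 1; model B buries it at position 9.
model_a = ["doc_rel"] + [f"noise_{i}" for i in range(9)]
model_b = [f"noise_{i}" for i in range(8)] + ["doc_rel", "noise_8"]

for name, ranking in [("A", model_a), ("B", model_b)]:
    print(name, recall_at_k(ranking, relevant), hit_at_k(ranking, relevant))
```

Both models print 1.0 for both metrics, even though a user would have to scroll past eight unrelated results to reach the answer from model B.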

Binary relevance is a one-bit measurement of an ordering. Graded relevance is a two-bit measurement of the same ordering — and the extra bit is exactly where production retrieval quality lives.

The ceiling is not just theoretical. As models converge near the top of binary leaderboards, the per-model differences shrink to noise, and the leaderboard becomes uninformative. Every additional point of recall@10 is harder to win and means less in production. Graded labels reset the headroom by giving the metric something finer-grained to measure.

What graded relevance unlocks

The single most important consequence: graded labels make NDCG@K meaningfully different from recall@K. Under binary labels, NDCG@K and recall@K are highly correlated — both reward “find a relevant doc, place it high.” Under graded labels, NDCG@K rewards ordering by grade: a model that places a 3 above a 2 above a 1 scores higher than a model that retrieves the same three docs but orders them 2-1-3.
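The 3-2-1 versus 2-1-3 comparison can be sketched as follows. This assumes the common exponential gain `2**grade - 1` with a log2 position discount, and it computes the ideal ordering from the retrieved documents only, which is valid here because both models return the same three documents. The function names are illustrative.

```python
import math


def dcg_at_k(grades, k=10):
    """Discounted cumulative gain over grades listed in rank order."""
    return sum((2 ** g - 1) / math.log2(rank + 2) for rank, g in enumerate(grades[:k]))


def ndcg_at_k(grades, k=10):
    """DCG divided by the DCG of the best possible ordering of the same grades."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0


well_ordered = [3, 2, 1]   # grades of the retrieved docs, in rank order
badly_ordered = [2, 1, 3]  # same three docs, worse order

print("graded NDCG:", ndcg_at_k(well_ordered), ndcg_at_k(badly_ordered))

# Binarize (grade >= 1 counts as relevant) and the two orderings become identical.
binarize = lambda grades: [1 if g >= 1 else 0 for g in grades]
print("binary NDCG:", ndcg_at_k(binarize(well_ordered)), ndcg_at_k(binarize(badly_ordered)))
```

Under graded labels the well-ordered list scores 1.0 and the 2-1-3 list scores about 0.76. After binarization both lists score 1.0, so the ordering difference disappears.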

What graded relevance lets the eval see

Calibration quality. A well-calibrated model orders by relevance gradient; a leaderboard-tuned model orders by retrieval-trick. Graded labels separate them.
Sibling-model deltas. Small / mid / large in the same provider’s lineup look near-identical under binary labels and visibly different under graded labels. The big model is bigger for a reason; binary just couldn’t see it.
Reranker headroom. A reranker that re-orders a candidate set without changing its membership is invisible to binary recall but plainly visible to graded NDCG.
Per-query difficulty. A query whose top-10 contains one 3, three 2s, and six 0s is a different evaluation case from one with ten 1s. Binary labels cannot distinguish them.
Diagnostic regression detection. A model update that swaps a 3 at position 1 for a 2 at position 1 is a real regression that binary metrics will not flag.
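The regression case in the last item can be sketched as a per-query comparison between two model versions. This is an illustrative sketch, not a description of any existing tool: the `min_drop` threshold, the query IDs, and the grade lists are assumptions, and the ideal DCG is computed from all judged grades for the query so that losing a grade-3 document entirely still counts as a loss.

```python
import math


def dcg_at_k(grades, k=10):
    return sum((2 ** g - 1) / math.log2(rank + 2) for rank, g in enumerate(grades[:k]))


def ndcg_at_k(ranked_grades, judged_grades, k=10):
    """Normalize by the best ordering of every judged doc for the query."""
    ideal = dcg_at_k(sorted(judged_grades, reverse=True), k)
    return dcg_at_k(ranked_grades, k) / ideal if ideal > 0 else 0.0


def binary_hit(ranked_grades, k=10):
    """Binary view: did any doc with grade >= 1 make the top k?"""
    return any(g >= 1 for g in ranked_grades[:k])


def find_regressions(before, after, judged, k=10, min_drop=0.05):
    """Return (query_id, ndcg_drop) for queries whose graded NDCG fell."""
    flagged = []
    for qid in sorted(before.keys() & after.keys()):
        drop = ndcg_at_k(before[qid], judged[qid], k) - ndcg_at_k(after[qid], judged[qid], k)
        if drop >= min_drop:
            flagged.append((qid, drop))
    return flagged


judged = {"q1": [3, 2, 2, 0, 0], "q2": [3, 1, 0]}   # all judged grades per query
before = {"q1": [3, 2, 0], "q2": [3, 1, 0]}          # grades in rank order, old model
after = {"q1": [2, 2, 0], "q2": [3, 1, 0]}           # new model lost the 3 at position 1

regressions = find_regressions(before, after, judged)
print(regressions)
print(binary_hit(before["q1"]), binary_hit(after["q1"]))
```

Query `q1` is flagged by the graded comparison, while the binary hit check returns the same value for both versions.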

How to build the judge (rubric, CoT, calibration)

Four properties separate a graded-relevance judge that produces stable evaluations from one that does not:

Rubric design. Each grade has an explicit definition and a worked example. “0: irrelevant; 1: marginal; 2: relevant; 3: highly relevant” is the skeleton; the body is a per-grade anchor passage from a domain similar to the eval set. Without anchors, judges drift toward the middle of the scale.

Chain-of-thought before the grade. The prompt asks the judge to write a 1-3 sentence rationale enumerating which facets of the query the document addresses, then emit the integer grade. This is not cosmetic. Empirically, CoT-before-grade lifts inter-judge Cohen’s kappa from ~0.55 to ~0.75 on retrieval relevance — into the range of human-human agreement on the same task. The rationale forces the judge to do the reasoning that a 2-vs-3 distinction requires.

Forced output format. JSON with rationale and grade fields, parsed with a strict schema. Free-form output drops 5-15% of judgments to parse failures and silently biases the eval toward whichever queries had parseable judge outputs.
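A strict parser for the two-field format might look like the sketch below. The field names `rationale` and `grade` come from the text; the function names and the choice to collect failures by pair ID are assumptions, made so that parse failures are counted rather than silently dropped.

```python
import json

ALLOWED_GRADES = {0, 1, 2, 3}


def parse_judgment(raw):
    """Parse one judge output, raising ValueError on any schema violation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or set(data) != {"rationale", "grade"}:
        raise ValueError("expected an object with exactly 'rationale' and 'grade'")
    rationale, grade = data["rationale"], data["grade"]
    if not isinstance(rationale, str) or not rationale.strip():
        raise ValueError("rationale must be a non-empty string")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(grade, bool) or not isinstance(grade, int) or grade not in ALLOWED_GRADES:
        raise ValueError("grade must be an integer in 0-3")
    return {"rationale": rationale.strip(), "grade": grade}


def parse_all(raw_outputs):
    """Split outputs into parsed judgments and recorded failures."""
    parsed, failed = {}, {}
    for pair_id, raw in raw_outputs.items():
        try:
            parsed[pair_id] = parse_judgment(raw)
        except ValueError as exc:
            failed[pair_id] = str(exc)
    return parsed, failed


outputs = {
    "q1-d1": '{"rationale": "Covers both facets of the query.", "grade": 3}',
    "q1-d2": "Grade: 2, the document is mostly relevant.",
    "q1-d3": '{"rationale": "On-topic only.", "grade": 5}',
}
parsed, failed = parse_all(outputs)
print(len(parsed), "parsed;", len(failed), "failed")
```

Keeping the `failed` map makes it possible to retry those pairs or report the failure rate alongside the metric, instead of scoring only the queries whose outputs happened to parse.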

Multi-judge ensembling. Three frontier LLMs (Claude + GPT + Gemini) judging the same pair, with per-rater calibration via Thurstone or Beta fits, is meaningfully more robust than any single judge. Disagreements are diagnostic — pairs where the three judges split 0/2/3 are usually genuinely ambiguous and worth removing from the eval set.
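A simplified version of the disagreement check is sketched below. It deliberately skips the per-rater calibration step (the Thurstone or Beta fits mentioned above) and uses a plain median plus a spread threshold; the judge names, the `max_spread` value, and the function name are assumptions for illustration.

```python
from statistics import median


def aggregate_judges(grades_by_judge, max_spread=1):
    """Combine per-judge grades for one (query, document) pair.

    Returns the median grade, the max-min spread, and an 'ambiguous' flag
    for pairs where the judges disagree by more than max_spread.
    """
    grades = list(grades_by_judge.values())
    spread = max(grades) - min(grades)
    return {
        "grade": median(grades),
        "spread": spread,
        "ambiguous": spread > max_spread,
    }


pairs = {
    "q1-d1": {"judge_a": 3, "judge_b": 3, "judge_c": 2},  # mild disagreement
    "q1-d2": {"judge_a": 0, "judge_b": 2, "judge_c": 3},  # the 0/2/3 split
}

for pair_id, grades in pairs.items():
    print(pair_id, aggregate_judges(grades))

# Candidate pairs to review or drop from the eval set.
ambiguous = [pid for pid, grades in pairs.items() if aggregate_judges(grades)["ambiguous"]]
```

With three judges the median is always one of the emitted grades, so the aggregate stays on the 0-3 scale.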

The judge prompt has six sections, in this order:

Task framing — “You are evaluating whether a document is relevant to a search query.”
Grade definitions with anchor examples — each of 0, 1, 2, 3 pinned to a worked example from a similar domain. “Highly relevant (3): the document directly answers the query in full; example: query ‘how does Adam optimizer work,’ document is the original Adam paper abstract.”
The query and document — clearly delimited, with the document truncated to a fixed token budget so judges score comparable amounts of text.
Reasoning instructions — “First, list which facets of the query are addressed by the document. Then list which facets are not. Then assign a grade.”
Forced output format — explicit JSON schema with facets_covered, facets_missing, rationale, grade.
Sanity rule — “If the document is on-topic but does not answer the query, the grade is at most 1.” This single rule prevents the most common failure mode: judges grading topical-similarity instead of relevance.
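The six sections above can be assembled with a small builder like the one sketched here. The grade definitions and quoted instructions are taken from the text; the `<query>` and `<document>` delimiters, the function name, and the anchor examples for grades 0 to 2 are illustrative assumptions (only the grade-3 anchor appears in the text). The document is assumed to be truncated to the token budget before it is passed in.

```python
GRADE_DEFINITIONS = {
    0: "Irrelevant. Document does not address the query in any meaningful way.",
    1: "Marginal. Document is on-topic but does not answer the query.",
    2: "Relevant. Document partially answers the query or provides directly useful context.",
    3: "Highly relevant. Document fully and directly answers the query.",
}


def build_judge_prompt(query, document, anchors):
    """Assemble the six prompt sections in order; anchors maps grade -> example."""
    grade_lines = [
        f"{grade}: {GRADE_DEFINITIONS[grade]} Example: {anchors[grade]}"
        for grade in sorted(GRADE_DEFINITIONS)
    ]
    sections = [
        # 1. Task framing
        "You are evaluating whether a document is relevant to a search query.",
        # 2. Grade definitions with anchor examples
        "Grades:\n" + "\n".join(grade_lines),
        # 3. The query and document, clearly delimited
        f"<query>\n{query}\n</query>\n<document>\n{document}\n</document>",
        # 4. Reasoning instructions
        "First, list which facets of the query are addressed by the document. "
        "Then list which facets are not. Then assign a grade.",
        # 5. Forced output format
        "Respond with JSON only, using exactly these keys: "
        '{"facets_covered": [...], "facets_missing": [...], "rationale": "...", "grade": 0}',
        # 6. Sanity rule
        "If the document is on-topic but does not answer the query, the grade is at most 1.",
    ]
    return "\n\n".join(sections)


# Illustrative anchors for the query 'how does Adam optimizer work'.
example_anchors = {
    0: "document is a sourdough bread recipe.",
    1: "document lists optimizers by name without explaining Adam.",
    2: "document explains Adam's momentum term but omits bias correction.",
    3: "document is the original Adam paper abstract.",
}

prompt = build_judge_prompt("how does Adam optimizer work", "Adam combines ...", example_anchors)
print(prompt)
```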

Inter-judge kappa under this template lands at 0.7-0.8 across Claude / GPT / Gemini on standard retrieval domains. Without the sanity rule and the facet enumeration step it drops to 0.5-0.6, which is too low to be diagnostic.
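For reference, pairwise Cohen's kappa between two judges can be computed as in the sketch below. This is the plain unweighted form; the text does not say whether the reported figures use a weighted variant, which is often preferred for ordinal scales. The grade lists are made-up illustration data.

```python
from collections import Counter


def cohens_kappa(grades_a, grades_b):
    """Unweighted Cohen's kappa for two raters over the same items."""
    if len(grades_a) != len(grades_b) or not grades_a:
        raise ValueError("need two non-empty grade lists of equal length")
    n = len(grades_a)
    # Observed agreement: fraction of items with identical grades.
    observed = sum(a == b for a, b in zip(grades_a, grades_b)) / n
    # Chance agreement: product of each rater's marginal frequencies per label.
    counts_a, counts_b = Counter(grades_a), Counter(grades_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


judge_a = [0, 1, 2, 3, 3, 2, 1, 0]
judge_b = [0, 1, 2, 3, 2, 2, 1, 1]
print(round(cohens_kappa(judge_a, judge_b), 3))  # 6 of 8 agree -> about 0.667
```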

What we found when we re-ran MTEB this way

ZeroEntropy reannotated 28 retrieval datasets from MTEB using exactly this methodology — three independent LLM judges, 0-3 scale, CoT before grade, per-rater calibration. The result was diagnostic in three specific ways:

Models everyone assumed were great rose to the top. The reannotation rewarded models with well-ordered top-10 lists, regardless of leaderboard position.
Models everyone suspected were leaderboard-tuned fell. Models that retrieved the relevant doc but ordered the top-10 poorly took visible NDCG hits under graded labels.
Sibling-model gaps widened. Small/mid/large lineups in the same provider’s family (including ours, Voyage’s, and OpenAI’s) that looked near-tied under binary labels showed clear ordering gaps under graded labels. The bigger models were bigger; binary was just blind to it. In effect-size terms, within-family Cohen’s d widened across most subsets, which is the formal signature of “the test got better” in the sense developed in eval set quality.

The methodology, not the result, is the point. Any benchmark with binary qrels can be re-annotated with graded labels via a properly-built LLM judge, and most binary leaderboards become more informative — sometimes meaningfully different — under that treatment. The MTEB reannotation is the worked example.

Caveats — judge bias, contamination, cost

Graded-relevance LLM judges inherit every bias of the underlying LLM-as-judge stack, plus a few specific to the graded setting:

Length bias amplifies on graded scales. Judges grade longer documents 0.3-0.5 grades higher on average, holding actual content constant. Mitigation: truncate all documents to a fixed token budget, or normalize via a length-matched control.
Position bias does not apply directly (there is only one document per judgment), but anchor-prompt bias does — the order in which the 0/1/2/3 anchor examples appear subtly biases the judge. Mitigation: shuffle anchor order across runs.
Contamination of the judge model. If the judge LLM was trained on the eval queries (e.g., the judge has seen MS MARCO during pretraining), grades skew toward what the judge “remembers” as relevant rather than what the document actually says. Use a held-out judge or a domain the judge demonstrably has not seen.
Cost. A 28-dataset graded reannotation with three frontier judges per pair runs into the hundreds of thousands of API calls. Distillation into a smaller judge model is the practical move at production scale, following the same logic as zELO’s pairwise teacher.
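Two of the mitigations above, fixed-budget truncation and anchor-order shuffling, are simple enough to sketch. The 512-token budget and the whitespace tokenization are assumptions for illustration; a real pipeline would more likely count tokens with the judge model's own tokenizer.

```python
import random


def truncate_to_budget(text, max_tokens=512):
    """Keep the first max_tokens whitespace-separated tokens (a rough proxy)."""
    tokens = text.split()
    return " ".join(tokens[:max_tokens])


def shuffled_anchor_order(grades=(0, 1, 2, 3), run_seed=0):
    """Return the anchor grades in a per-run order, reproducible from the seed."""
    order = list(grades)
    random.Random(run_seed).shuffle(order)
    return order


long_document = "token " * 2000
print(len(truncate_to_budget(long_document).split()))  # 512

# Each run presents the anchor examples in a different, reproducible order.
for run_seed in range(3):
    print(run_seed, shuffled_anchor_order(run_seed=run_seed))
```

Seeding the shuffle per run keeps the evaluation reproducible while still averaging out any effect of anchor order across runs.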

The right mental model: graded-relevance judging is a higher-resolution measurement instrument with the same systematic-error footprint as binary judging, plus a few new pitfalls specific to the rubric. The resolution gain dominates the new pitfalls when the benchmark is built carefully, which is why reranker leaderboards are increasingly moving in this direction.

Go further

Why does graded relevance discriminate models that binary relevance can't?

Binary collapses the signal at the threshold — two models that both retrieve the relevant doc but place it at position 1 vs position 9 look identical under recall@10. Graded relevance lets the metric see that gap. NDCG@10 with grades produces visibly different scores where binary recall@10 ties; that is the whole reason the field is moving to it.


What did the MTEB reannotation actually find?

Models everyone assumed were strong rose to the top; models everyone suspected were leaderboard-tuned fell. The most striking effect was on sibling models in the same family — small / mid / large in the same provider's lineup — where binary labels showed near-ties and graded labels opened large gaps. Graded relevance is a higher-resolution measurement of the same underlying ordering quality.


How much does chain-of-thought before the grade actually help?

A lot. Asking the judge to write a short rationale (1-3 sentences) before emitting the integer grade typically lifts inter-judge Cohen's kappa from ~0.55 to ~0.75 on retrieval relevance, which is comparable to human-human agreement. The CoT forces the judge to enumerate which query facets the document covers, which is exactly the reasoning that distinguishes a 2 from a 3.
