LLM-as-judge

Also known as: LLM judge, model-graded eval, LLM rater

TL;DR

Using a frontier LLM to score outputs (relevance, faithfulness, answer quality) at a scale where human raters can't keep up. Powerful for graded labels, but it introduces position bias, verbosity bias, and self-preference bias.

LLM-as-judge is the technique of using a frontier language model — Claude, GPT, Gemini — as an automated rater. Given a query, a candidate output, and a rubric, the LLM emits a score (or a pairwise preference). Repeat across thousands of examples, average, and you have an evaluation signal where you couldn’t afford humans.

The technique is now load-bearing infrastructure for retrieval and RAG evals: the graded relevance labels behind ZeroEntropy’s reannotation, the scores in modern RAG benchmarks, and the pointwise targets behind reranker training all come from LLM judges.

[Figure: LLM-as-judge, pairwise. A model scores a model; watch for the biases. For the question “Summarize the Treaty of Versailles in one sentence.”, candidate A (12 tokens; concise, accurate prose) and candidate B (38 tokens; detailed, bold markdown) go to a rubric-prompted frontier LLM judge. The judge scores A at 0.46 and B at 0.71, with P(B ≻ A) = σ(Δscore), and returns the verdict B ≻ A: “B is more complete; covers reparations.” Known biases: position (prefers A or B by order), verbosity (longer looks better), self-preference (judge favors its family), format (markdown beats prose). Mitigate with rubrics, position swap, and multi-judge ensembles.]

Why it works

Three properties make LLMs surprisingly good raters:

  1. Calibration on well-specified rubrics. “Is this passage relevant to this query, on a scale of 0-3?” gets reasonably consistent scores from a frontier model. Agreement with humans (Cohen’s kappa) typically lands at 0.6-0.8 — in the same range as human-human agreement.
  2. Throughput. A judge can score millions of pairs in hours, where humans would take months at orders of magnitude higher cost.
  3. Reproducibility. A judge can be re-run whenever the eval set changes; a human annotation pass can’t be repeated on demand.
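
To make the agreement figure in the first point concrete, here is a minimal sketch of unweighted Cohen’s kappa between a human rater and a judge. The label lists are made-up illustration data, and `cohens_kappa` is a hypothetical helper written for this example, not part of any particular library. For ordinal scales such as 0-3 relevance, a weighted kappa is often preferred; this sketch uses the plain version.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability that both raters pick the same label
    # independently, given each rater's own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_chance = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)

# Hypothetical 0-3 relevance labels for ten query-passage pairs.
human = [3, 2, 0, 1, 3, 0, 2, 1, 0, 3]
judge = [3, 2, 0, 2, 3, 0, 2, 1, 1, 3]
kappa = cohens_kappa(human, judge)
print(round(kappa, 3))  # 0.733
```

Here the raters agree on 8 of 10 items (0.8 observed) against 0.25 expected by chance, so kappa is (0.8 - 0.25) / 0.75 ≈ 0.733.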

What goes wrong

The known biases (position, verbosity, self-preference, and format) aren’t speculative: they’re measurable and large enough to flip rankings. Mitigations:

  • Randomize position. Shuffle (A, B) order across pairs; if the judge’s preference flips when the order is swapped, record the pair as a no-confidence result rather than a win.
  • Use multiple judges. An ensemble of Claude, GPT, and Gemini (with per-rater calibration, for example via Beta fits) is much more robust than any single judge.
  • Tighter rubric. “Relevance 0-3” with explicit definitions and examples beats “rate quality 1-10”.
  • Pairwise over pointwise. “Which is better, A or B?” is more consistent than “rate A on its own”.
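
The position-swap and multi-judge mitigations can be sketched as follows. This is an illustration under stated assumptions: a "judge" here is any callable that takes a question and two answers and returns `"first"` or `"second"`, and the three stub judges are hypothetical stand-ins for real model calls. Per-rater calibration is left out.

```python
from collections import Counter
from typing import Callable

# Assumed interface: judge(question, first_answer, second_answer) -> "first" | "second".
Judge = Callable[[str, str, str], str]

def debiased_preference(judge: Judge, question: str, a: str, b: str) -> str:
    """Ask in both orders; return "A", "B", or "tie" if the order changed the verdict."""
    forward = judge(question, a, b)    # A shown first
    backward = judge(question, b, a)   # B shown first
    winner_forward = "A" if forward == "first" else "B"
    winner_backward = "B" if backward == "first" else "A"
    return winner_forward if winner_forward == winner_backward else "tie"

def ensemble_preference(judges: list[Judge], question: str, a: str, b: str) -> str:
    """Strict majority over order-debiased verdicts; anything else is a tie."""
    votes = Counter(debiased_preference(j, question, a, b) for j in judges)
    for candidate in ("A", "B"):
        if votes[candidate] > len(judges) / 2:
            return candidate
    return "tie"

# Stub judges standing in for real model calls (hypothetical).
def always_first(question: str, first: str, second: str) -> str:
    return "first"  # pure position bias

def prefers_longer(question: str, first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"  # pure verbosity bias

question = "Summarize the Treaty of Versailles in one sentence."
short = "The 1919 treaty ended WWI."
long = "The 1919 treaty ended WWI; Germany lost land, paid reparations, and was disarmed."

print(debiased_preference(always_first, question, short, long))    # tie
print(debiased_preference(prefers_longer, question, short, long))  # B
```

Swapping positions neutralizes the position-biased stub (its verdict flips, so it counts as a tie), but the verbosity-biased stub still picks the longer answer in both orders. Order randomization only exposes order sensitivity; the other biases need the rubric and ensemble mitigations.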

A judge rubric that produces stable scores has three properties: explicit category definitions, anchor examples, and a forced output format. The TREC graded-relevance scale is a textbook example — “0: not relevant; 1: related but not directly useful; 2: relevant but partial; 3: highly relevant and complete” — each level pinned to a worked example. Forced output format means the prompt requests JSON with a single score field plus a rationale, so parsing is robust and you can audit disagreement cases. The biggest failure mode is the tempting open-ended rubric (“rate quality from 1 to 10”) — agreement collapses because the judge has no shared meaning of “8 vs 9,” and small prompt edits swing the average score by 1+ points.
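
A minimal sketch of the forced-output-format idea, assuming the four-level scale quoted above. The prompt wording, `build_prompt`, and `parse_verdict` are hypothetical illustrations rather than a canonical TREC prompt, and the anchor examples a real rubric would attach to each level are omitted for brevity.

```python
import json

# Category definitions taken from the graded-relevance scale quoted above.
RUBRIC = {
    0: "not relevant",
    1: "related but not directly useful",
    2: "relevant but partial",
    3: "highly relevant and complete",
}

def build_prompt(query: str, passage: str) -> str:
    """Assemble a judge prompt with explicit levels and a forced JSON reply."""
    levels = "\n".join(f"{score}: {meaning}" for score, meaning in RUBRIC.items())
    return (
        "Rate how relevant the passage is to the query.\n"
        f"Scale:\n{levels}\n\n"
        f"Query: {query}\nPassage: {passage}\n\n"
        'Reply with JSON only: {"score": <0-3>, "rationale": "<one sentence>"}'
    )

def parse_verdict(raw: str) -> dict:
    """Parse the judge's reply; raise ValueError on anything outside the rubric."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge reply is not valid JSON: {raw!r}") from exc
    if not isinstance(data, dict):
        raise ValueError(f"judge reply is not a JSON object: {raw!r}")
    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or score not in RUBRIC:
        raise ValueError(f"score outside rubric: {score!r}")
    return {"score": score, "rationale": str(data.get("rationale", ""))}

verdict = parse_verdict('{"score": 2, "rationale": "Covers the topic but omits the date."}')
print(verdict["score"], "-", RUBRIC[verdict["score"]])
```

Rejecting malformed or out-of-range replies, instead of coercing them, keeps bad parses out of the average and leaves a rationale to read when auditing disagreement cases.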

Pointwise vs pairwise

Pairwise judges are more reliable but require O(N²) judgments, which is expensive at scale. The modern recipe:

  1. Sample pairs cleverly (round-robin, BTL-balanced, or active sampling).
  2. Collect pairwise preferences from one or more LLM judges.
  3. Fit a Thurstone or Bradley-Terry model to recover continuous scores per item.
  4. Optionally distill those scores into a pointwise model that doesn’t need the pairwise overhead.

This is exactly what zELO does for reranker training, and it’s what produces the calibrated scores that downstream thresholding relies on.
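
Step 3 of the recipe can be sketched with a small Bradley-Terry fit. This is a generic illustration using the standard minorization-maximization update, not the zELO implementation; `fit_bradley_terry`, `win_probability`, and the comparison data are hypothetical. The floor applied to strengths is an assumption of this sketch that keeps items with zero wins finite.

```python
import math
from collections import defaultdict

def fit_bradley_terry(comparisons, iterations=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Returns {item: score}, where score is a mean-centered log-strength.
    """
    wins = defaultdict(float)         # total wins per item
    pair_counts = defaultdict(float)  # comparisons per unordered pair
    items = set()
    for winner, loser in comparisons:
        if winner == loser:
            raise ValueError("an item cannot be compared with itself")
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        items.update((winner, loser))
    if not items:
        raise ValueError("no comparisons given")

    strength = {item: 1.0 for item in items}
    for _ in range(iterations):
        updated = {}
        for item in items:
            denom = 0.0
            for pair, count in pair_counts.items():
                if item in pair:
                    (other,) = pair - {item}
                    denom += count / (strength[item] + strength[other])
            # Floor keeps never-winning items finite (assumption of this sketch).
            updated[item] = max(wins[item] / denom, 1e-9)
        # Strengths are only defined up to scale, so renormalize each round.
        scale = len(items) / sum(updated.values())
        strength = {item: value * scale for item, value in updated.items()}

    logs = {item: math.log(value) for item, value in strength.items()}
    mean = sum(logs.values()) / len(logs)
    return {item: value - mean for item, value in logs.items()}

def win_probability(scores, a, b):
    """P(a beats b) under the fitted model: sigmoid of the score difference."""
    return 1.0 / (1.0 + math.exp(scores[b] - scores[a]))

# Hypothetical judge verdicts: each pair compared four times, 3-1 for the stronger item.
comparisons = (
    [("A", "B")] * 3 + [("B", "A")]
    + [("B", "C")] * 3 + [("C", "B")]
    + [("A", "C")] * 3 + [("C", "A")]
)
scores = fit_bradley_terry(comparisons)
for item in sorted(scores, key=scores.get, reverse=True):
    print(item, round(scores[item], 2))
```

The fitted scores are continuous and live on a log-odds scale, so a difference between two items maps directly to a win probability through the sigmoid. Those per-item scores are what a pointwise model would be trained against in step 4.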

What LLM-as-judge is good for

Where LLM-as-judge is load-bearing today:
  • Graded relevance for retrieval evaluation — the MTEB rework and most modern reranker benchmarks use it.
  • Faithfulness scoring for RAG outputs — judging whether the answer is grounded in the retrieved context.
  • Pairwise preferences for training data — DPO, RLHF-from-LLM-feedback, and reranker distillation pipelines.
  • Open-ended quality scoring with caveats — agreement with humans is lower here, but the throughput is irreplaceable.
  • Eval-in-prod sampling — scoring a fraction of live traffic to catch silent quality regressions inside hours, not weeks.
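
For the eval-in-prod item, one common way to pick "a fraction of live traffic" is deterministic hash-based sampling, sketched below. The request ID format, the 5% rate, and `should_judge` are assumptions for illustration; the source does not prescribe a sampling scheme.

```python
import hashlib

def should_judge(request_id: str, sample_rate: float) -> bool:
    """Deterministically select roughly `sample_rate` of requests for judging."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate

# Hypothetical request IDs; in production these would come from the serving layer.
request_ids = [f"req-{i}" for i in range(10_000)]
selected = [rid for rid in request_ids if should_judge(rid, 0.05)]
print(f"{len(selected)} of {len(request_ids)} requests sent to the judge")
```

Hashing the ID rather than calling a random number generator means the same request is always in or out of the sample, which makes a judged set reproducible across reruns and across services.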

What it’s not good for

  • Domains the judge model wasn’t trained on (deeply specialized law, medicine without context).
  • Adversarial outputs designed to game judge biases.
  • Eval sets where the judge’s own outputs are in the candidate pool (use a different judge, or remove that candidate).

The right mental model: it’s not a replacement for human evaluation — it’s a force multiplier that lets you scale to corpus sizes where human eval was impossible. Sanity-check with humans on a sample, always.

LLM-as-judge is human-in-the-loop with the loop closed by a model.

Go further

What biases does LLM-as-judge actually have?

Position bias (preferring whichever answer comes first or last), verbosity bias (longer answers look better), self-preference bias (a model rates its own outputs higher), and consistency drift (the same prompt scored differently across calls). All of these are measurable, and most are mitigable with rubric design and randomized order.

Pointwise vs pairwise judging — which is more reliable?

Pairwise. LLMs are more consistent at 'is A better than B?' than at 'is A a 7 out of 10?'. The downside is O(N²) comparisons. The standard fix: collect pairwise preferences, fit a Thurstone or Bradley-Terry model to recover continuous scores, then train a pointwise model against those scores. zELO does exactly this.

How well does LLM judgment agree with human judgment?

On well-defined tasks (relevance, faithfulness, factual correctness), Cohen's kappa with humans is typically 0.6-0.8, comparable to human-human agreement on the same task. On open-ended quality judgments it's lower, ~0.4-0.6, and the variance across LLM judges is enough to matter. Always have humans judge a sampled subset so you can measure agreement.
