Eval-Spend Overrun
LLM-as-judge becomes the dominant inference line item because every dashboard refresh re-scores the same (query, doc) pairs against the same teacher.
Symptom
Eval is the largest line on the LLM invoice — larger than answer generation. The pattern is recognizable from four co-occurring signals:
- Eval spend exceeds 40% of total LLM spend and trends upward release over release, eventually crossing production inference.
- Per-refresh cost is flat in user count. Eval cost grows with refresh frequency and candidate fan-out, not with traffic.
- Most calls hit a frontier judge. The teacher is the strongest available model, applied to every (query, doc) pair regardless of difficulty.
- The fraction of eval spend that is structurally necessary is unattributed. Scoring is everything-against-everything, and the dashboard is load-bearing enough that no one cuts.
The eval is honest. The dashboards are useful. The bill is real. All three statements describe the same pipeline.
“Our LLM-as-judge bill is bigger than our production inference bill.”
Mechanism
LLM-as-judge is the lazy-but-correct way to score relevance at scale: feed (query, doc) to a frontier model, ask for a 0–10 score or a pairwise preference, average across two or three judges. The first version ships in an afternoon and works. Its cost model is not the cost model of inference, and that is the part that surprises production teams.
Production inference scales as O(queries × answer-tokens). Eval scales as O(queries × candidates × judge-prompt-tokens × refresh-frequency × n-judges). The five-factor product is what makes eval diverge from inference. Most factors are silently multiplicative.
A canonical configuration: 1,000–5,000 query eval set, top-10–top-20 candidates per query, a 2,000–4,000-token judge prompt, 100–300 output tokens per call. At frontier-judge pricing of roughly
Compounding is where the line item appears. Pairwise judging blows top-N into N(N-1)/2 comparisons. Ensembling adds a 3–5× multiplier. Refreshes across main plus 3–5 active feature branches add another 4–6×. CI runs on the retrieval branch add a steady tail. Stacked multipliers land typical configurations at 100–200× the naive cost, which is the range where eval crosses inference.
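The stacking arithmetic can be made concrete with a small sketch. The function name `stacked_multiplier` and the choice to express pairwise cost relative to one pointwise pass are assumptions made for illustration; the multiplier ranges come from the paragraph above.

```python
def stacked_multiplier(top_n: int, judges: int, refresh_streams: int) -> float:
    """Cost multiplier over a single-judge, single-stream, pointwise pass.

    Pointwise scoring of top_n candidates costs top_n calls per query.
    Full pairwise costs top_n * (top_n - 1) / 2 calls, which is
    (top_n - 1) / 2 times the pointwise count.
    """
    pairwise_factor = (top_n - 1) / 2
    return pairwise_factor * judges * refresh_streams


# Top-10 pairwise, 5 judges, main plus 5 feature branches.
print(stacked_multiplier(top_n=10, judges=5, refresh_streams=6))  # 135.0
# Top-20 pairwise, 3 judges, main plus 3 feature branches.
print(stacked_multiplier(top_n=20, judges=3, refresh_streams=4))  # 114.0
```

Both configurations sit inside the 100–200× band without any single factor looking unreasonable on its own.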
Four places the cost compounds, in roughly decreasing order of how much production systems overpay:
- Same pairs re-scored on every refresh. The eval set does not change between Monday and Tuesday; both days consume a full re-scoring pass because the pipeline has no cache keyed on (query, doc, judge_version).
- Frontier teacher used for every call. The bulk of pairs are obviously relevant or obviously irrelevant. A frontier model confirming the obvious is the wrong tier of compute.
- Absolute scoring instead of pairwise on hard pairs only. A 0–10 absolute score is roughly twice as expensive per useful bit of signal as a pairwise judgment, and noisier on calibration. Full pairwise on all pairs is worse: it is O(N²) for a problem that should be O(N log N).
- No discount-tier batching. Most providers offer ~50% off for async batch APIs with a multi-hour SLA. Eval dashboards rarely need real-time scoring; a nightly refresh sits inside a 24-hour batch comfortably.
Each lever is independent. Stacked, they reach the cost-band a distilled judge sits in — about two orders of magnitude below the naive version.
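One way to read the O(N log N) claim is that a pairwise judge can serve as the comparator of an ordinary comparison sort instead of being run on every pair. The sketch below assumes a hypothetical `pairwise_judge` backed by hidden relevance scores; a real LLM judge is noisy and not guaranteed to be transitive, so treat this as an illustration of the call count, not a drop-in ranking method.

```python
import random
from functools import cmp_to_key

random.seed(0)

# Hypothetical stand-in for a pairwise LLM judge: hidden relevance per doc.
hidden_relevance = {f"d{i}": random.random() for i in range(20)}
judge_calls = 0


def pairwise_judge(doc_a: str, doc_b: str) -> int:
    """Return -1 if doc_a ranks above doc_b, 1 if below, 0 if tied."""
    global judge_calls
    judge_calls += 1
    a, b = hidden_relevance[doc_a], hidden_relevance[doc_b]
    return -1 if a > b else (1 if a < b else 0)


docs = list(hidden_relevance)
# The sort only asks the judge about the pairs it needs to order the list.
ranking = sorted(docs, key=cmp_to_key(pairwise_judge))

n = len(docs)
print(f"sort-based comparisons: {judge_calls}")
print(f"full pairwise would be: {n * (n - 1) // 2}")
```

With a noisy judge, the same idea is usually combined with the cascade: sort with the cheap judge and spend frontier calls only on adjacent pairs whose order is uncertain.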
Diagnostic
Four tests, ordered cheap-to-expensive. The answer is usually obvious by Test 2.
Test 1 — eval-spend share of LLM bill (30 seconds, one SQL query)
SELECT
CASE WHEN purpose = 'judge' THEN 'eval' ELSE 'inference' END AS bucket,
SUM(input_tokens * input_price
+ output_tokens * output_price) AS spend_usd,
COUNT(*) AS calls
FROM llm_usage_log
WHERE day >= CURRENT_DATE - 30
GROUP BY bucket;
-- assumed columns: purpose, input_tokens, output_tokens,
-- input_price (USD/token), output_price (USD/token), day
| eval / inference ratio | reading |
|---|---|
| under 0.2 | healthy; eval is a normal overhead, pattern is dormant |
| 0.2–0.5 | suspect; run Test 2 |
| 0.5–1.0 | pattern firing; eval is co-dominant with production |
| over 1.0 | pattern dominant; eval has overtaken inference |
Absence of a purpose = 'judge' tag in usage logs is itself a finding — eval has been paid for without attribution, which is how the line item grew unnoticed. Attribution comes first; an unobserved cost cannot be managed.
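If the usage log lacks a purpose tag, attribution can start at the call site. The sketch below mirrors the SQL above in plain Python; the `UsageRecord` type, its field names, and the sample prices are assumptions chosen to match the assumed log columns, not an existing schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class UsageRecord:
    purpose: str         # "judge" for eval calls; anything else is inference
    input_tokens: int
    output_tokens: int
    input_price: float   # USD per token
    output_price: float  # USD per token

    @property
    def spend_usd(self) -> float:
        return (self.input_tokens * self.input_price
                + self.output_tokens * self.output_price)


def spend_by_bucket(records):
    """Bucket spend into eval vs inference, as the SQL query does."""
    totals = defaultdict(float)
    for r in records:
        bucket = "eval" if r.purpose == "judge" else "inference"
        totals[bucket] += r.spend_usd
    return dict(totals)


# Hypothetical log: two judge calls and one answer-generation call.
log = [
    UsageRecord("judge", 3_000, 200, 15e-6, 75e-6),
    UsageRecord("judge", 3_000, 200, 15e-6, 75e-6),
    UsageRecord("answer", 1_000, 500, 15e-6, 75e-6),
]
totals = spend_by_bucket(log)
print(totals)
print(f"eval / inference ratio: {totals['eval'] / totals['inference']:.2f}")
```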
Test 2 — cost decomposition by axis (5 minutes, small Python)
The first directional test. Decompose eval spend along the five multiplicative axes from the mechanism section and read which axis dominates.
import numpy as np
import pandas as pd
np.random.seed(42)
# Inline fixture — replace with your real usage log loader.
# Columns: queries, candidates, judges, refreshes_per_day, days,
# input_tokens_per_call, output_tokens_per_call.
config = {
"queries": 2_000,
"candidates_per_query": 20,
"judges_per_call": 3,
"refreshes_per_day": 4, # main + 3 feature branches
"days": 30,
"input_tokens_per_call": 3_000,
"output_tokens_per_call": 200,
"price_in": 15e-6, # $/token frontier judge
"price_out": 75e-6,
}
calls = (config["queries"]
* config["candidates_per_query"]
* config["judges_per_call"]
* config["refreshes_per_day"]
* config["days"])
input_cost = calls * config["input_tokens_per_call"] * config["price_in"]
output_cost = calls * config["output_tokens_per_call"] * config["price_out"]
total = input_cost + output_cost
print(f"calls/month: {calls:>14,}")
print(f"input cost : ${input_cost:>12,.0f}")
print(f"output cost: ${output_cost:>12,.0f}")
print(f"total : ${total:>12,.0f}")
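# Sensitivity: recompute the bill with one axis halved at a time.
# Call-count axes halve the whole bill; input_tokens_per_call halves only
# the input share.
def monthly_cost(cfg):
    n_calls = (cfg["queries"] * cfg["candidates_per_query"]
               * cfg["judges_per_call"] * cfg["refreshes_per_day"] * cfg["days"])
    return n_calls * (cfg["input_tokens_per_call"] * cfg["price_in"]
                      + cfg["output_tokens_per_call"] * cfg["price_out"])
axes = ["candidates_per_query", "judges_per_call",
        "refreshes_per_day", "input_tokens_per_call"]
for axis in axes:
    saved = total - monthly_cost({**config, axis: config[axis] / 2})
    print(f"halve {axis:>24} saves ~${saved:,.0f}/mo")
# Expected output (illustrative):
# calls/month: 14,400,000
# input cost : $ 648,000
# output cost: $ 216,000
# total : $ 864,000
# halve candidates_per_query saves ~$432,000/mo
# halve judges_per_call saves ~$432,000/mo
# halve refreshes_per_day saves ~$432,000/mo
# halve input_tokens_per_call saves ~$324,000/mo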
Healthy: no single axis explains more than 30% of overspend. Suspect: one axis dominates without a written justification (e.g., judges_per_call = 5 for ensembling where 3 suffices). Confirmed: two or more axes are in unjustified-multiplicative territory — the shape of every dominant eval-spend overrun.
Test 3 — caching potential and judge-tier waste (30 minutes, medium Python)
The decisive test. Two questions at once. How much eval spend is on pairs that have not changed since the last refresh? And what fraction of pairs are easy enough that a cheap judge would agree with the frontier judge? The second question is the cascadability test from cascade-style retrieval, applied to eval.
import numpy as np
import hashlib
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
np.random.seed(42)
# Inline fixture: simulate 4 nightly refreshes over the same eval set.
# Most pairs recur; a small fraction churn (corpus updates, eval edits).
JUDGE_VERSION = "judge-v3"
base_pairs = [(f"q{i}", f"d{j}") for i in range(2_000) for j in range(20)]
refresh_logs = []
for night in range(4):
    churn = np.random.rand(len(base_pairs)) < 0.02  # 2% pair churn/night
    pairs = [(q, f"{d}_v{night}" if c else d)
             for (q, d), c in zip(base_pairs, churn)]
    refresh_logs.append(pairs)
# Cache hit rate if we keyed on (q, d, judge_version):
seen, hits, misses = set(), 0, 0
for night_pairs in refresh_logs:
    for q, d in night_pairs:
        h = hashlib.sha256(repr((q, d, JUDGE_VERSION)).encode()).hexdigest()
        if h in seen:
            hits += 1
        else:
            misses += 1
            seen.add(h)
print(f"cache hit rate over 4 refreshes: {hits / (hits + misses):.1%}")
# Expected output: cache hit rate over 4 refreshes: 73.5%
# (the first refresh is all misses; later refreshes miss only the ~2% churn)
# Cascadability: what fraction of pairs are "obvious"?
# Proxy: cosine similarity between query and doc text. Very high or very low
# similarity is easy; the middle band is the hard pairs that need the frontier.
queries = [f"query about topic {i % 50}" for i in range(2_000)]
docs = [f"document about topic {i % 50} extra noise" for i in range(2_000)]
vec = TfidfVectorizer(min_df=1).fit(queries + docs)
qv = vec.transform(queries); dv = vec.transform(docs)
sims = np.array([cosine_similarity(qv[i], dv[i])[0, 0] for i in range(2_000)])
easy = ((sims > 0.85) | (sims < 0.15)).mean()
hard = ((sims >= 0.40) & (sims <= 0.60)).mean()
middle = 1 - easy - hard
print(f"easy (cheap judge ok): {easy:.1%}")
print(f"middle: {middle:.1%}")
print(f"hard (frontier required): {hard:.1%}")
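# Illustrative output on a production-like eval set. The toy fixture above is
# too uniform to reproduce this split; it only exercises the code path.
# easy (cheap judge ok): 61.4%
# middle: 32.0%
# hard (frontier required): 6.6%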
Two numbers come out the other end. Cache hit rate quantifies savings from a (query, doc, judge_version) cache before any model changes. Easy fraction quantifies how much of the remainder a cheap judge can handle in a cascade. Combine them: a 70% cache hit rate with 60% easy on the remaining 30% makes 0.7 + 0.3 × 0.6 = 88% of frontier-judge calls avoidable through two structural changes. Cached calls lose no accuracy; cascaded calls lose none as long as the cheap judge agrees with the frontier on the easy pairs.
In production, swap the TF-IDF cosine for the real embedding model and the simulated churn for actual log diffs. The shape of the answer is what matters.
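The combination rule is simple enough to keep as a helper next to the diagnostic. This is a minimal sketch; the function name `avoidable_fraction` is an assumption, and the formula is the one used in the 88% example above.

```python
def avoidable_fraction(cache_hit_rate: float, easy_fraction: float) -> float:
    """Share of frontier calls removed by a cache, then a cascade on the misses."""
    return cache_hit_rate + (1 - cache_hit_rate) * easy_fraction


print(f"{avoidable_fraction(0.70, 0.60):.0%}")  # 88%
print(f"{avoidable_fraction(0.85, 0.70):.1%}")  # 95.5%
```

The cascade term applies only to cache misses, which is why the two rates do not simply add.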
Test 4 — cost-per-confirmed-bit as a monitorable scalar
The three tests above suit manual investigation. Ongoing monitoring needs a single scalar: dollars per unit of usable signal, computed as eval spend divided by the unique, decision-relevant judgments produced that week.
import numpy as np
# Inputs from the weekly eval pipeline.
eval_spend_usd = 9_400 # this week's judge bill
unique_pairs_scored = 18_000 # de-duped (query, doc) judged this week
decisions_made = 14 # release-blocking deltas the eval informed
cost_per_pair = eval_spend_usd / unique_pairs_scored
cost_per_decision = eval_spend_usd / decisions_made
print(f"$/pair scored: ${cost_per_pair:7.3f}")
print(f"$/decision made: ${cost_per_decision:7.0f}")
# Healthy: $/pair scored under $0.05, $/decision under $200
# Suspect: $/pair $0.05–$0.20, $/decision $200–$1000
# Confirmed: $/pair over $0.20, $/decision over $1000
Plot both weekly. The pair number catches mechanical waste — re-scoring, frontier-on-easy-pairs. The decision number catches the deeper waste of paying for resolution on signal nobody acts on. The decision number is usually worse than the pair number.
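For alerting, the thresholds in the comments above can be turned into a small classifier. This is a sketch under the stated starting thresholds; the helper names `band` and `classify_week` are assumptions.

```python
def band(value: float, healthy_below: float, confirmed_above: float) -> str:
    """Map a cost figure onto the healthy / suspect / confirmed bands."""
    if value < healthy_below:
        return "healthy"
    if value > confirmed_above:
        return "confirmed"
    return "suspect"


def classify_week(eval_spend_usd, unique_pairs_scored, decisions_made):
    cost_per_pair = eval_spend_usd / unique_pairs_scored
    cost_per_decision = eval_spend_usd / decisions_made
    return {
        "pair": band(cost_per_pair, 0.05, 0.20),
        "decision": band(cost_per_decision, 200, 1_000),
    }


# Same inputs as the weekly example above.
print(classify_week(9_400, 18_000, 14))
# {'pair': 'confirmed', 'decision': 'suspect'}
```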
Worked example end-to-end
A typical eval-spend overrun, as it presents through the four tests.
Test 1 lands at an eval/inference ratio in the 1.0–2.0 range — eval has crossed production. Test 2 decomposition surfaces two axes carrying most of the cost: ensembling at 4–5 judges per call (where 3 would do), and refresh frequency multiplied across main plus 3–5 feature branches. Test 3 returns cache hit rate in the 70–85% band on the last four refreshes (the eval set is stable; identical pairs are scored nightly) and easy-fraction in the 55–70% band. Combined avoidable share lands at 85–95%. Test 4 reports $/pair in the $/decision in the
The treatment unwinds in fixed order. Caching ships first: 5–10× cost reduction on identical pairs, zero accuracy cost. Ensembling drops from 4–5 judges to 1–2 once an audit-sample agreement check confirms inter-rater agreement above 90%: another 2–3× with no measurable signal loss. Nightly traffic moves to the batch API for a free 2× on whatever remains. A cascade with a cheap judge first and a frontier on disagreement strips another 3–5×. Distillation queues for the quarter where $/decision plateaus and the bill is still meaningful.
The terminal state is eval at 10–15% of total LLM spend, an unchanged dashboard, unchanged metrics, and unchanged downstream decisions. The bill is roughly one tenth of its peak.
Treatment
Five interventions, ordered by saves-per-engineering-hour. Each is independent; ship in order, measure after each.
1. Cache by (query_hash, doc_hash, judge_version)
The cheapest win and the highest hit-rate. The same (query, doc) pair recurs across dashboard refreshes, nightly CI, and weekly regression runs. A read-through cache cuts spend 5–10× at zero accuracy cost.
import hashlib, json, redis
r = redis.Redis(decode_responses=True)
def cache_key(query, doc, judge_version):
    # judge_version is part of the key so a teacher swap never hits old scores.
    h = hashlib.sha256(
        f"{judge_version}\n{query}\n{doc}".encode()
    ).hexdigest()
    return f"judge:{h}"
def judge(query, doc, judge_version="judge-v3"):
    key = cache_key(query, doc, judge_version)
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)["score"]
    # _call_llm_judge is the pipeline's existing (uncached) judge call.
    score = _call_llm_judge(query, doc, judge_version)
    r.set(key, json.dumps({"score": score}),
          ex=60 * 60 * 24 * 30)  # 30-day TTL
    return score
WHY this works: most eval sets are stable for weeks at a time, and the judge is close to deterministic at temperature 0 for a fixed (prompt, model_version). The cache costs nothing in accuracy: a repeated identical computation is replaced by a lookup.
Tradeoff: cache invalidation on judge swaps. judge_version belongs in the key; a teacher swap must invalidate rather than silently return stale scores. A 30-day TTL drops entries that are cheaper to rebuild.
2. Drop ensembling unless it can be shown to change a decision
Most pipelines average 3–5 judges per call for “robustness”. Inter-rater agreement on calibrated rubrics typically lands above 95%, and the marginal cases where ensembling changed a release decision are rare enough to count by hand. Ensembling defaults are usually unjustified.
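# Sanity check before dropping ensembling: pairwise agreement on a 200-pair
# sample. If every judge pair agrees on >90% of the sample, drop to 1 judge,
# with a 5% audit re-scored by a second judge for drift.
import numpy as np
# scores: integer array of shape (200, 3), one column per judge (assumed to be
# loaded from the existing ensemble's logged outputs).
agreements = [(scores[:, i] == scores[:, j]).mean()
              for i, j in [(0, 1), (0, 2), (1, 2)]]
drop_ensemble = min(agreements) > 0.90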
WHY this works: variance reduction via averaging earns its cost only when per-call variance is large. On calibrated rubrics with engineered prompts, per-call variance is small; the law-of-large-numbers benefit does not compensate for the linear cost.
Tradeoff: the inter-rater reliability time series goes away. A 5% audit sample with multiple judges preserves enough signal to detect calibration drift without paying for the full ensemble.
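One way to pick the 5% audit sample is to hash the pair, so the same pairs are audited on every refresh and the agreement series stays comparable week to week. This is a sketch of that design choice, not something the pipeline above prescribes; `in_audit_sample` is a hypothetical helper.

```python
import hashlib


def in_audit_sample(query: str, doc: str, rate: float = 0.05) -> bool:
    """Deterministically select about `rate` of pairs for a second-judge audit."""
    digest = hashlib.sha256(f"{query}\n{doc}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


pairs = [(f"q{i}", f"d{j}") for i in range(500) for j in range(20)]
audited = [p for p in pairs if in_audit_sample(*p)]
print(f"audited {len(audited)} of {len(pairs)} pairs "
      f"({len(audited) / len(pairs):.1%})")
```

Random sampling per refresh would also work; the hash-based variant trades a fresh sample for stable membership and cacheable audit scores.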
3. Cascade — cheap judge then frontier on disagreement
Test 3 quantifies the easy fraction. A cheap judge — smaller model, faster frontier tier, or fine-tuned cross-encoder — handles easy pairs; the frontier handles only the close calls.
def cascade_judge(query, doc, threshold=5.0):
    cheap_score, cheap_logit = cheap_judge_with_confidence(query, doc)
    if abs(cheap_logit) > threshold:
        return cheap_score
    return frontier_judge(query, doc)
WHY this works: relevance is bimodal in practice. Most pairs are obvious to a model an order of magnitude smaller than the frontier; the frontier earns its cost only on the close-call minority. The cascade routes spend to where resolution matters.
Tradeoff: cheap-judge confidence must itself be calibrated, or the cascade lets through hard cases it should escalate. This is exactly the failure mode of cascade saturation — read before shipping, and instrument a small re-audit of cheap-resolved pairs against the frontier to detect gate drift.
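The re-audit can be as small as an agreement rate over a sample of cheap-resolved pairs that were also sent to the frontier. The sketch below assumes exact-match agreement and a 90% floor; both the helper names and the floor are illustrative assumptions, not values taken from the pipeline.

```python
def gate_agreement(audited):
    """audited: list of (cheap_score, frontier_score) for cheap-resolved pairs."""
    matches = sum(1 for cheap, frontier in audited if cheap == frontier)
    return matches / len(audited)


def gate_has_drifted(audited, min_agreement=0.90):
    """Flag the gate when cheap-resolved pairs stop matching the frontier."""
    return gate_agreement(audited) < min_agreement


# Hypothetical audit samples: (cheap_score, frontier_score).
stable_week = [(3, 3)] * 19 + [(2, 3)] * 1
drifted_week = [(3, 3)] * 16 + [(1, 3)] * 4

print(gate_agreement(stable_week), gate_has_drifted(stable_week))    # 0.95 False
print(gate_agreement(drifted_week), gate_has_drifted(drifted_week))  # 0.8 True
```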
4. Shrink the prompt before swapping the model
Prompt is the leverage point most teams skip. A 2,500–3,500-token prompt with elaborate chain-of-thought, few-shot examples, and a multi-page rubric typically does the same work as a 400–800-token prompt with a tight rubric. The model already knows what relevance is.
SHRUNK_PROMPT = """Score relevance 0-3.
0=off-topic, 1=related, 2=partial answer, 3=direct answer.
Query: {q}
Doc: {d}
Score:"""
# Validate against the original on a 200–500 pair held-out sample.
# Ship if Spearman > 0.93.
WHY this works: input tokens are roughly 80% of judge cost (long prompt, short reasoning), so halving the prompt cuts the bill by roughly 40%. A well-engineered rubric does not need few-shot scaffolding to get a frontier judge to agree with itself.
Tradeoff: prompt shrinkage is per-prompt and requires validation. Correlation against the original prompt on a 200–500 pair held-out sample is the gate. If correlation falls below 0.93, the deleted section was load-bearing.
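The correlation gate can be computed with the standard library alone. The sketch below implements Spearman as the Pearson correlation of tie-averaged ranks; the score lists are made-up placeholders, and in a pipeline that already depends on SciPy, `scipy.stats.spearmanr` does the same job.

```python
def average_ranks(values):
    """Ranks starting at 1, with ties given the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation; undefined (division by zero) if either input is constant."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Placeholder scores on the same held-out pairs (0-3 rubric).
original_prompt_scores = [3, 3, 2, 0, 1, 2, 0, 3, 1, 0]
shrunk_prompt_scores = [3, 2, 2, 0, 1, 2, 0, 3, 1, 1]

rho = spearman(original_prompt_scores, shrunk_prompt_scores)
print(f"Spearman = {rho:.3f}; ship = {rho > 0.93}")
```

A 0–3 rubric produces many ties, so tie-averaged ranks matter here; the shortcut formula based on squared rank differences assumes no ties.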
5. Distill the judge into a 7B specialist
The largest single win and the highest-effort. From 3,000–10,000 cached (query, doc, teacher_score) triples — already produced by step 1 — a cross-encoder fine-tuned on the teacher’s outputs typically reaches above 95% agreement on held-out pairs. Per-call cost drops 50–200×.
Inputs : cached triples from step 1.
Train : cross-encoder (0.5B–7B base) on (q, d) -> teacher score (soft target).
Gate : Spearman > 0.93 on held-out AND no cluster regresses by > 3 pts.
Monitor: weekly re-validation against teacher; retrain if agreement < 85%.
WHY this works: the teacher is a general-purpose frontier model doing a narrow task it can be distilled into. The specialist learns the task, not the world. Quantization and inference-time batching push specialist cost down further, into the bracket where running it on every dashboard refresh is a rounding error.
Tradeoff: distillation introduces a new failure mode, distillation drift. The student agrees with the teacher on the training distribution and silently disagrees on the production distribution as production shifts. The teacher stays as the oracle for periodic re-validation; agreement on a weekly sample below 85% triggers retrain. Shipping the distilled judge without a drift monitor is not safe.
Move all non-live scoring to batch APIs
Free 2× on every step above. Most providers offer ~50% off for async batch APIs with a multi-hour SLA. Dashboards that refresh nightly do not need real-time scoring.
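# Schematic request shape; the real batch endpoint and field names differ by
# provider. `batch`, `prompt`, and `pairs` come from the surrounding pipeline.
batch.create(requests=[
    {"custom_id": f"q{i}", "model": "judge-small",
     "messages": [{"role": "user", "content": prompt(q, d)}],
     "max_tokens": 100}
    for i, (q, d) in enumerate(pairs)
])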
WHY this works: pricing arbitrage. Same model, same prompt, same answer; the provider amortizes across off-peak capacity. No accuracy cost.
Tradeoff: latency. CI runs that block on the eval cannot use batch; gate-on-merge releases typically can. Move what fits; leave the rest on the real-time tier.
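The split between tiers can be estimated before touching any provider API. The sketch below assumes a flat 50% batch discount and uses made-up job names and dollar figures; only jobs that block on the result stay on the real-time tier.

```python
BATCH_DISCOUNT = 0.5  # assumed ~50% batch discount, as quoted above


def monthly_cost(jobs):
    """jobs: list of (name, realtime_cost_usd, blocks_on_result)."""
    total = 0.0
    for name, realtime_cost_usd, blocks_on_result in jobs:
        tier = "realtime" if blocks_on_result else "batch"
        factor = 1.0 if blocks_on_result else BATCH_DISCOUNT
        cost = realtime_cost_usd * factor
        print(f"{name:<20} {tier:<9} ${cost:>7,.0f}")
        total += cost
    return total


# Hypothetical monthly judge spend per workload at real-time prices.
jobs = [
    ("nightly dashboard", 6_000, False),
    ("weekly regression", 2_000, False),
    ("blocking CI check", 1_400, True),
]
total = monthly_cost(jobs)
print(f"total: ${total:,.0f}")  # total: $5,400
```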
What does NOT work — and every team tries first
Picking a cheaper frontier model and calling it a day. Frontier price-per-token falls every six months, and teams ride that curve. This masks the pattern — the bill falls, the apparent problem resolves, the multiplicative cost structure is unchanged. Six months later, the eval set has grown, the candidate count has grown, the refresh frequency has grown, and the bill is back where it was on a cheaper model. The fix is structural, not procurement.
Eval cost is a five-factor product. Cache identical pairs, drop unjustified ensembling, cascade the easy ones to a cheap judge, shrink the prompt — then, and only then, distill. Each lever is cheap; the product of all five is two orders of magnitude.
This isn’t this pattern when…
| You observe… | This is probably… | Read next |
|---|---|---|
| Eval bill is lean, production inference is the line item | A frontier LLM is doing work a specialist could | Single-LLM overspend |
| Eval bill came down via distillation, then student disagrees with users | Student degraded on live distribution | Distillation drift |
| Eval is cheap, cascade gate is letting through cases it should escalate | Cheap-stage confidence calibration drifted | Cascade saturation |
| Eval is fine, but offline metrics climb while users complain | Eval set captured an old query distribution | Eval drift |
| Eval is fine, judge agrees with users, threshold breaks on model swap | Score-distribution shift under model change | Threshold by feel |
The disambiguation rule: eval-spend overrun is about the cost structure of the judge pipeline itself. The neighbors are about what the judge is judging, what happens after the judge is replaced, or whether the judge is judging the right distribution. Same domain (LLM-as-judge), different layer of the stack.
Numbers that matter
| signal | healthy | suspect | confirmed |
|---|---|---|---|
| eval / inference spend ratio | under 0.2 | 0.2–0.5 | over 0.5 |
| cache hit rate over 4 refreshes | over 70% with cache shipped | 40–70% (cache exists, churn high) | under 20% (no cache) |
| frontier-judge share of calls | under 30% | 30–70% | over 70% |
| $/pair scored | under $0.05 | $0.05–$0.20 | over $0.20 |
| $/decision informed | under $200 | $200–$1,000 | over $1,000 |
These are starting thresholds. Healthy varies with eval-set heterogeneity and the number of releases the dashboard informs per week. A team with two releases a week and a 500-query eval has a very different healthy $/decision than one with daily releases and a 10,000-query eval.
Adjacent patterns
- Single-LLM overspend: the parent pattern. Eval-spend overrun is one specific class of “using a frontier LLM for work a specialist could do” — the work in this case being relevance scoring. The framing here is downstream of it.
- Distillation drift: the failure mode introduced by Treatment §5. The distilled judge agrees with the teacher on training data and silently degrades on production distribution as the distribution shifts. The drift-monitor recipe deploys alongside the distilled judge.
- Cascade saturation: the failure mode introduced by Treatment §3. The cheap-then-expensive cascade saves money until cheap-stage calibration drifts and the cascade silently lets through cases it should escalate. Same audit-sample posture as the distilled-judge case.
- Eval drift: the cousin failure. Eval-spend overrun is about paying too much for the eval; eval drift is about the eval measuring the wrong distribution. They co-occur often — pipelines that let the bill grow unchecked tend to be the same pipelines that let the eval set age unchecked.
If caching, ensembling-drop, cascading, and prompt-shrinking have shipped and the bill remains the line item — the pattern is one of the above neighbors, not this one.
