Pattern · V · Updated May 17, 2026
This pattern is called

Single-LLM Overspend

A frontier LLM is used for work a small specialist could do at roughly 1/100 the cost (relevance scoring, faithfulness checking, intent classification) because the work in front of it was never attributed.

Symptom

Single-LLM overspend is the architectural signature of a retrieval pipeline that reaches for a frontier model at every step. The same model name appears four times in a single request trace: query rewrite, retrieval scoring, generation, faithfulness check. The product surface hasn’t moved; the inference bill has, by an order of magnitude over six months.

“Our inference cost has crept up and most of the calls aren’t even doing the part the LLM is uniquely good at.”

The surface signals are consistent across production systems:

  • One frontier-model account exceeds 80% of inference spend. Not a portfolio of model tiers — one model, one provider, every step.
  • Mean tokens-per-request runs well above 10k. Most of those tokens land in retrieval-scoring and faithfulness calls that re-stuff the context window each pass, not in generation.
  • Per-request latency is dominated by serial LLM calls. The waterfall is four stacked bars labeled with the same model.
  • “Just swap to a smaller model” is a running joke that never ships. Nobody can tell which step would break.
  • A fifth LLM call lands in the request path with no friction. Frontier-model-as-default has become muscle memory.

Each call looks load-bearing in isolation. Only one is actually using the part of the model the team is paying for: open-ended generation from a clean context. The other three apply frontier-model rates to closed-form work a 0.6B specialist handles at roughly 1/100 the cost and with better calibration.

Mechanism

Production AI is a constellation of small specialists wrapped around one frontier LLM, not one frontier LLM doing every step. Retrieval-augmented systems arrive at the opposite architecture by accident — by reaching for the LLM as the only available tool at each step where a specialist hasn’t been built yet.

The work that should be done by a specialized small model:

  • Relevance scoring — a cross-encoder reranker. Frontier LLMs emit free-text confidence statements that don’t compose into a threshold; specialists emit calibrated scalars.
  • Query rewriting and expansion — a small distilled model trained on (raw, rewritten) pairs covers the in-domain long tail at a fraction of the cost.
  • Intent classification and routing — a 100M-parameter classifier is faster, cheaper, and calibrated for multi-class probabilities.
  • Faithfulness and grounding checks — a classifier trained on (claim, evidence, supported) triples is the right shape. A frontier LLM re-reading the context window to emit a free-text judgment is overkill.
  • Relevance labeling for offline eval — a distilled judge. The teacher LLM runs once at training time; production scoring is done by the distilled student.

What remains for the frontier LLM after specialization is one thing: answer generation from a clean, ranked, scored, calibrated context window. That is the open-ended work no specialist beats. Everything else is closed-form and belongs in a model trained for it.

A worked numerical cost decomposition

A representative RAG request — retrieve top-50, LLM-rerank to top-5, LLM-generate, LLM-faithfulness-check — at frontier-tier prices (input around $3 / 1M tokens, output around $15 / 1M tokens) decomposes as:

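| Step | Tokens in | Tokens out | Model | Cost |
| --- | --- | --- | --- | --- |
| Query rewrite | 200 | 50 | frontier | $0.0014 |
| LLM rerank top-50 | 12,000 | 200 | frontier | $0.039 |
| Generate answer | 3,500 | 400 | frontier | $0.016 |
| LLM faithfulness | 4,000 | 100 | frontier | $0.014 |
| Per-request total | | | | $0.070 |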

The generate call — the only step that requires frontier capability — is roughly 23% of per-request cost. The other 77% is closed-form work charged at frontier rates. At a million requests a month, this single trace shape bills roughly $70k, against roughly $16k for the generate step alone. The architecture pays frontier-model rates for sub-frontier work.
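The arithmetic behind the table can be reproduced in a few lines. This is a minimal sketch, assuming list prices of $3 per million input tokens and $15 per million output tokens (the rates that reproduce the per-step costs in the table); the names `STEPS`, `PRICE_IN`, and `PRICE_OUT` are illustrative.

```python
# Assumed list prices in dollars per million tokens (illustrative; they
# reproduce the per-step costs in the table above).
PRICE_IN, PRICE_OUT = 3.00, 15.00

# (step, input tokens, output tokens) for one request.
STEPS = [
    ("rewrite", 200, 50),
    ("rerank", 12_000, 200),
    ("generate", 3_500, 400),
    ("faithfulness", 4_000, 100),
]

def step_cost(tok_in: int, tok_out: int) -> float:
    """Dollar cost of one call at the assumed frontier prices."""
    return (tok_in * PRICE_IN + tok_out * PRICE_OUT) / 1_000_000

costs = {name: step_cost(ti, to) for name, ti, to in STEPS}
per_request = sum(costs.values())
generate_share = costs["generate"] / per_request

# Scale one request up to a month of traffic.
monthly_requests = 1_000_000
monthly_total = per_request * monthly_requests
monthly_generate = costs["generate"] * monthly_requests

print(f"per request:      ${per_request:.3f}")       # $0.070
print(f"generate share:   {generate_share:.1%}")     # 23.5%
print(f"monthly total:    ${monthly_total:,.0f}")    # $70,350
print(f"monthly generate: ${monthly_generate:,.0f}") # $16,500
```

The difference between the two monthly figures, a little under $54k, is what the three non-generate steps cost at frontier rates on this trace shape.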

The pattern is endemic because the frontier LLM works at every step on day one with zero investment in specialization. Cost compounds invisibly over months of scale. By the time the invoice forces the conversation, the frontier model is wired into four steps it shouldn’t touch, and the muscle memory is to reach for it on the fifth.

Diagnostic

Four tests, ordered cheap-to-expensive. Run in order; stop when one fires.

Test 1 — model-concentration ratio (under 1 minute)

The cheapest signal: what fraction of last month’s inference spend lands on a single frontier model? The number comes from the provider’s billing console or an internal cost-attribution dashboard.

| concentration on one frontier model | reading |
| --- | --- |
| under 40% | healthy; portfolio is already diversified |
| 40–70% | suspect; run Test 2 |
| over 70% | almost certainly firing; proceed to Test 3 |

A specialized constellation spreads spend across model classes — frontier for generate, mid-tier for routing, specialist endpoints for rerank and faithfulness. A bill concentrated on one frontier model is the architectural signature of this pattern.
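If the billing console offers an export, the ratio takes a few lines to compute. A minimal sketch, assuming a hypothetical export of (model name, dollars) rows; the model names and amounts are invented for illustration, and the sketch treats the largest line item as the frontier model, which should be confirmed by eye.

```python
from collections import defaultdict

# Hypothetical billing export: (model name, dollars spent last month).
billing_rows = [
    ("frontier-large", 41_200.0),
    ("frontier-large", 9_800.0),
    ("mid-tier", 4_100.0),
    ("embedding", 900.0),
]

def concentration(rows):
    """Return (model with the largest spend, its share of total spend)."""
    spend = defaultdict(float)
    for model, dollars in rows:
        spend[model] += dollars
    total = sum(spend.values())
    if total == 0:
        raise ValueError("no spend recorded")
    top = max(spend, key=spend.get)
    return top, spend[top] / total

def reading(share):
    """Map a concentration share onto the bands in the table above."""
    if share < 0.40:
        return "healthy"
    if share <= 0.70:
        return "suspect: run Test 2"
    return "almost certainly firing: proceed to Test 3"

top_model, share = concentration(billing_rows)
print(f"{top_model}: {share:.1%} -> {reading(share)}")
# frontier-large: 91.1% -> almost certainly firing: proceed to Test 3
```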

Test 2 — step-level token attribution (5–15 minutes)

Instrument the request path. Each LLM call records which step it serves, input and output tokens, and the downstream consumer. Aggregate across a representative week.

import dataclasses
from collections import defaultdict

# List prices in dollars per 1M tokens (representative; refresh from your provider).
PRICES = {
    "frontier":   {"in": 3.00, "out": 15.00},
    "mid":        {"in": 0.30, "out": 1.20},
    "mini":       {"in": 0.05, "out": 0.20},
    "specialist": {"in": 0.02, "out": 0.02},
}
TOKENS_PER_PRICE_UNIT = 1_000_000

@dataclasses.dataclass
class LLMCall:
    step: str       # "rewrite" | "rerank" | "generate" | "faithfulness"
    model: str      # "frontier" | "mid" | "mini" | "specialist"
    tok_in: int
    tok_out: int

# Inline fixture — replace with a week of real traces.
trace = [
    LLMCall("rewrite",      "frontier",   200,   50),
    LLMCall("rerank",       "frontier", 12000,  200),
    LLMCall("generate",     "frontier",  3500,  400),
    LLMCall("faithfulness", "frontier",  4000,  100),
] * 1000  # 1000 representative requests

by_step = defaultdict(lambda: {"cost": 0.0, "calls": 0})
for c in trace:
    p = PRICES[c.model]
    # Prices are quoted per 1M tokens, so scale the raw token counts down.
    call_cost = (c.tok_in * p["in"] + c.tok_out * p["out"]) / TOKENS_PER_PRICE_UNIT
    by_step[c.step]["cost"] += call_cost
    by_step[c.step]["calls"] += 1

total = sum(s["cost"] for s in by_step.values())
print(f"{'step':<14} {'calls':>6} {'$ share':>9}")
for step, s in sorted(by_step.items(), key=lambda x: -x[1]["cost"]):
    print(f"{step:<14} {s['calls']:>6} {s['cost']/total:>9.1%}")
# The last row prints total dollars, not a share.
print(f"{'TOTAL':<14} {len(trace):>6} {total:>9.2f}")
# Expected output:
#   step            calls   $ share
#   rerank           1000     55.4%
#   generate         1000     23.5%
#   faithfulness     1000     19.2%
#   rewrite          1000      1.9%
#   TOTAL            4000     70.35

Any step other than generate exceeding roughly 10% of per-request cost is a specialization candidate. In production, rerank and faithfulness each typically run 15–35% of per-request cost, with generate underweighted at 20–40%. That distribution is exactly inverted from what it should be.
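The 10% rule can be applied mechanically once per-step costs exist. A minimal sketch; the function name `specialization_candidates` and its parameters are illustrative, and the example costs are the unrounded per-request figures from the worked decomposition at the assumed $3 / $15 per 1M token prices.

```python
def specialization_candidates(cost_by_step, threshold=0.10, keep="generate"):
    """Return (step, share) for non-generate steps above the threshold, largest first."""
    total = sum(cost_by_step.values())
    if total == 0:
        return []
    flagged = [
        (step, cost / total)
        for step, cost in cost_by_step.items()
        if step != keep and cost / total > threshold
    ]
    return sorted(flagged, key=lambda pair: -pair[1])

# Per-request dollar costs for the four-step trace (assumed prices, see lead-in).
example = {
    "rewrite": 0.00135,
    "rerank": 0.039,
    "generate": 0.0165,
    "faithfulness": 0.0135,
}
for step, share in specialization_candidates(example):
    print(f"{step:<14} {share:.1%}")
# rerank         55.4%
# faithfulness   19.2%
```

Sorting by share puts the heaviest step first, which is also the order in which the treatments pay back.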

Test 3 — call-trace cost-attribution by stage (30+ minutes)

The decisive test. A real week of production traces, with per-step cost computed twice: once at current rates, once at a specialized counterfactual where rerank, faithfulness, rewrite, and routing price at specialist-tier rates. The delta is the specialization headroom — concrete, dollar-denominated, defensible to a finance partner.

import numpy as np
import pandas as pd

np.random.seed(42)

# Same price table as Test 2: dollars per 1M tokens.
PRICES = {
    "frontier":   {"in": 3.00, "out": 15.00},
    "specialist": {"in": 0.02, "out": 0.02},
}

# Specialization map: which step can be replaced by what tier?
# Generate stays frontier; everything else routes to a specialist.
SPECIALIZE = {
    "rewrite":      "specialist",
    "rerank":       "specialist",
    "generate":     "frontier",
    "faithfulness": "specialist",
}

# Inline fixture: 5000 requests, varying token volumes.
def synth_trace(n=5000):
    """Yield (request id, step, tokens in, tokens out), four rows per request."""
    for r in range(n):
        yield r, "rewrite",        200 + np.random.poisson(30),   50 + np.random.poisson(10)
        yield r, "rerank",       11000 + np.random.poisson(800), 200 + np.random.poisson(40)
        yield r, "generate",      3000 + np.random.poisson(500), 400 + np.random.poisson(60)
        yield r, "faithfulness",  3800 + np.random.poisson(300), 100 + np.random.poisson(20)

df = pd.DataFrame(list(synth_trace()), columns=["req", "step", "tok_in", "tok_out"])

def cost_at(tier, ti, to):
    # Prices are quoted per 1M tokens; token counts are raw.
    p = PRICES[tier]
    return (ti * p["in"] + to * p["out"]) / 1_000_000

df["frontier_cost"]    = df.apply(lambda r: cost_at("frontier", r["tok_in"], r["tok_out"]), axis=1)
df["specialized_cost"] = df.apply(lambda r: cost_at(SPECIALIZE[r["step"]], r["tok_in"], r["tok_out"]), axis=1)

agg = df.groupby("step").agg(
    n=("req", "count"),
    cost_now=("frontier_cost", "sum"),
    cost_specialized=("specialized_cost", "sum"),
)
agg["share_now"]         = agg["cost_now"]         / agg["cost_now"].sum()
agg["share_specialized"] = agg["cost_specialized"] / agg["cost_specialized"].sum()
agg["savings"]           = agg["cost_now"] - agg["cost_specialized"]

total_now    = agg["cost_now"].sum()
total_after  = agg["cost_specialized"].sum()
reduction    = 1 - total_after / total_now
specialist_eligible = agg.loc[agg.index != "generate", "cost_now"].sum() / total_now

print(agg[["n", "cost_now", "cost_specialized", "share_now", "share_specialized", "savings"]].round(2))
print(f"\nTotal now:         ${total_now:,.2f}")
print(f"Total specialized: ${total_after:,.2f}")
print(f"Cost reduction:    {reduction:.1%}")
print(f"Specialist-eligible spend fraction: {specialist_eligible:.1%}")
# Approximate output (dollar values are the fixture's expected means;
# exact cents depend on the random draw):
#                    n  cost_now  cost_specialized  share_now  share_specialized  savings
#   step
#   faithfulness  5000    ~70.50             ~0.42       0.20               0.00   ~70.08
#   generate      5000    ~87.00            ~87.00       0.24               0.98     0.00
#   rerank        5000   ~195.00             ~1.20       0.54               0.01  ~193.80
#   rewrite       5000     ~7.95             ~0.03       0.02               0.00    ~7.92
#
#   Total now:         ~$360
#   Total specialized: ~$89
#   Cost reduction:    ~75.4%
#   Specialist-eligible spend fraction: ~75.9%

The number to report is specialist-eligible spend fraction — the share of inference dollars flowing to non-generate steps. Below 20% the pattern is unlikely to be the dominant cost driver. Above 50% the pattern is the dominant cost driver and the architecture is leaving an order of magnitude on the table.

Test 4 — specialist-eligible spend fraction as a monitorable scalar

The first three tests are for manual investigation. For ongoing monitoring, one scalar:

def specialist_eligible_fraction(df):
    """df has columns: step, frontier_cost. Returns fraction in non-generate steps."""
    total = df["frontier_cost"].sum()
    non_gen = df.loc[df["step"] != "generate", "frontier_cost"].sum()
    return non_gen / total

frac = specialist_eligible_fraction(df)
print(f"Specialist-eligible spend fraction: {frac:.1%}")
# Expected: ~0.76 on the fixture above.

Healthy: under 0.20. Suspect: 0.20 – 0.50. Alert: over 0.50. Plot weekly; alarm on threshold crossings; specialize the heaviest non-generate step first when the alarm fires.
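Alarming on threshold crossings means comparing each week's band with the previous one. A minimal sketch of that logic, using the three bands stated above; the weekly history is hypothetical, and `band` and `crossings` are illustrative names rather than part of any monitoring library.

```python
def band(frac):
    """Classify the scalar: healthy under 0.20, suspect up to 0.50, alert above."""
    if frac < 0.20:
        return "healthy"
    if frac <= 0.50:
        return "suspect"
    return "alert"

def crossings(weekly):
    """Yield (week index, old band, new band) whenever the band changes."""
    prev = None
    for week, frac in enumerate(weekly):
        cur = band(frac)
        if prev is not None and cur != prev:
            yield week, prev, cur
        prev = cur

# Hypothetical weekly history of the specialist-eligible spend fraction.
weekly_fraction = [0.18, 0.19, 0.23, 0.31, 0.47, 0.52, 0.55]
for week, old, new in crossings(weekly_fraction):
    print(f"week {week}: {old} -> {new}")
# week 2: healthy -> suspect
# week 5: suspect -> alert
```

Reporting only the transitions keeps the alarm quiet during long stretches inside one band, so a page fires once per crossing rather than once per week.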

Worked example end-to-end

A production retrieval system running this diagnostic on a representative week of multi-million-request traffic produces a recognizable trajectory.

Test 1. Provider billing concentrates 70–90% of inference spend on one frontier model — already enough to proceed, with the rest of the diagnostic clarifying where the spend lands.

Test 2. Step-level attribution typically lands with rerank at 40–55%, generate at 20–30%, faithfulness at 15–25%, rewrite and miscellaneous absorbing the remainder. Two or three non-generate steps clear the 10% threshold. The architecture is paying frontier rates to score retrieval and check grounding.

Test 3. A specialized counterfactual replaces rerank, faithfulness, and rewrite with specialist-tier endpoints and holds generate at frontier. Specialist-eligible spend fraction lands in the 60–80% band for systems exhibiting this pattern, with projected cost reductions of 60–80% on the inference bill — almost entirely on steps the generate path never touches.

Test 4. Wired into a cost dashboard, the scalar typically reveals a steady climb — traffic outgrows any architectural compensation, and the fraction drifts upward over quarters.

Treatment §1 (specialist reranker) is the first ship and typically takes the specialist-eligible fraction from 0.6–0.8 down to 0.1–0.3 in a single release. Treatment §3 (distilled faithfulness classifier) trims another 10–20 points. Post-specialization, generate becomes 70–90% of inference spend — the shape the architecture should have had from the start. Per-request latency drops materially (three serial frontier calls collapse into one), and quality on the generate step holds because that step was never touched.

Treatment

Specialization is four moves plus a last-mile optimization. Each ships on its own, but order matters: cheap-to-build specializations unlock the expensive-to-build ones, and impact compounds.

1. Reranker — replace LLM-rerank with a cross-encoder

A cross-encoder reranker trained on roughly 10k labeled (query, doc, relevance) triples replaces every LLM-based relevance scoring step. Cost drops by roughly 100x, latency falls from 100+ms to sub-20ms, and, most importantly, calibration becomes possible. The reranker emits a thresholdable score (see threshold by feel). The frontier LLM emits free text that defeats both caching and threshold-setting.

# Drop-in replacement for "ask the LLM to score top-50 docs for relevance".
# In production: swap for zerank-2, BGE-reranker, Cohere rerank, etc.
scores = reranker.score(query, candidates)  # calibrated scalars in [0, 1]
top_k = sorted(zip(candidates, scores), key=lambda x: -x[1])[:5]

Why this first. Rerank is the single largest non-generate spend in almost every production stack. Labeled-triple cost is low — teacher-LLM labels on a sampled candidate set bootstrap the training set. The replacement is a drop-in: same input shape, same output shape, 100x cheaper. The tradeoff: cross-encoders need re-training when the corpus drifts substantially, manageable with the per-cluster monitoring from eval drift.

If only one move ships from this playbook, it is this one.
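What a thresholdable score buys is easiest to see when the scorer is just a function. A minimal runnable sketch of the score, filter, and top-k shape: `rerank` and `toy_overlap_score` are hypothetical names, and the Jaccard word-overlap scorer is a stand-in chosen so the example runs without a model. It is not a trained cross-encoder and says nothing about any real reranker's API.

```python
from typing import Callable, Sequence

def rerank(query: str, candidates: Sequence[str],
           score_fn: Callable[[str, str], float],
           k: int = 5, min_score: float = 0.0):
    """Score every candidate once, drop those under min_score, return the top k."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    kept = [pair for pair in scored if pair[1] >= min_score]
    return sorted(kept, key=lambda pair: -pair[1])[:k]

def toy_overlap_score(query: str, doc: str) -> float:
    """Stand-in scorer (Jaccard word overlap); NOT a trained cross-encoder."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0

docs = [
    "reset a forgotten password",
    "billing and invoices",
    "how to reset your password by email",
    "office locations",
]
results = rerank("reset password", docs, toy_overlap_score, k=2, min_score=0.1)
for doc, score in results:
    print(f"{score:.2f}  {doc}")
# 0.50  reset a forgotten password
# 0.29  how to reset your password by email
```

Because the score is a number rather than free text, `min_score` can be tuned against labeled data and the scored pairs can be cached by (query, doc) key.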

2. Embeddings plus retrieval — replace LLM-document-scoring with vector retrieval plus reranker

When the architecture uses an LLM for first-pass document scoring (asking the model “which of these N docs is relevant?”), the entire stage should be replaced by embedding-based vector retrieval plus the reranker from §1. The LLM should never see the document corpus at scoring time.

# Two-stage: dense retrieval narrows N=1M -> 50, specialist reranker scores 50.
# The frontier LLM only sees the top-5 after reranking.
top_50 = embed_index.search(query, k=50)
top_5  = reranker.rerank(query, top_50, k=5)
answer = frontier_llm.generate(query, context=top_5)

Why this is non-negotiable. A frontier LLM doing first-pass relevance scoring is the most expensive way to compute a dot product. Vector retrieval at scale is microseconds per query; the LLM call to rank candidates is hundreds of milliseconds and dollars per thousand requests. Embedding plus retrieval plus rerank is the correct three-stage shape; LLM-only retrieval is the architectural anti-pattern this playbook exists to name. Tradeoff: embeddings need refreshing when the corpus changes, and rare-vocabulary queries underperform; the fix is hybrid retrieval (dense plus sparse lexical retrieval).
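The two-stage shape can be exercised end to end with random vectors. A minimal sketch, assuming a hypothetical corpus of 10,000 unit-normalized 64-dimensional embeddings held in memory; `dense_search`, `rerank`, and `placeholder_scorer` are illustrative names. A production index would use an approximate-nearest-neighbor library, and a real reranker would read the query and document text rather than their vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical corpus: 10,000 documents, 64-dim embeddings, unit-normalized.
doc_vecs = rng.normal(size=(10_000, 64))
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)

def dense_search(query_vec, k=50):
    """Stage 1: on unit vectors cosine similarity is a dot product; keep the top k."""
    q = query_vec / np.linalg.norm(query_vec)
    sims = doc_vecs @ q
    top = np.argpartition(-sims, k)[:k]   # unordered top-k without a full sort
    return top[np.argsort(-sims[top])]    # sort only the k survivors

def rerank(query_vec, candidate_ids, score_fn, k=5):
    """Stage 2: run the expensive scorer on the shortlist only."""
    scores = np.array([score_fn(query_vec, doc_vecs[i]) for i in candidate_ids])
    order = np.argsort(-scores)[:k]
    return candidate_ids[order], scores[order]

def placeholder_scorer(query_vec, doc_vec):
    """Stand-in for a trained reranker; it simply reuses cosine similarity."""
    return float(doc_vec @ query_vec / np.linalg.norm(query_vec))

# Query: a noisy copy of document 123, so the right answer is known in advance.
query = doc_vecs[123] + 0.05 * rng.normal(size=64)
top_50 = dense_search(query, k=50)
top_5, top_5_scores = rerank(query, top_50, placeholder_scorer, k=5)

print("documents scored by stage 2:", len(top_50), "of", len(doc_vecs))
print("best match:", int(top_5[0]))
```

The point of the shape is the first print line: the expensive scorer touches 50 documents, not 10,000, and the generator downstream sees only five.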

3. Distilled judge — replace frontier-LLM judges in offline eval and faithfulness

When LLM-as-judge dominates eval spend, or when a faithfulness LLM call sits on the request path, distill the teacher into a small judge trained on (query, doc, teacher_score) triples. Roughly 5k cached triples reach 95%+ agreement with the teacher on held-out pairs. Ongoing scoring costs roughly 1/100 as much per call.

# Distillation: teacher labels once, student serves forever.
teacher_scores = teacher_llm.score_batch(triples)   # one-time, offline
student = train_cross_encoder(triples, teacher_scores)
# Production: student.score(query, doc) replaces teacher_llm.score(query, doc)

Why this matters across the pipeline. Distilled judges power offline eval, faithfulness scoring at request time, and active-learning labeling for the reranker from §1. One distillation effort feeds three different specialists. The teacher is kept as the oracle — re-validated weekly on a small random sample to detect distillation drift. Tradeoff: the student degrades silently on out-of-distribution queries; the weekly oracle re-check is mandatory.
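The weekly oracle re-check reduces to one agreement number on a fresh sample. A minimal sketch, assuming binary teacher labels and student scores in [0, 1]; the 0.5 decision threshold and the 0.95 floor (borrowed from the 95%+ agreement figure above) are assumptions to tune, and the sample data is constructed by hand.

```python
def agreement_rate(teacher_labels, student_scores, threshold=0.5):
    """Fraction of sampled pairs where the thresholded student matches the teacher."""
    if len(teacher_labels) != len(student_scores):
        raise ValueError("labels and scores must align")
    if not teacher_labels:
        raise ValueError("empty sample")
    hits = sum((s >= threshold) == bool(t)
               for t, s in zip(teacher_labels, student_scores))
    return hits / len(teacher_labels)

def drift_check(teacher_labels, student_scores, floor=0.95):
    """Return (agreement rate, True if it fell below the floor)."""
    rate = agreement_rate(teacher_labels, student_scores)
    return rate, rate < floor

# Hand-built weekly sample of 200 pairs: the student mirrors the teacher,
# except on the first 14 pairs, where its score is flipped.
teacher = [i % 2 for i in range(200)]
student = [0.9 if t else 0.1 for t in teacher]
for i in range(14):
    student[i] = 1.0 - student[i]

rate, drifted = drift_check(teacher, student)
print(f"agreement: {rate:.1%}  drifted: {drifted}")
# agreement: 93.0%  drifted: True
```

A sample of a few hundred pairs keeps the teacher bill for the check small while still resolving a drop of a few points.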

4. Cheap-then-expensive cascade — escalate only when the cheap stage is unsure

Once §1–§3 are in place, the next move is a confidence-aware cascade: route obvious cases to the specialist, escalate to the frontier LLM only when the specialist’s calibrated score sits in a defined uncertainty band.

# Cascade: specialist handles the confident cases; LLM picks up the uncertain band.
LO, HI = 0.35, 0.75
result = specialist.score(query, doc)
if LO <= result.confidence <= HI:
    result = frontier_llm.score(query, doc)

Why last in the order. A cascade only saves money if the cheap stage is calibrated — Treatments §1 and §3 are what produce a thresholdable score. Building a cascade on top of an uncalibrated reranker is the failure mode named in cascade saturation: the cheap stage silently passes cases it should escalate, and the stated savings evaporate. Tradeoff: cascades add a confidence-band hyperparameter that needs per-cluster tuning when the distribution shifts.
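Whether a cascade saves money depends on its escalation rate, so it is worth counting. A minimal sketch that wraps the band check from the snippet above with a counter and a blended-cost estimate; `CascadeStats`, `cascade_score`, `blended_cost`, the stub frontier call, and the per-call dollar figures are all hypothetical.

```python
from dataclasses import dataclass

LO, HI = 0.35, 0.75  # uncertainty band, as in the snippet above

@dataclass
class CascadeStats:
    total: int = 0
    escalated: int = 0

    @property
    def escalation_rate(self) -> float:
        return self.escalated / self.total if self.total else 0.0

def cascade_score(specialist_score, escalate, stats):
    """Keep the specialist's score unless it falls inside [LO, HI]; then escalate."""
    stats.total += 1
    if LO <= specialist_score <= HI:
        stats.escalated += 1
        return escalate()
    return specialist_score

def blended_cost(rate, cheap, expensive):
    """Expected per-call cost: every call pays the cheap stage, escalations pay both."""
    return cheap + rate * expensive

def frontier_stub():
    """Hypothetical stand-in for a frontier LLM scoring call."""
    return 0.5

stats = CascadeStats()
specialist_scores = [0.05, 0.92, 0.40, 0.81, 0.10, 0.70, 0.97, 0.02, 0.88, 0.15]
final = [cascade_score(s, frontier_stub, stats) for s in specialist_scores]

cheap, expensive = 0.0003, 0.039  # hypothetical dollars per call
cost = blended_cost(stats.escalation_rate, cheap, expensive)
print(f"escalation rate: {stats.escalation_rate:.0%}")
print(f"blended cost:    ${cost:.4f} vs ${expensive:.4f} frontier-only")
```

Tracking the escalation rate over time also gives an early signal of the calibration drift described above: a rate that falls without a matching change in traffic suggests the cheap stage has become overconfident.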

5. Quantization — last-mile cost lever on the specialists

After §1–§4, the specialist models themselves can be quantized: int8 or int4 weights, sometimes with recalibration after quantization to restore score distributions. Quantization buys a 2–4x latency reduction and a similar throughput-per-GPU multiplier. It is not a substitute for specialization (a quantized frontier LLM still costs more than a specialist); it is a multiplier on top of it.

Quantizing the frontier LLM itself is rarely the team’s lever; the provider has already done it. Quantization is a specialist-side optimization.
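For intuition about what int8 weights trade away, here is a minimal sketch of symmetric per-tensor quantization in NumPy. It is one common scheme, not necessarily the one a given serving stack uses, and the weight matrix is random data standing in for a specialist's layer. It shows the memory side only; latency gains depend on the kernels and hardware.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: map [-max|w|, max|w|] onto [-127, 127]."""
    scale = float(np.abs(w).max()) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale round-trips exactly
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float32 weights from int8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.05, size=(256, 256)).astype(np.float32)  # hypothetical layer

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = float(np.abs(w - w_hat).max())

print(f"bytes: {w.nbytes} -> {q.nbytes}")
print(f"max abs error: {max_err:.6f} (rounding bound: {scale / 2:.6f})")
```

The rounding error per weight is bounded by half the scale, which is why a post-quantization check of the specialist's score distribution is a sensible follow-up.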

What does not work — and is tried first anyway

  • “Just switch from Opus to Sonnet.” Same architecture, smaller frontier model, roughly 3x cost reduction. Doesn’t address the structural problem: the system still pays LLM rates for non-LLM work. Quality regresses on the generate step (the one that needed the larger model) while rerank, faithfulness, and rewrite continue running unchanged. The A/B test consumes a quarter and the metric doesn’t move because the bottleneck isn’t model size.
  • “Cache the LLM responses.” Generation outputs are too varied to cache meaningfully. Rerank and faithfulness could be cached, but the LLM emits free-text that’s lexically different across runs. Specialists emit deterministic scores that compose with caches — once the specialists exist.
  • “Reduce the prompt size.” Halving the prompt halves the cost of one call. Specializing the call eliminates 99% of it.
  • “Add a budget alert.” Slows the curve. Doesn’t bend it.

LLMs are general; production has specialist work; pay the specialist. The frontier LLM does one thing nothing else does — open-ended generation from clean context. Everything in front of it is closed-form: rank these, score this, classify that, check this against that. Closed-form work belongs in a model trained for it. The architecture is a constellation of small specialists wrapped around one frontier LLM, not one frontier LLM doing every step.

This isn’t this pattern when…

| Observation | Probable pattern | Read next |
| --- | --- | --- |
| Bill is high, but spend is already spread across model tiers | Frontier-model selection issue, not architectural | eval-spend overrun |
| Reranker exists, but it sits on the request path adding 300ms | Structural placement, not specialization | reranker on the request path |
| Specialists exist, but agreement with the teacher has drifted in production | Student–teacher drift | distillation drift |
| Cascade is in place but no longer saves money | Cheap stage’s calibration drifted | cascade saturation |
| Bill is high because every dashboard refresh re-scores the eval set | Eval cost, not inference cost | eval-spend overrun |
| Generate-step quality dropped after a model swap | Calibration regression on threshold | threshold by feel |

Disambiguation rule of thumb: single-LLM overspend is about which model class is doing the work. Reranker-on-the-request-path is about where the specialist sits in the call graph. Distillation drift is about whether the specialist still agrees with its teacher. Same end-state symptom — bill up, quality flat — different root cause.

Numbers that matter

| signal | healthy | suspect | confirmed |
| --- | --- | --- | --- |
| concentration on one frontier model | under 40% | 40–70% | over 70% |
| specialist-eligible spend fraction | under 20% | 20–50% | over 50% |
| number of frontier LLM calls per request | 1 | 2 | 3+ |
| per-step cost share of rerank or faithfulness | under 5% | 5–15% | over 15% |
| generate-step share of total request cost | over 70% | 40–70% | under 40% |
| mean tokens-per-request | under 5k | 5–15k | over 15k |

The single most actionable scalar is specialist-eligible spend fraction: plot it weekly, alarm when it crosses 0.3 (inside the suspect band, ahead of the 0.5 line), and treat any crossing as a planning trigger for the next specialization.

Adjacent patterns

  • Reranker on the request path — the next failure mode after shipping Treatment §1. Once the specialist reranker exists, the question becomes where in the call graph it sits, and a cross-encoder that adds 300ms to user-visible latency is the canonical placement bug.
  • Distillation drift — the failure mode created by Treatment §3. The student holds agreement with the teacher on the training distribution and silently degrades as production distribution shifts. Worth knowing before the distillation ships.
  • Cascade saturation — the failure mode created by Treatment §4. A cheap-then-expensive cascade saves money until the cheap stage’s calibration drifts and stops escalating cases it should. Stated savings evaporate quietly.
  • Eval-spend overrun — the eval-side mirror of this pattern. Same thesis applied to LLM-as-judge in the eval loop rather than to LLM calls in the request path.

This pattern is the parent of the first three patterns above: specializing creates them. The cost of doing the right thing is that the failure modes get more interesting, not fewer. The catalog is structured around that fact.

The team writing this, ZeroEntropy, trains specialized small models (zembed-1, zerank-2) for the production stacks where these patterns show up.