Production AI Failure Modes
Every production AI system is unique in detail: different stacks, different corpora, different traffic distributions, different cost constraints.
The failure modes it experiences aren't.
A small set of patterns recurs across teams, products, and verticals, each one with the same symptom, the same mechanism underneath, the same range of fixes. This catalog gives them names — written by the team behind the world's best embedding and reranker models.
Looking for atomic definitions instead? See /concepts/ — the vocabulary every pattern is built from.
- I
Our offline metrics keep climbing every week but our users keep saying it feels worse.
Eval Drift: The offline metric improves while user-signal degrades because the eval set was sampled at time T from a distribution that has since moved.
- II
The retriever finds the right document and the model still gives wrong or hand-wavy answers.
Right Doc, Wrong Answer: Retrieval surfaces the relevant passage and generation contradicts or hand-waves past it; the failure is in the LLM-doc interaction, not in retrieval.
- III
We picked the score threshold because the number looked round, and now it's quietly broken.
Threshold by Feel: A score cutoff was picked because the number looked round, not because it came from a precision/recall target. It works until the upstream model changes, and then it breaks invisibly.
- IV
Our LLM-as-judge bill is bigger than our production inference bill.
Eval-Spend Overrun: LLM-as-judge becomes the dominant inference line item because every dashboard refresh re-scores the same (query, doc) pairs against the same teacher.
- V
Our inference cost has crept up and most of the calls aren't even doing the part the LLM is uniquely good at.
Single-LLM Overspend: A frontier LLM is used for work a small specialist could do at roughly 1/100 of the cost (relevance scoring, faithfulness checking, intent classification) because the work in front of it was never attributed.
- VI
We picked the best embedding model on the leaderboard and can't get past a ceiling on our own data.
Embedding Plateau: An off-the-shelf embedder hits a ceiling on a domain whose vocabulary and relevance semantics are outside web text; continuing to fine-tune that one model is the wrong move.
- VII
The right document is in the context window and the model ignores it anyway.
Context Rot: The relevant document is placed in mid-context and the LLM under-attends to it; the fix is order-of-context, not which document was retrieved.
- VIII
The reranker is accurate but it's eating our latency budget.
Reranker on the Request Path: A cross-encoder reranker's joint attention burns 300 ms or more on the user-visible request; the fix is structural (cascade, batch, hide behind LLM prefill), not model substitution.
- IX
Our distilled judge still passes the held-out eval but stopped behaving in production.
Distillation Drift: A distilled judge or scorer holds its agreement with the teacher on training data and silently degrades on the production distribution as that distribution shifts.
- X
We deployed a cheap-then-expensive cascade. The cost graph looks great. Something feels off.
Cascade Saturation: A cheap-then-expensive cascade saves money until the cheap stage's confidence calibration itself drifts, and the cascade silently lets through cases it should escalate.
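Threshold by Feel (III) contrasts a round-number cutoff with one derived from a precision/recall target. A minimal sketch of the latter, assuming a labeled sample of scores and relevance judgments is available. The function name `threshold_for_precision`, the sample data, and the 0.75 target are illustrative assumptions, not part of any library or of this catalog:

```python
from typing import Optional, Sequence


def threshold_for_precision(
    scores: Sequence[float],
    labels: Sequence[bool],
    target_precision: float,
) -> Optional[float]:
    """Return the lowest cutoff whose precision on the sample meets the target.

    An item is accepted when score >= cutoff. Returns None if no cutoff qualifies.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    # Walk candidates from the highest score down, accumulating accepted items.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best: Optional[float] = None
    true_positives = 0
    for i, (score, is_relevant) in enumerate(ranked, start=1):
        true_positives += int(is_relevant)
        # Evaluate only at the last item of a run of tied scores, so the cutoff
        # accounts for every item it would actually accept.
        is_last_of_tie = i == len(ranked) or ranked[i][0] != score
        if is_last_of_tie and true_positives / i >= target_precision:
            best = score  # a lower qualifying cutoff means higher recall
    return best


# Hypothetical labeled sample: (score, was the item actually relevant?)
sample_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
sample_labels = [True, True, False, True, False]
cutoff = threshold_for_precision(sample_scores, sample_labels, target_precision=0.75)
print(cutoff)  # 0.6
```

Because the cutoff is computed from data rather than chosen by eye, it can be recomputed on a fresh labeled sample whenever the upstream model changes, which is one way to keep the break from being invisible.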
Each entry names a class of production failure we've seen often enough to give it a name. The point is shared vocabulary: "we've got eval drift," "that's threshold-by-feel," "single-LLM overspend." These are the kind of thing a senior engineer should be able to say in a meeting and have the team know exactly what's meant. If a pattern you keep hitting isn't in the catalog yet, tell us.
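As one concrete illustration of the vocabulary, Eval-Spend Overrun (IV) comes from re-scoring the same (query, doc) pairs against the same teacher. A minimal sketch of memoizing judge scores, assuming the judge is deterministic enough that a cached score is acceptable. `JudgeCache`, `judge_version`, and the in-memory dictionary are hypothetical names chosen for illustration; a real deployment would likely use a persistent store:

```python
import hashlib
from typing import Callable, Dict


class JudgeCache:
    """Memoize judge scores by (judge version, query, doc) so repeats cost nothing."""

    def __init__(self, judge: Callable[[str, str], float], judge_version: str) -> None:
        self._judge = judge
        self._version = judge_version
        self._scores: Dict[str, float] = {}
        self.judge_calls = 0  # number of times the underlying judge was invoked

    def _key(self, query: str, doc: str) -> str:
        # Length-prefix each field so different (query, doc) splits cannot collide.
        h = hashlib.sha256()
        for part in (self._version, query, doc):
            data = part.encode("utf-8")
            h.update(len(data).to_bytes(8, "big"))
            h.update(data)
        return h.hexdigest()

    def score(self, query: str, doc: str) -> float:
        key = self._key(query, doc)
        if key not in self._scores:
            # Cache miss: pay for the judge once, then reuse the result.
            self._scores[key] = self._judge(query, doc)
            self.judge_calls += 1
        return self._scores[key]
```

Including the judge version in the key means a change of teacher or prompt invalidates old scores instead of silently reusing them.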
