
Production AI Failure Modes

Every production AI system is unique in detail: different stacks, different corpora, different traffic distributions, different cost constraints.

The failure modes it experiences aren't.

A small set of patterns recurs across teams, products, and verticals, each with the same symptom, the same underlying mechanism, and the same range of fixes. This catalog, written by the team behind the world's best embedding and reranker models, gives them names.

Looking for atomic definitions instead? See /concepts/ — the vocabulary every pattern is built from.

The catalog: 10 patterns
  1. I
    Our offline metrics keep climbing every week but our users keep saying it feels worse.
    Eval Drift

    The offline metric improves while user-signal degrades because the eval set was sampled at time T from a distribution that has since moved.

  2. II
    The retriever finds the right document and the model still gives wrong or hand-wavy answers.
    Right Doc, Wrong Answer

    Retrieval surfaces the relevant passage and generation contradicts or hand-waves past it; the failure is in the LLM-doc interaction, not in retrieval.

  3. III
    We picked the score threshold because the number looked round, and now it's quietly broken.
    Threshold by Feel

    A score cutoff was picked because the number looked round, not because it came from a precision/recall target — it works until the upstream model changes, and then it breaks invisibly.

  4. IV
    Our LLM-as-judge bill is bigger than our production inference bill.
    Eval-Spend Overrun

    LLM-as-judge becomes the dominant inference line item because every dashboard refresh re-scores the same (query, doc) pairs against the same teacher.

  5. V
    Our inference cost has crept up and most of the calls aren't even doing the part the LLM is uniquely good at.
    Single-LLM Overspend

    A frontier LLM is used for work a small specialist could do at roughly 1/100 of the cost (relevance scoring, faithfulness checking, intent classification) because the work in front of it was never attributed.

  6. VI
    We picked the best embedding model on the leaderboard and can't get past a ceiling on our own data.
    Embedding Plateau

    An off-the-shelf embedder hits a ceiling on a domain whose vocabulary and relevance semantics are outside web text; continuing to fine-tune that one model is the wrong move.

  7. VII
    The right document is in the context window and the model ignores it anyway.
    Context Rot

    The relevant document is placed in mid-context and the LLM under-attends to it; the fix is order-of-context, not which document was retrieved.

  8. VIII
    The reranker is accurate but it's eating our latency budget.
    Reranker on the Request Path

    A cross-encoder reranker's joint attention burns 300 ms or more on the user-visible request; the fix is structural — cascade, batch, hide behind LLM prefill — not model substitution.

  9. IX
    Our distilled judge still passes the held-out eval but stopped behaving in production.
    Distillation Drift

    A distilled judge or scorer holds its agreement with the teacher on training data and silently degrades on the production distribution as that distribution shifts.

  10. X
    We deployed a cheap-then-expensive cascade. The cost graph looks great. Something feels off.
    Cascade Saturation

    A cheap-then-expensive cascade saves money until the cheap stage's confidence calibration itself drifts, and the cascade silently lets through cases it should escalate.
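
For Threshold by Feel, the alternative to a round number is a cutoff derived from a stated precision target on labeled data. The sketch below is one way to do that, not a prescribed method; the function name `threshold_for_precision`, the labeled sample, and the 0.75 target are all illustrative assumptions.

```python
from typing import List, Optional, Tuple


def threshold_for_precision(
    scored: List[Tuple[float, bool]], target_precision: float
) -> Optional[float]:
    """Return the lowest score cutoff whose precision meets the target.

    `scored` holds (score, is_relevant) pairs from a labeled sample.
    Everything with score >= cutoff is kept, so the lowest qualifying
    cutoff keeps the most items while still meeting the target.
    Returns None when no cutoff reaches the target.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    best = None
    true_positives = 0
    for kept, (score, is_relevant) in enumerate(ranked, start=1):
        true_positives += int(is_relevant)
        # Only evaluate a cutoff at the last item of a run of tied scores.
        is_last_of_tie = kept == len(ranked) or ranked[kept][0] != score
        if is_last_of_tie and true_positives / kept >= target_precision:
            best = score
    return best


# Hypothetical labeled sample: (model score, human relevance label).
labeled = [
    (0.95, True), (0.90, True), (0.80, False),
    (0.70, True), (0.60, False), (0.40, False),
]
cutoff = threshold_for_precision(labeled, target_precision=0.75)
```

Re-running this whenever the upstream model changes makes the break visible: the cutoff moves, or the function returns `None` because the target is no longer reachable.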
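
For Eval-Spend Overrun, the mechanism is re-scoring identical (query, doc) pairs, so one possible mitigation is to memoize judge scores keyed on the judge version plus the pair. This is a minimal in-memory sketch under that assumption; `CachedJudge` and `stand_in_judge` are hypothetical names, and a real setup would likely persist the cache.

```python
import hashlib
from typing import Callable, Dict


class CachedJudge:
    """Wrap a judge function so each (query, doc) pair is scored once."""

    def __init__(self, judge: Callable[[str, str], float], judge_version: str):
        self._judge = judge
        self._version = judge_version
        self._cache: Dict[str, float] = {}
        self.judge_calls = 0  # number of times the underlying judge ran

    def _key(self, query: str, doc: str) -> str:
        # The version is part of the key, so changing the judge model or
        # prompt invalidates old scores instead of silently reusing them.
        payload = "\x1f".join([self._version, query, doc])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def score(self, query: str, doc: str) -> float:
        key = self._key(query, doc)
        if key not in self._cache:
            self.judge_calls += 1
            self._cache[key] = self._judge(query, doc)
        return self._cache[key]


def stand_in_judge(query: str, doc: str) -> float:
    """Placeholder for an expensive LLM-as-judge call (assumption)."""
    return 1.0 if query.lower() in doc.lower() else 0.0


judge = CachedJudge(stand_in_judge, judge_version="v1")
# Three dashboard refreshes over the same pair trigger one judge call.
scores = [judge.score("refund policy", "Our refund policy is...") for _ in range(3)]
```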
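
For Context Rot, the fix named above is order-of-context. One possible ordering, assuming the start and end of the context receive more attention than the middle, is to alternate ranked documents between the front and the back so the weakest land mid-context. The function name `order_for_context` is illustrative.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def order_for_context(ranked_docs: Sequence[T]) -> List[T]:
    """Place the strongest documents at the edges of the context.

    `ranked_docs` is ordered best-first. Documents at even positions
    fill the front; documents at odd positions fill the back in reverse,
    so the lowest-ranked documents end up in the middle.
    """
    front = list(ranked_docs[0::2])
    back = list(ranked_docs[1::2])
    return front + back[::-1]


ordered = order_for_context(["d1", "d2", "d3", "d4", "d5"])
# ordered == ["d1", "d3", "d5", "d4", "d2"]
```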
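
For Cascade Saturation, the drift is in the cheap stage's confidence calibration, which can be tracked by auditing a sample of its decisions and comparing stated confidence with observed accuracy. The sketch below uses expected calibration error as one such measure; the audited samples and the choice of ten bins are assumptions for illustration.

```python
from typing import List, Tuple


def expected_calibration_error(
    samples: List[Tuple[float, bool]], bins: int = 10
) -> float:
    """Weighted mean gap between confidence and accuracy across bins.

    `samples` holds (confidence in [0, 1], was_correct) pairs for cheap-stage
    decisions that were audited against the expensive stage or a label.
    """
    if not samples:
        return 0.0
    totals = [0] * bins
    confidence_sums = [0.0] * bins
    correct_counts = [0] * bins
    for confidence, was_correct in samples:
        idx = min(int(confidence * bins), bins - 1)
        totals[idx] += 1
        confidence_sums[idx] += confidence
        correct_counts[idx] += int(was_correct)
    ece = 0.0
    for total, conf_sum, correct in zip(totals, confidence_sums, correct_counts):
        if total:
            gap = abs(correct / total - conf_sum / total)
            ece += (total / len(samples)) * gap
    return ece


# Hypothetical audits: same stated confidence, different real accuracy.
calibrated = [(0.9, True)] * 9 + [(0.9, False)]
drifted = [(0.9, True)] * 5 + [(0.9, False)] * 5
ece_before = expected_calibration_error(calibrated)
ece_after = expected_calibration_error(drifted)
```

A rising value on a recurring audit sample is a signal that the escalation threshold no longer means what it meant when it was set, even while the cost graph looks unchanged.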

About the catalog

Each entry names a class of production failure we've seen often enough to give it a name. The point is shared vocabulary: "we've got eval drift," "that's threshold-by-feel," "single-LLM overspend." These are the kind of thing a senior engineer should be able to say in a meeting and have the team know exactly what's meant. If a pattern you keep hitting isn't in the catalog yet, tell us.

ZeroEntropy