Embedding Plateau
An off-the-shelf embedder hits a ceiling on a domain whose vocabulary and relevance semantics are outside web text; continuing to fine-tune that one model is the wrong move.
Symptom
A frontier embedding model — openai-text-embedding-3-large, voyage-3, the current MTEB leader — ships on day one and lands an NDCG@10 of roughly 0.40 to 0.50 on a domain corpus. Weeks of hyperparameter sweeps, deeper chunks, query rewrites, and second-embedder ensembles fail to move the number.
“We picked the best embedding model on the leaderboard and can’t get past a ceiling on our own data.”
The pattern, concretely:
- The leaderboard pick stalls at a domain-specific ceiling. The MTEB top model places mid-pack on the team’s eval; the gap to the second-best off-the-shelf model is under one point. Vendor swaps move the metric by noise.
- Recall is the limiting factor, not ranking. Recall@100 is the variable that won’t budge. The relevant doc is absent from the candidate set; no reranker on top can recover what was never there.
- Failures cluster on queries with literal tokens. Statute citations, ticker symbols, function names, ICD codes — anywhere the user typed a domain-specific string and expected the document containing that exact string. The embedder treats the string as noise.
- Hyperparameter sweeps move nothing. Chunk size, overlap, pooling strategy, query prefix — every runtime knob explores a flat region.
The instinct that the model is undertuned is wrong. It is out of distribution. Off-the-shelf embedders trained on web text place case law, clinical notes, internal codebases, 10-K filings, and domain-shorthand customer-support tickets outside their training distribution; performance ceilings on these corpora are typical, not exceptional. The plateau is the model meeting the edge of what it can know.
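To confirm that recall rather than ranking is the limit, it helps to compute Recall@k directly on the candidate lists. The following is a minimal sketch, assuming each query has a set of labeled relevant doc IDs; the function name and the toy IDs are illustrative and do not come from any particular library.

```python
def recall_at_k(ranked_ids, relevant_ids, k=100):
    """Fraction of the labeled relevant docs that appear in the top-k candidates."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


# Toy candidate lists (hypothetical doc IDs): one query found its doc, one did not.
candidates = {
    "q1": ["d7", "d2", "d9"],
    "q2": ["d4", "d1", "d3"],
}
labels = {"q1": {"d2"}, "q2": {"d8"}}

per_query = {q: recall_at_k(candidates[q], labels[q], k=3) for q in candidates}
mean_recall = sum(per_query.values()) / len(per_query)
print(per_query, mean_recall)
```

A query that scores 0.0 here cannot be rescued by a reranker, which only reorders the candidates it is given.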
Mechanism
A dense embedding model maps text into a vector space so that semantically similar passages land close together under cosine similarity or dot product. “Semantically similar” is a property of the training distribution. On web text — Common Crawl, Wikipedia, mined question-answer pairs — it means roughly “these two passages discuss the same topic in mostly-shared vocabulary.” On a domain corpus, both the things-discussed and the vocabulary diverge:
- Domain-specific homographs. “trust” in legal vs consumer sense. “positive” in clinical vs sentiment sense. “order” in finance vs general. The embedder collapses them because in web text they mostly collapse.
- Structural signals web text doesn’t have. Statute citations (28 USC § 1331), ICD codes (F33.1), ticker symbols (MSFT), function signatures, SKUs. The embedder treats these as low-frequency noise and discards their information at the pooling layer.
- A different what-counts-as-relevant. A legal user’s “relevant” is “this statute applies”; a clinician’s “relevant” is “this is the same diagnosis at the same stage.” Web-text relevance is “these are about the same topic.” The gap is everything.
The compounding effect shows up in rank order. On domain queries with literal tokens, an off-the-shelf dense embedder and a BM25 branch over the same corpus routinely disagree by 30 to 60 positions for the same relevant document — dense places it deep in the tail, BM25 surfaces it near the top. The lexical signal was always there; the dense model discarded it at training time. Aggregate this over an eval and the dense branch alone reads NDCG@10 around 0.45, BM25 alone around 0.40, their RRF fusion around 0.58 — 10 to 15 NDCG points sitting in the lexical channel until someone connects it.
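The rank disagreement described above can be measured per query. This is a small sketch, assuming both branches return ordered lists of doc IDs; `rank_of` and `rank_displacement` are hypothetical helper names, and the rankings are toy data rather than measurements.

```python
def rank_of(ranking, doc_id):
    """1-based position of doc_id in ranking, or None if it is absent."""
    try:
        return ranking.index(doc_id) + 1
    except ValueError:
        return None


def rank_displacement(dense_ranking, bm25_ranking, relevant_id):
    """Dense rank minus BM25 rank; positive means BM25 placed the doc higher."""
    d = rank_of(dense_ranking, relevant_id)
    b = rank_of(bm25_ranking, relevant_id)
    if d is None or b is None:
        return None
    return d - b


# Toy rankings over 50 hypothetical doc IDs: the relevant doc "doc_41"
# sits at position 42 in the dense list and position 2 in the BM25 list.
dense = [f"doc_{i}" for i in range(50)]
bm25 = ["doc_3", "doc_41"] + [f"doc_{i}" for i in range(50) if i not in (3, 41)]

print(rank_displacement(dense, bm25, "doc_41"))
```

Sorting eval queries by this displacement surfaces the literal-token queries first, which is a quick way to collect examples for the diagnostic below.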
Continuing to fine-tune a single dense embedder on these symptoms hits diminishing returns fast. The model class has structural trouble combining a literal vocabulary signal (citations, codes, SKUs) with a semantic similarity signal inside one fixed-dimensional vector. The fix is not a better dense model; it is a different shape of system.
Diagnostic
Four tests, ordered cheap-to-expensive. Run in order; stop at the first conclusive fire.
Test 1 — vocabulary overlap with web text (under 1 minute, one shell command)
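Mixed-alphanumeric token share is a cheap proxy for how far a corpus has drifted from Common Crawl-style distributions. Many codes, IDs, and version strings surface as tokens containing both letters and digits, and their share rises sharply on out-of-distribution corpora. The filter below counts only purely alphanumeric tokens, so punctuated codes such as F33.1 are left out of both the numerator and the denominator; treat the result as a rough indicator, not an exact measurement.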
# Run against a sample of the corpus. Replace `corpus_sample.txt` with
# 1000 randomly-sampled docs concatenated, one per line.
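awk '{ for(i=1;i<=NF;i++) print tolower($i) }' corpus_sample.txt \
  | grep -E '^[a-z0-9]+$' \
  | awk '/[0-9]/ && /[a-z]/ { mixed++ } { total++ } END {
      # Guard against an empty sample before dividing.
      if (total == 0) { print "no alphanumeric tokens found"; exit }
      printf "mixed-alphanumeric token share: %.1f%%\n", 100*mixed/total
    }'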
| mixed-alphanumeric share | reading |
|---|---|
| under 2% | corpus looks like web text; this pattern is unlikely to dominate |
| 2 to 8% | suspect; run Test 2 |
| over 8% | almost certainly firing; the corpus has structural tokens the embedder discards |
Single number, no Python required. Catches the obvious cases — codebases, clinical text, finance — on its own.
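For corpora where the interesting codes carry punctuation (F33.1, HMAC-SHA256), a Python variant can strip the punctuation instead of dropping the token. This is a sketch under that assumption, not a replacement for the shell check: the thresholds in the table were stated for the shell version, so numbers from the two are not directly comparable. The function name and sample sentence are made up for illustration.

```python
import re


def mixed_alnum_share(text):
    """Percentage of tokens that contain both a letter and a digit.

    Unlike the shell one-liner, punctuation inside a token is stripped
    rather than causing the token to be dropped (an assumption of this sketch).
    """
    tokens = [re.sub(r"[^a-z0-9]", "", t) for t in text.lower().split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    mixed = sum(
        1 for t in tokens
        if re.search(r"[a-z]", t) and re.search(r"[0-9]", t)
    )
    return 100.0 * mixed / len(tokens)


sample = "Patient coded F33.1 after review; see HMAC-SHA256 in SOC2 audit"
print(f"{mixed_alnum_share(sample):.1f}%")
```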
Test 2 — does adding BM25 close the gap? (5 to 15 minutes, small Python)
The directional test. Build a pure BM25 branch on the same corpus, fuse with the dense branch via reciprocal rank fusion, measure the delta. Self-contained fixture below; swap TfidfVectorizer for the production embedder when running for real, and swap the inline docs for the eval set.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from collections import defaultdict
np.random.seed(42)
# Inline fixture standing in for a domain corpus + eval set.
# In production, replace with actual docs and labeled queries.
docs = [
"Bulk Cache Eviction API: purge entries by org_id and key prefix",
"Single-key cache invalidation via DELETE /cache/{key}",
"Configure SAML SSO for enterprise tier",
"Audit log retention policies for SOC2 compliance",
"Rate limits per API key on the free tier",
"Org-scoped role-based access control: admin, member, viewer",
"Function-calling tool schema reference for agent runs",
"Background job scheduler with cron syntax",
"Webhooks: signature verification with HMAC-SHA256",
"Distributed transaction coordinator: two-phase commit",
]
queries = [
"how do I batch-invalidate the cache for an org",
"SAML config",
"free tier rate limit",
"agent run tool schema",
"webhook signature check",
]
# Ground truth: which doc index is relevant for each query.
relevant = [0, 2, 4, 6, 8]
# Dense branch: TfidfVectorizer as a pyodide-safe stand-in for a real
# embedder. In production, swap for a dense model's encode().
vec = TfidfVectorizer(min_df=1, stop_words="english").fit(docs + queries)
doc_emb = vec.transform(docs).toarray()
q_emb = vec.transform(queries).toarray()
def dense_rank(q_vec):
    sims = cosine_similarity(q_vec.reshape(1, -1), doc_emb)[0]
    return list(np.argsort(-sims))
# BM25 branch: classic Robertson-Sparck-Jones formula, k1=1.5, b=0.75.
def tokenize(s):
    # Keep alphanumeric and hyphenated tokens; tokens carrying any other
    # punctuation (for example "API:" or "org_id") are dropped.
    return [t.lower() for t in s.split() if t.isalnum() or "-" in t]

N = len(docs)
df = defaultdict(int)
doc_tokens = [tokenize(d) for d in docs]
for toks in doc_tokens:
    for t in set(toks):
        df[t] += 1
avgdl = np.mean([len(t) for t in doc_tokens])
k1, b = 1.5, 0.75

def bm25_rank(query):
    q_toks = tokenize(query)
    scores = np.zeros(N)
    for i, toks in enumerate(doc_tokens):
        dl = len(toks)
        for t in q_toks:
            if t not in df:
                continue
            idf = np.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            tf = toks.count(t)
            scores[i] += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * dl / avgdl))
    return list(np.argsort(-scores))
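def rrf_merge(rankings, k=60):
    # Reciprocal rank fusion: each list contributes 1 / (k + rank),
    # with rank counted from 1 as in the standard formulation.
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused.keys(), key=lambda d: -fused[d])

def rank_of_relevant(ranking, target):
    # 1-based position of the relevant doc in a ranking.
    return ranking.index(target) + 1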
dense_ranks, bm25_ranks, fused_ranks = [], [], []
for i, (q, rel) in enumerate(zip(queries, relevant)):
    d = dense_rank(q_emb[i])
    b25 = bm25_rank(q)
    f = rrf_merge([d, b25])
    dense_ranks.append(rank_of_relevant(d, rel))
    bm25_ranks.append(rank_of_relevant(b25, rel))
    fused_ranks.append(rank_of_relevant(f, rel))

print("mean rank of relevant doc:")
print(f" dense only: {np.mean(dense_ranks):.2f}")
print(f" bm25 only: {np.mean(bm25_ranks):.2f}")
print(f" rrf fused: {np.mean(fused_ranks):.2f}")
# Output: three mean ranks, lower is better. On this toy fixture the TF-IDF
# stand-in is itself lexical, so the three values land close together; the
# dense/BM25 gap only opens up once a real dense embedder replaces it.
Read the result in NDCG@10 points on the real eval set; the fixture above prints mean rank only. Healthy: fused is within 2 points of dense-only, so the dense model is near its real ceiling and treatment moves to the training side. Suspect: fused is +3 to +5 points over dense-only. Confirmed: fused is +5 points or more over dense-only, so the dense model is discarding lexical signal that BM25 captures for free, and the cheapest mitigation is already in hand.
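Because the fixture reports mean rank while the thresholds are stated in NDCG points, a small conversion helps when reading results. The sketch below assumes exactly one relevant doc per query (so NDCG@10 reduces to 1/log2(rank+1)) and treats any lift under 3 points as healthy, which is an assumption where the text leaves the 2-to-3 range unclassified. The ranks are hypothetical, not measured data.

```python
import math


def ndcg_at_10(rank):
    """NDCG@10 for a query with exactly one relevant doc at 1-based `rank`."""
    return 1.0 / math.log2(rank + 1) if rank <= 10 else 0.0


def classify_hybrid_lift(dense_ranks, fused_ranks):
    """Return (lift in NDCG points, reading) using the thresholds in the text."""
    dense = sum(ndcg_at_10(r) for r in dense_ranks) / len(dense_ranks)
    fused = sum(ndcg_at_10(r) for r in fused_ranks) / len(fused_ranks)
    lift = 100.0 * (fused - dense)
    if lift >= 5.0:
        reading = "confirmed"
    elif lift >= 3.0:
        reading = "suspect"
    else:
        reading = "healthy"
    return lift, reading


# Hypothetical per-query ranks of the relevant doc.
dense_ranks = [3, 1, 12, 2]
fused_ranks = [1, 1, 4, 2]
lift, reading = classify_hybrid_lift(dense_ranks, fused_ranks)
print(f"{lift:+.1f} NDCG points -> {reading}")
```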
Test 3 — does the lift concentrate on rare-vocabulary queries? (30+ minutes, medium Python)
The decisive test. If hybrid lifts the metric, the location of the lift dictates treatment order. Bucket eval queries by whether they contain a domain-specific token — a string rare or absent in standard English — and compare dense-only vs fused NDCG@10 per bucket.
import numpy as np
from collections import Counter
np.random.seed(42)
# Reuse `queries`, `docs`, `relevant`, `q_emb`, `doc_tokens`, `tokenize`,
# `dense_rank`, `bm25_rank`, `rrf_merge`, `rank_of_relevant` from Test 2.
# A "rare-vocabulary" token is one whose document frequency in the corpus
# is at or below the 20th percentile of all corpus-token document
# frequencies. Cheap proxy that needs no external corpus.
all_tokens = [t for toks in doc_tokens for t in set(toks)]
df_counts = Counter(all_tokens)
freqs = np.array(list(df_counts.values()))
rare_threshold = np.percentile(freqs, 20)
rare_tokens = {t for t, c in df_counts.items() if c <= rare_threshold}

def has_rare_token(q):
    return any(t in rare_tokens for t in tokenize(q))

# NDCG@10 for a query with one relevant doc: 1/log2(rank+1) if the doc is
# in the top 10, else 0. The ideal DCG is 1, so no further normalization.
def ndcg_at_10(rank):
    return 1.0 / np.log2(rank + 1) if rank <= 10 else 0.0

buckets = {"rare": [], "common": []}
for i, (q, rel) in enumerate(zip(queries, relevant)):
    d = dense_rank(q_emb[i])
    d_rank = rank_of_relevant(d, rel)
    f_rank = rank_of_relevant(rrf_merge([d, bm25_rank(q)]), rel)
    bucket = "rare" if has_rare_token(q) else "common"
    buckets[bucket].append((ndcg_at_10(d_rank), ndcg_at_10(f_rank)))

for name, scores in buckets.items():
    if not scores:
        continue
    d_mean = np.mean([s[0] for s in scores])
    f_mean = np.mean([s[1] for s in scores])
    print(f"{name:>6} (n={len(scores)}): dense={d_mean:.3f} fused={f_mean:.3f} "
          f"lift={f_mean - d_mean:+.3f}")
# Expected output (illustrative, depends on which queries hit rare tokens):
# rare (n=3): dense=0.421 fused=0.812 lift=+0.391
# common (n=2): dense=0.683 fused=0.701 lift=+0.018
Lift concentrated on the rare-vocabulary bucket with the common-vocabulary bucket barely moving confirms the pattern in its purest form: the off-the-shelf embedder is fine on web-shaped queries and falling off a cliff on domain-shaped queries. Uniform lift across buckets — both rare and common jumping by similar amounts — indicates undertraining on the domain generally, not just on rare tokens. Both halves of the treatment apply, but the contrastive fine-tune (§4 below) carries more weight than hybrid retrieval.
Test 4 — KL(corpus_vocab ‖ web_vocab) as a monitorable scalar
The three-test diagnostic suffices for initial investigation. Ongoing monitoring wants a single scalar summarizing how far the corpus has drifted from web-text vocabulary. A KL divergence over unigram distributions is the right shape.
import numpy as np
from scipy.special import rel_entr
from collections import Counter
# Simulated "web text" reference: a uniform-ish distribution over common
# English function words and topic words. In production, swap for unigram
# counts from a Common Crawl sample or a saved reference distribution.
web_ref_tokens = (
"the of and to in a is that for on with as by this from at it be or "
"system data model user query result return value function method "
"request response service api version error log time"
).split() * 50
web_counts = Counter(web_ref_tokens)
# `doc_tokens` is the tokenized corpus built in Test 2.
corpus_counts = Counter(t for toks in doc_tokens for t in toks)
# Union vocabulary; assign smoothed probability to each.
vocab = set(web_counts) | set(corpus_counts)
eps = 1e-6
p_corpus = np.array([corpus_counts.get(t, 0) + eps for t in vocab])
p_web = np.array([web_counts.get(t, 0) + eps for t in vocab])
p_corpus /= p_corpus.sum()
p_web /= p_web.sum()
kl = float(rel_entr(p_corpus, p_web).sum())
print(f"KL(corpus ‖ web) = {kl:.2f}")
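# On the fixture this prints a value far above the 2.5 threshold (on the
# order of 10 nats or more): the tiny stand-in reference leaves most corpus
# tokens with only `eps` of reference mass, so the result is dominated by
# the smoothing constant. Keep `eps` and the reference fixed across runs.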
Healthy: KL < 1.0 — the corpus is close enough to web text that off-the-shelf embedders should perform near their leaderboard rank. Suspect: KL between 1.0 and 2.5 — domain adaptation will lift the metric measurably. Confirmed: KL > 2.5 — the embedder is operating outside its training distribution and the plateau is structural. Recompute monthly as the corpus grows; a rising trend predicts a falling NDCG before the dashboard catches it.
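For the monthly recomputation, the trend matters more than any single reading. One simple way to summarize it, sketched here as an assumption and not as a prescribed method, is a least-squares slope over the stored readings. The `trend_slope` name and the readings are hypothetical.

```python
def trend_slope(values):
    """Least-squares slope of values against their index (one step = one reading)."""
    n = len(values)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxy = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    return sxy / sxx


# Hypothetical monthly KL readings, not measured data.
monthly_kl = [1.8, 1.9, 2.1, 2.4, 2.6]
slope = trend_slope(monthly_kl)
print(f"KL trend: {slope:+.2f} per month")
```

A persistently positive slope is the signal to re-run Tests 2 and 3 before the retrieval dashboard moves.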
Worked example end-to-end
A representative trajectory on a domain corpus where this pattern fires cleanly: mixed-alphanumeric share lands in the 8 to 15% range (HTTP status codes, error enums, version strings, function names), dense-only NDCG@10 sits around 0.42 to 0.46, BM25-only around 0.38 to 0.42, RRF-fused around 0.55 to 0.60 — a +12 to +15 point lift from hybrid alone. The rare-vocabulary bucket typically lifts from 0.30 to 0.80 under fusion; the common bucket barely moves from around 0.62 to 0.65. KL(corpus ‖ web) falls between 2.5 and 3.5.
From that baseline, the staged treatment delivers a typical decomposition: hybrid retrieval (§1) +10 to +15 NDCG@10 in the first week with no GPU touched; cached query rewriting (§2) +2 to +4 Recall@100 within the month; continued pretraining on the unlabeled corpus (§3) +3 to +6; contrastive fine-tuning on 10k to 20k labeled (query, doc) pairs with mined hard negatives (§4) +8 to +15; a domain-specialized cross-encoder reranker (§5) +5 to +8 on top. End-to-end the metric routinely moves from the low 0.40s into the mid-to-high 0.70s.
Teams typically jump to fine-tuning; the first two interventions usually close half the gap before any GPU comes online.
Treatment
Five interventions in cost-impact order. Each step assumes the previous one is done.
1. Hybrid search — BM25 + dense, RRF-merged
This was Test 2 of the diagnostic. When it lifts the eval, ship it as the new baseline. 5 to 13 NDCG@10 points free; the recipe is the same rrf_merge snippet above. Production shape: store BM25 and dense indices side by side, query both in parallel, fuse at the application layer. No model surgery required.
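# `dense_index`, `bm25_index`, and `embed` stand for the production
# components; `rrf_merge` is the function from Test 2.
def hybrid_retrieve(query, top_k=50):
    dense_hits = dense_index.search(embed(query), top_k=top_k)
    bm25_hits = bm25_index.search(query, top_k=top_k)
    return rrf_merge([
        [h.doc_id for h in dense_hits],
        [h.doc_id for h in bm25_hits],
    ])[:top_k]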
Two production gotchas. Both branches must use the same chunking, or fusion compares apples to oranges; align chunk boundaries first. The k=60 constant in RRF is the default for a reason — tuning it rarely matters and over-tuning it bakes eval-set bias into production.
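The claim about k can be checked on any pair of rankings before spending eval budget on tuning it. The sketch below uses toy rankings over hypothetical doc IDs and a standalone RRF with 1-based ranks; it illustrates the check and is not evidence about any particular corpus.

```python
from collections import defaultdict


def rrf_merge(rankings, k=60):
    """Reciprocal rank fusion with 1-based ranks: score = sum of 1 / (k + rank)."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=lambda d: -fused[d])


# Two toy rankings over hypothetical doc IDs.
dense = ["a", "b", "c", "d"]
bm25 = ["c", "a", "d", "b"]

orders = {k: rrf_merge([dense, bm25], k=k) for k in (10, 60, 100)}
for k, order in orders.items():
    print(k, order)
```

On these lists the fused order is identical for all three values of k. Running the same comparison on real candidate lists shows how often k changes the top results at all.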
2. Query rewriting with cache
A cheap LLM call in front of retrieval expands domain shorthand into longer, more retrievable forms — "ssri side effects" becomes "selective serotonin reuptake inhibitor adverse events drug interactions". Query expansion lifts Recall@100 by 3 to 5 points on most domain corpora. Aggressive caching is mandatory; 60 to 80% of production queries are paraphrases of recurring ones, and the cache hit rate dominates the economics.
# Deterministic temperature is load-bearing: nondeterministic rewrites
# multiply cache keys and collapse the hit rate.
def rewrite(query, domain):
    if (hit := cache.get((query, domain))):
        return hit
    rw = cheap_llm.complete(rewrite_prompt(query, domain),
                            max_tokens=60, temperature=0.0).strip()
    cache.set((query, domain), rw, ttl=30 * 86400)
    return rw
Run both the original and the rewrite through hybrid_retrieve and fuse the rank lists. The temperature-zero rewrite is the load-bearing detail.
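Fusing the original and rewritten query can be sketched as follows. The retriever and rewriter are passed in as callables so the example runs on its own; `retrieve_with_rewrite`, the fake index, and the fake rewrite table are all stand-ins invented for illustration, and the RRF here is a standalone copy using 1-based ranks.

```python
from collections import defaultdict


def rrf_merge(rankings, k=60):
    """Reciprocal rank fusion with 1-based ranks."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=lambda d: -fused[d])


def retrieve_with_rewrite(query, retrieve, rewrite, top_k=50):
    """Retrieve for the original query and its rewrite, then fuse the two lists."""
    rankings = [retrieve(query)]
    rewritten = rewrite(query)
    if rewritten and rewritten != query:
        rankings.append(retrieve(rewritten))
    return rrf_merge(rankings)[:top_k]


# Stand-in retriever and rewriter backed by dictionaries (hypothetical data).
fake_index = {
    "ssri side effects": ["d1", "d2", "d3"],
    "selective serotonin reuptake inhibitor adverse events": ["d4", "d1", "d2"],
}
fake_rewrites = {
    "ssri side effects": "selective serotonin reuptake inhibitor adverse events",
}

result = retrieve_with_rewrite(
    "ssri side effects",
    retrieve=lambda q: fake_index.get(q, []),
    rewrite=lambda q: fake_rewrites.get(q, q),
)
print(result)
```

Keeping the original query in the fusion means a poor rewrite can add candidates but cannot remove the ones the original query already found.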
3. Continued pretraining on the corpus
Take a strong open base encoder (bge-base-en-v1.5, around 110M parameters, is a fine starting point), run masked-language-model pretraining on the unlabeled corpus for 1 to 3 epochs, use the result as the new base for downstream fine-tuning. Roughly 3 GPU-days on a single A100, a few hundred dollars in compute, +3 to +6 NDCG@10 over the off-the-shelf model. No labels required — the corpus itself is the training signal, which is why this step is unusually cheap relative to its impact.
The mechanism: continued pretraining teaches the encoder the domain’s vocabulary distribution without yet teaching it the relevance function. Vocabulary is roughly half of the plateau. Domain adaptation at the MLM level is the standard recipe; the BGE-M3 and E5 model cards both document the gain on out-of-distribution corpora.
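The corruption step at the heart of MLM pretraining can be sketched in a few lines. This illustrates BERT-style masking (15% of positions selected; of those, 80% replaced by a mask token, 10% by a random token, 10% left unchanged) and is not a training loop. The function name, toy vocabulary, and seed are assumptions made for the example; in practice a library data collator does this over token IDs.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """BERT-style masking: returns (corrupted tokens, labels).

    labels[i] holds the original token where a prediction is required
    and None elsewhere.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)
            roll = rng.random()
            if roll < 0.8:
                corrupted.append("[MASK]")        # usual case: hide the token
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # sometimes: random token
            else:
                corrupted.append(tok)             # sometimes: keep it visible
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels


toks = "purge cache entries by org_id and key prefix".split()
vocab = sorted(set(toks))
corrupted, labels = mask_tokens(toks, vocab, mask_prob=0.5, seed=1)
print(corrupted)
print(labels)
```

Because the labels come from the corpus itself, domain tokens such as `org_id` become prediction targets without any human annotation.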
4. Contrastive fine-tuning on (query, positive, hard-negative) triples
Where most of the remaining lift lives. Take the continued-pretrained base from step 3, fine-tune with contrastive loss (InfoNCE or multiple-negatives-ranking-loss) on roughly 10k labeled (query, positive_doc) pairs with mined hard negatives. +8 to +15 NDCG@10 over the continued-pretrained base. Hard-negative mining is the load-bearing step; without it, the model learns to separate trivially-different docs and the eval barely moves.
# For each (query, positive) pair, retrieve top-k under the current
# encoder, skip the first 3 (likely near-duplicates), take the next N
# as hard negatives. These are the docs the model currently confuses
# for the positive — exactly what the next training round must learn.
def mine_hard_negatives(queries, positives, encoder, k=50, n=4):
    triples = []
    for q, pos in zip(queries, positives):
        cands = [c for c in encoder.search(q, top_k=k) if c.doc_id != pos]
        triples += [(q, pos, c.doc_id) for c in cands[3:3 + n]]
    return triples
Two traps. False negatives — a “hard negative” that is actually relevant but unlabeled trains the model to push correct retrievals away; skip the top few candidates or run an LLM-judge pass to filter. In-batch negatives alone are not enough on small domains — explicit hard-negative mining contributes most of the lift. Iterate: mine, train, re-mine with the updated encoder, train again. Two or three rounds is typical.
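Why hard negatives carry the lift can be seen in the loss itself. The sketch below computes InfoNCE for a single query from precomputed similarities; the similarity values and the temperature of 0.05 are illustrative assumptions, not settings taken from any specific training recipe.

```python
import math


def info_nce_loss(sim_pos, sim_negs, temperature=0.05):
    """InfoNCE for one query: -log softmax of the positive among all candidates."""
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denom - logits[0]


# Hypothetical cosine similarities between a query and candidate docs.
easy = info_nce_loss(0.80, [0.10, 0.05, 0.12, 0.08])
hard = info_nce_loss(0.80, [0.78, 0.75, 0.79, 0.77])
print(f"easy negatives: {easy:.4f}  hard negatives: {hard:.4f}")
```

With easy negatives the loss is already close to zero, so the gradient has almost nothing to push on; mined hard negatives keep the loss, and therefore the training signal, large.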
5. Specialized cross-encoder reranker on top
A cross-encoder reranker fine-tuned on the same labeled set closes most of the remaining gap, especially on long-tail queries where the embedding is close-but-not-great. The reranker is structurally a better fit for domain-specific relevance because it attends jointly across query and doc — see reranker on the request path for the production deployment concerns. +5 to +8 NDCG@10 on top of the fine-tuned bi-encoder. The last 20% of the lift, and the most expensive in latency.
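The shape of the rerank step is simple even though the scorer is not. In the sketch below the scorer is a token-overlap stand-in so the example runs without a model; a fine-tuned cross-encoder would take its place in production. `rerank` and `overlap_score` are hypothetical names.

```python
def rerank(query, candidates, score_fn, top_n=10):
    """Re-order candidates by a pairwise (query, doc) relevance score."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable for ties
    return [doc for _, doc in scored[:top_n]]


def overlap_score(query, doc):
    """Stand-in scorer: share of query tokens that appear in the doc."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


candidates = [
    "rate limits per api key",
    "webhook signature verification",
    "saml sso config",
]
result = rerank("webhook signature check", candidates, overlap_score)
print(result)
```

The scorer runs once per candidate, so latency grows with the size of the candidate set passed in; that is the cost the latency discussion refers to.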
What does NOT work — and what most teams try first
- Swapping vendors. Cohere for OpenAI, Voyage for Cohere — all three are trained on web text. Relative ordering shifts; the ceiling barely moves.
- Ensembling two off-the-shelf embedders. Averaging or concatenating vectors from two web-text embedders produces a redundant ensemble — both branches share the same blind spots and the fusion surfaces no new signal. Hybrid with BM25 works because the modality differs; hybrid with another dense model does not.
- Bigger embeddings. Going from 1024-dim to 3072-dim costs more storage and returns essentially nothing on out-of-distribution corpora. The bottleneck is what the model represents, not the dimensionality of the representation.
- A fully custom embedder from scratch. Expensive, brittle, and almost always worse than continued pretraining + contrastive fine-tune of an open base. Worth it only under the constraints in the callout below.
The leaderboard ranks embedders on benchmark distributions. A production corpus is a different distribution. Add BM25 first, rewrite queries second, continue pretraining third, contrastive fine-tune fourth, rerank fifth — in that order. The first two steps cost nothing and deliver half the lift.
This isn’t this pattern when…
| You observe… | This is probably… | Read next |
|---|---|---|
| Hybrid retrieval doesn’t lift the metric and the rare-vocab bucket is empty | The retrieval is fine; generation is the bottleneck | right doc, wrong answer |
| Off-the-shelf embedder is fine on aggregate, eval metric climbs every week, users complain | The eval that ranks the embedder is itself stale | eval drift |
| Domain-adapted embedder shipped, threshold-tuned downstream model degraded overnight | Score-distribution shift under the new encoder | threshold by feel |
| Plateau is real, fix-of-choice keeps being “let the LLM rerank everything” | Reaching for the frontier model where a specialist would do | single-LLM overspend |
| Reranker added, latency budget broken | Reranker placement, not embedder quality | reranker on the request path |
The disambiguation rule: embedding plateau is a recall problem (the right doc isn’t in the candidate set); right-doc-wrong-answer is a generation problem (the right doc is there and ignored); eval drift is a measurement problem (the metric is honest but stale). Same surface symptom — “the number won’t climb” — different mechanism underneath.
Numbers that matter
| signal | healthy | suspect | confirmed |
|---|---|---|---|
| mixed-alphanumeric token share | under 2% | 2 to 8% | over 8% |
| hybrid lift over dense-only (NDCG@10) | under 2 points | 3 to 5 points | over 5 points |
| rare-vocab bucket lift under fusion | under 0.05 | 0.10 to 0.25 | over 0.25 |
| KL(corpus ‖ web) | under 1.0 | 1.0 to 2.5 | over 2.5 |
| gap from MTEB leader rank to eval rank | within 1 position | 3 to 5 positions | over 5 positions |
Starting thresholds. Each corpus’s “healthy” depends on how literal its query vocabulary is at baseline — free-text customer-support tickets run lower mixed-alphanumeric share than API references, and both can be domain-shifted enough to plateau.
Adjacent patterns
- Eval drift: the eval used to measure the embedder may itself be stale. A domain-adapted retriever that lifts the metric without lifting user signal is the symptom. The two patterns interact: a stale eval can hide a plateau by averaging over old, easy queries.
- Reranker on the request path: the §5 treatment here (cross-encoder rerank) creates the latency problem treated there. The rerank fix should not ship without reading the rerank-deployment playbook first.
- Single-LLM overspend: the wrong-shaped fix for this pattern is adding a frontier-LLM reranker on top of an unimproved embedder and paying per-query latency and dollars for a job a fine-tuned specialist does at 1/100x cost. Treat the embedder, then specialize the reranker.
- Right doc, wrong answer: when the embedder does retrieve the right doc and the LLM still gets it wrong, the problem is downstream. Eliminate this pattern first, then look at the generation layer.
When hybrid retrieval is shipped, query rewriting runs in front, continued pretraining and contrastive fine-tune live on the base, a domain-specialized reranker is in place — and the metric still won’t climb — the pattern is one of the four above, not this one.
