Hard-Negative Mining

Also known as: hard negatives, negative mining, adversarial negatives

TL;DR

The training-data trick that makes embedders actually competitive: source negatives that look similar to the positive but aren't actually relevant.

Hard-negative mining is the practice of deliberately sourcing training negatives that resemble the positive (same domain, same vocabulary, similar surface form) but are not actually relevant. It’s the single most important data lever in training a competitive embedder, and often the difference between a model that scores in the middle of the MTEB leaderboard and one that lands at the top.

The motivation: contrastive-learning gradients are largest when the model is almost wrong. If the negative is obviously different from the positive, the loss is near zero and almost nothing is learned. If the negative is genuinely confusable, the gradient stays alive and the encoder is forced to learn fine-grained distinctions.
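To make the gradient argument concrete, here is a minimal sketch of an InfoNCE-style loss for a single query. It assumes similarity scores in [-1, 1] and a temperature of 0.05; the function name and all numbers are illustrative, not taken from any particular recipe. The gradient of the loss with respect to each negative's logit equals that negative's softmax probability, so easy negatives contribute almost nothing.

```python
import math

def info_nce(pos_score, neg_scores, temperature=0.05):
    """Return (loss, per-negative gradient weights) for one query.

    The weight of a negative is its softmax probability, which is the
    gradient of the loss with respect to that negative's logit.
    """
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    loss = -math.log(probs[0])
    return loss, probs[1:]

# Illustrative scores: three easy negatives vs. two hard ones plus one easy.
easy_loss, easy_weights = info_nce(0.8, [0.05, 0.02, -0.01])
hard_loss, hard_weights = info_nce(0.8, [0.78, 0.75, 0.02])

print(f"easy: loss={easy_loss:.6f} weights={easy_weights}")
print(f"hard: loss={hard_loss:.6f} weights={hard_weights}")
```

With the easy negatives the loss and every weight are effectively zero; with the hard ones the loss is substantial and the weight concentrates on the two confusable documents.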

Why random negatives fail

A random document drawn from a large corpus is almost orthogonal to the query in embedding space — they share no topic, no vocabulary, no structure. The model nails them in the first epoch and the loss flatlines. In-batch negatives suffer the same problem: every other document in the batch is also random.

This is the saturation regime. You can keep training, the loss stays low, and the model stops getting better. Hard negatives push you out of saturation by injecting examples the model still gets wrong.

How to mine

There are three standard sources of hard negatives, used in combination:

BM25 top-K, minus the positive. Cheap. Catches lexical confusables — documents with the right keywords but the wrong meaning. The traditional baseline since DPR (Karpukhin et al., 2020).
Bi-encoder top-K, minus the positive. Run an existing (or earlier-checkpoint) embedder over the corpus, take its top hits, exclude known positives. Catches semantic confusables. Usually iterated as the embedder improves.
Cross-encoder rescoring. Take BM25 or bi-encoder candidates, re-rank with a cross-encoder, keep middling-scored ones. The high-scored ones are likely true positives; the low-scored ones are easy negatives. The middle is the sweet spot.
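The "top-K, minus the positive" step shared by the first two strategies can be sketched as follows. This is a toy version under stated assumptions: documents are small in-memory vectors scored by dot product, and the names `top_k` and `mine_negatives` are hypothetical. A real pipeline would swap in BM25 scores or an approximate nearest-neighbor index, but the filtering logic is the same.

```python
def top_k(query_vec, doc_vecs, k):
    """Rank document ids by dot-product similarity to the query."""
    scores = {
        doc_id: sum(q * d for q, d in zip(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

def mine_negatives(query_vec, doc_vecs, positives, k):
    """Return the k highest-ranked documents that are not labeled positive."""
    # Over-fetch so that dropping the positives still leaves k candidates.
    ranked = top_k(query_vec, doc_vecs, k + len(positives))
    return [doc_id for doc_id in ranked if doc_id not in positives][:k]

# Toy corpus: d1 is the labeled positive, d2 and d3 are close to it.
doc_vecs = {
    "d1": [1.0, 0.0],
    "d2": [0.9, 0.1],
    "d3": [0.7, 0.3],
    "d4": [0.0, 1.0],
}
mined = mine_negatives([1.0, 0.0], doc_vecs, positives={"d1"}, k=2)
print(mined)
```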

DPR (Karpukhin et al., 2020) popularized BM25 mining, and most embedder papers since then have used some combination of these three sources. Modern recipes (zembed-1, E5-Mistral, BGE) layer cross-encoder filtering on top to handle the false-negative problem.

The false-negative problem arises because real corpora have multiple relevant documents per query, but only one (or a few) are labeled positive. If you mine hard negatives by taking the top-K most-similar-but-not-labeled documents, you’ll inevitably include relevant documents that simply weren’t labeled. The loss then pushes the encoder away from those, undoing the very behavior you want.

The standard fixes: (a) score mined candidates with a teacher cross-encoder and drop anything above a threshold (treats high-similarity-from-teacher as “probably a true positive”), (b) skip the top-N mined results entirely and start at rank N+1, (c) cluster the corpus and only mine within other clusters. (a) is most common in production.
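Fixes (a) and (b) above can be sketched in a few lines. The threshold of 0.8, the helper names, and the teacher scores below are all hypothetical; in practice the scores would come from a cross-encoder and the threshold would be tuned on held-out labels.

```python
def drop_likely_positives(candidates, teacher_scores, max_score=0.8):
    """Fix (a): discard candidates the teacher scores above the threshold."""
    return [c for c in candidates if teacher_scores[c] <= max_score]

def skip_top_n(ranked_candidates, n, k):
    """Fix (b): ignore the first n mined results and keep the next k."""
    return ranked_candidates[n:n + k]

# Candidates in mined rank order, with made-up teacher scores.
ranked = ["a", "b", "c", "d"]
teacher_scores = {"a": 0.95, "b": 0.60, "c": 0.40, "d": 0.05}

kept_by_threshold = drop_likely_positives(ranked, teacher_scores)
kept_by_rank = skip_top_n(ranked, n=1, k=2)
print(kept_by_threshold, kept_by_rank)
```

Here document "a" looks like an unlabeled positive to the teacher and is dropped by both fixes. The threshold version adapts per query, whereas the rank version removes a fixed number of candidates whether or not they are actually relevant.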

How many hard negatives, and how often

Most pipelines pair each positive with 5-15 mined hard negatives, plus the in-batch negatives that come for free. The hard negatives are usually re-mined every few epochs as the model improves — yesterday’s hard negatives become today’s easy ones, so you re-rank against the latest checkpoint to keep the difficulty curve alive.

This is sometimes called “iterative” or “online” hard-negative mining. The cost: each mining round runs the corpus through an embedder, then re-scores top-Ks, then writes new training shards. For a 100M-document corpus that’s a few hours of GPU time per round. Most teams budget for 2-5 rounds across full training.
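The re-mining schedule can be sketched as an outer loop around ordinary training. This is a structural sketch only: `mine_fn` and `train_epoch_fn` are hypothetical callables standing in for the real mining job and training epoch, and the round and epoch counts are placeholders.

```python
def train_with_remining(model, queries, mine_fn, train_epoch_fn,
                        rounds=3, epochs_per_round=2):
    """Alternate between mining against the latest model and training on the result."""
    for _ in range(rounds):
        # Re-mine with the current checkpoint so negatives stay hard for it.
        hard_negatives = {q: mine_fn(model, q) for q in queries}
        for _ in range(epochs_per_round):
            model = train_epoch_fn(model, hard_negatives)
    return model

# Toy stand-ins: the "model" is just a counter of completed epochs.
checkpoints_used_for_mining = []

def toy_mine(model, query):
    checkpoints_used_for_mining.append(model)
    return []

def toy_train_epoch(model, hard_negatives):
    return model + 1

final_model = train_with_remining(0, ["q1"], toy_mine, toy_train_epoch)
print(final_model, checkpoints_used_for_mining)
```

The toy run shows that each mining round sees a newer checkpoint than the last, which is the property that keeps the mined negatives difficult.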

Where this connects to distillation

Hard-negative mining and knowledge distillation intersect: the cross-encoder you use to filter false negatives can also be the teacher whose graded scores supervise the embedder. Modern distilled embedders (zembed-1, BGE, GTE) use the same cross-encoder for both jobs — filter mined negatives, then provide continuous supervision on the surviving pairs. This is why distilled embedders dominate graded benchmarks.
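As an illustration of what "continuous supervision" can look like, the sketch below computes a KL divergence between the teacher's and the student's score distributions over one query's candidates. KL divergence is one common choice and is an assumption here; the source does not say which loss these models use, and the scores are made up.

```python
import math

def softmax(scores, temperature=1.0):
    """Convert raw scores into a probability distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_scores, teacher_scores, temperature=1.0):
    """KL(teacher || student) over the candidates for one query."""
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Candidate order: [positive, hard negative, easy negative].
teacher = [4.0, 2.5, -1.0]
matched = distillation_loss(teacher, teacher)
mismatched = distillation_loss([1.0, 3.0, 0.5], teacher)
print(matched, mismatched)
```

Unlike a binary positive/negative label, the teacher's graded scores tell the student how much closer the positive should rank than each surviving negative.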

Go further

Why aren't in-batch negatives enough?

Random batch members are usually trivially distinguishable from the positive. After the first few thousand steps, the model gets them right and the gradient drops to near zero. To keep training informative you need negatives that are genuinely confusable — ones that share vocabulary, domain, or surface structure with the positive but mean something different.

How do you find hard negatives at scale?

Three common methods: BM25 retrieval over the corpus (cheap, surface-level confusables), an existing embedder's top-K minus the positive (semantic confusables), or a cross-encoder rescoring step that filters candidates near the positive's score. Production pipelines usually combine all three to cover lexical and semantic confusability.

What's the false-negative problem?

When you mine the top-K most similar documents as negatives, some of them are actually relevant but unlabeled. Training treats them as negatives anyway, teaching the model the wrong lesson. The standard fix is to filter mined candidates with a teacher cross-encoder and discard any whose score is too high — they're either positives or near-positives, not negatives.
