Instruction-Following Reranker

Q: How does this differ from query rewriting?

Query rewriting reshapes the query before retrieval — every downstream stage sees the new query. Instructions live in a separate channel only the reranker reads, so they can encode business rules and user context without polluting the first-pass retrieval signal.

Also known as: instructable reranker, promptable reranker

TL;DR

An instruction-following reranker accepts an explicit instruction or context alongside the (query, document) pair, and reranks accordingly. Lets you inject business rules, user preferences, or domain context per call without retraining.

A standard reranker takes (query, document) and produces a relevance score. An instruction-following reranker takes (query, instruction, document) — the instruction shapes how the reranker interprets relevance.

What lives in the instruction channel

Business rules — “Prefer documents that cite the original signatory.”
User context — “This user is in the medical-records team; weight clinical-grade sources higher.”
Disambiguation hints — “Treat ‘IMO’ as referring to the International Maritime Organization, not the International Mathematical Olympiad.”
Term glossaries — “EBITDA equals earnings before interest, taxes, depreciation, and amortization.”
Recency bias — “The user is debugging current production; prefer documents written in the last six months.”

Why this matters

Without instruction-following, the only way to inject business context into reranking is to pre-process the query (concatenating context) or post-process the results (re-ranking via business rules outside the model). Both are awkward and lose the model’s ability to weigh context against semantic relevance natively.

A reranker trained on instructions can:

Resolve polysemic queries using context cues (the “IMO” example below).
Score documents that provide useful background (not direct answers) appropriately — without dropping them like a strict-relevance reranker would.
Adapt scoring to per-customer rules without per-customer fine-tuning.

The training data shape is (query, instruction, document, score) quadruples — not just (query, document, score). Supervision is sourced through the same pairwise-LLM-judge pipeline used for ordinary rerankers, but the judge is shown the instruction alongside the query when emitting a preference. The model learns to attend to the instruction tokens and condition its relevance scoring on them rather than ignoring them. Without instructions in the training data, fine-tuning a reranker to follow instructions at inference time barely works — instructions act as out-of-distribution noise.

Concrete behavior

zerank-2 is instruction-following. A practical example from its launch eval:

Query: "Candidates with IMO experience"
Instruction: "We're looking for engineering talent for a marine logistics company."

Document: "Candidate experience: Worked at the International Marine Organization"

zerank-2 (no instruction): 0.33
zerank-2 (with instruction): 0.64

The same document jumps from “marginally relevant” to “strongly relevant” once the model knows the user’s domain.

Risks

Instruction drift — overusing instructions to compensate for a poor index. Instructions should clarify, not compensate.
Prompt-injection-shaped attacks — if the instruction is shown to the model in the same tokens as the document, a malicious document could include text that overrides the instruction. Production systems segregate the instruction channel.
Instructions become product — once you depend on a particular phrasing, you can’t easily swap reranker models without re-validating instruction behavior.

Go further

How does this differ from query rewriting?

[Query rewriting](/concepts/query-rewriting/) reshapes the query before retrieval — every downstream stage sees the new query. Instructions live in a separate channel only the reranker reads, so they can encode business rules and user context without polluting the first-pass retrieval signal.

Query rewriting Reranker First-pass retrieval

Are instructions calibrated the same way scores are?

Sort of — a well-trained instruction-following reranker preserves [calibration](/concepts/score-calibration/) under instruction shifts, so 0.7 still means roughly 'highly relevant' even when the instruction tightens or loosens the relevance criterion. Drift is real though, so re-validate calibration when you change instruction templates.

Score calibration Cross-encoder

How do I evaluate instruction quality?

A/B the same query+candidate set with and without your instruction; measure NDCG@10 against a labeled set that reflects the intent the instruction encodes. If instructions help, the lift should be measurable on instruction-aware queries and roughly neutral on generic ones.

Evaluating a reranker on your own data (playbook) NDCG@k zELO

← All concepts

The best AI teams build with ZeroEntropy models

Book Demo View docs