Also known as: instructable reranker, promptable reranker
TL;DR
An instruction-following reranker accepts an explicit instruction or context alongside the (query, document) pair, and reranks accordingly. This lets you inject business rules, user preferences, or domain context per call without retraining.
A standard reranker takes (query, document) and produces a relevance score. An instruction-following reranker takes (query, instruction, document) — the instruction shapes how the reranker interprets relevance.
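The difference in input shape can be sketched as follows. This is a minimal illustration, not any vendor's API: `RerankRequest`, `rerank`, and `toy_score` are hypothetical names, and the word-overlap scorer is only a stand-in for a real model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class RerankRequest:
    query: str
    documents: Sequence[str]
    instruction: str = ""  # empty instruction = standard (query, document) reranking


# Hypothetical scorer signature: (query, instruction, document) -> relevance score.
Scorer = Callable[[str, str, str], float]


def rerank(request: RerankRequest, score: Scorer) -> List[Tuple[float, str]]:
    """Score every document under the same query and instruction, best first."""
    scored = [
        (score(request.query, request.instruction, doc), doc)
        for doc in request.documents
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def toy_score(query: str, instruction: str, document: str) -> float:
    """Stand-in for a model: share of query+instruction words found in the document."""
    wanted = set((query + " " + instruction).lower().split())
    words = set(document.lower().split())
    return len(wanted & words) / max(len(wanted), 1)
```

The point of the sketch is the signature: the instruction is a separate argument that every candidate is scored against, rather than text merged into the query.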
What lives in the instruction channel
Business rules — “Prefer documents that cite the original signatory.”
User context — “This user is in the medical-records team; weight clinical-grade sources higher.”
Disambiguation hints — “Treat ‘IMO’ as referring to the International Maritime Organization, not the International Mathematical Olympiad.”
Term glossaries — “EBITDA equals earnings before interest, taxes, depreciation, and amortization.”
Recency bias — “The user is debugging current production; prefer documents written in the last six months.”
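The categories above could be assembled into a single instruction string per call. The sketch below assumes a plain-text instruction; `InstructionParts` and `build_instruction` are hypothetical names, and the "X equals Y" glossary phrasing simply mirrors the example in the list.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InstructionParts:
    business_rules: List[str] = field(default_factory=list)
    user_context: List[str] = field(default_factory=list)
    disambiguation: List[str] = field(default_factory=list)
    glossary: Dict[str, str] = field(default_factory=dict)
    recency: str = ""


def build_instruction(parts: InstructionParts) -> str:
    """Join the non-empty parts into one instruction string, in a fixed order."""
    lines: List[str] = []
    lines += parts.business_rules
    lines += parts.user_context
    lines += parts.disambiguation
    lines += [f"{term} equals {meaning}." for term, meaning in parts.glossary.items()]
    if parts.recency:
        lines.append(parts.recency)
    return " ".join(line.strip() for line in lines if line.strip())
```

Keeping the parts structured until the last moment makes it easier to log which rule was active for a given call and to re-validate templates later.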
Why this matters
Without instruction-following, the only way to inject business context into reranking is to pre-process the query (concatenating context) or post-process the results (re-ranking via business rules outside the model). Both are awkward and lose the model’s ability to weigh context against semantic relevance natively.
A reranker trained on instructions can:
Resolve polysemic queries using context cues (the “IMO” example below).
Score documents that provide useful background rather than direct answers appropriately, instead of dropping them as a strict-relevance reranker would.
Adapt scoring to per-customer rules without per-customer fine-tuning.
The training data shape is (query, instruction, document, score) quadruples, not just (query, document, score) triples. Supervision is sourced through the same pairwise-LLM-judge pipeline used for ordinary rerankers, but the judge is shown the instruction alongside the query when emitting a preference. The model learns to attend to the instruction tokens and condition its relevance scoring on them rather than ignoring them. Without instructions in the training data, asking a reranker to follow instructions at inference time barely works: the instruction tokens act as out-of-distribution noise.
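The quadruple shape can be sketched as a record type. `TrainingExample`, `Preference`, and `preferences_to_examples` are hypothetical names, and the win-rate aggregation is a stand-in chosen for brevity: the text does not say how judge preferences are turned into scores.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class TrainingExample:
    query: str
    instruction: str
    document: str
    score: float


@dataclass(frozen=True)
class Preference:
    """One judge verdict: under this query AND instruction, winner beats loser."""
    query: str
    instruction: str
    winner: str
    loser: str


def preferences_to_examples(prefs: Iterable[Preference]) -> List[TrainingExample]:
    """Assumed aggregation: score = fraction of comparisons the document won."""
    wins: Dict[Tuple[str, str, str], int] = defaultdict(int)
    games: Dict[Tuple[str, str, str], int] = defaultdict(int)
    for p in prefs:
        for doc in (p.winner, p.loser):
            games[(p.query, p.instruction, doc)] += 1
        wins[(p.query, p.instruction, p.winner)] += 1
    return [
        TrainingExample(q, i, d, wins[(q, i, d)] / n)
        for (q, i, d), n in games.items()
    ]
```

Note that the instruction is part of the grouping key, so the same document can receive different target scores under different instructions for the same query.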
Concrete behavior
zerank-2 is instruction-following. A practical example from its launch eval:
Query: "Candidates with IMO experience"
Instruction: "We're looking for engineering talent for a marine logistics company."
Document: "Candidate experience: Worked at the International Marine Organization"
zerank-2 (no instruction): 0.33
zerank-2 (with instruction): 0.64
The same document jumps from “marginally relevant” to “strongly relevant” once the model knows the user’s domain.
Risks
Instruction drift — overusing instructions to compensate for a poor index. Instructions should clarify, not compensate.
Prompt-injection-shaped attacks — if the instruction reaches the model in the same token stream as the document, a malicious document could include text that overrides the instruction. Production systems segregate the instruction channel.
Instructions become product — once you depend on a particular phrasing, you can’t easily swap reranker models without re-validating instruction behavior.
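One way to picture the channel segregation mentioned under the prompt-injection risk is sketched below. The tag-based template and the names `neutralize` and `render_input` are assumptions for illustration only; real rerankers define their own input formats, and stripping markers is a single mitigation layer rather than a complete defense.

```python
import re

# Hypothetical channel markers; a real model's template will differ.
_MARKER = re.compile(r"</?(instruction|query|document)>", re.IGNORECASE)


def neutralize(text: str) -> str:
    """Remove channel markers so untrusted text cannot open or close a channel."""
    previous = None
    while previous != text:  # repeat so a removal cannot splice a new marker together
        previous = text
        text = _MARKER.sub("", text)
    return text


def render_input(query: str, instruction: str, document: str) -> str:
    """Place each field in its own delimited channel after neutralizing it."""
    return (
        f"<instruction>{neutralize(instruction)}</instruction>\n"
        f"<query>{neutralize(query)}</query>\n"
        f"<document>{neutralize(document)}</document>"
    )
```

With this template, a document containing its own instruction tags still renders with exactly one instruction channel; the injected words remain visible, but only as document content.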
Go further
How does this differ from query rewriting?
[Query rewriting](/concepts/query-rewriting/) reshapes the query before retrieval — every downstream stage sees the new query. Instructions live in a separate channel only the reranker reads, so they can encode business rules and user context without polluting the first-pass retrieval signal.
Are instructions calibrated the same way scores are?
Sort of — a well-trained instruction-following reranker preserves [calibration](/concepts/score-calibration/) under instruction shifts, so 0.7 still means roughly 'highly relevant' even when the instruction tightens or loosens the relevance criterion. Drift is real though, so re-validate calibration when you change instruction templates.
To check whether an instruction is actually helping, A/B the same query+candidate set with and without your instruction; measure NDCG@10 against a labeled set that reflects the intent the instruction encodes. If instructions help, the lift should be measurable on instruction-aware queries and roughly neutral on generic ones.
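The A/B comparison can be sketched with a small NDCG@10 helper. This assumes graded relevance labels keyed by document ID and uses the linear-gain form of DCG; `ndcg_at_k` and `instruction_lift` are illustrative names, not part of any reranker SDK.

```python
import math
from typing import Dict, List, Sequence, Tuple


def dcg_at_k(gains: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain with linear gains and a log2 rank discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids: Sequence[str], labels: Dict[str, float], k: int = 10) -> float:
    """NDCG@k of a ranking against labels; unlabeled documents count as 0."""
    gains = [labels.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal_dcg = dcg_at_k(sorted(labels.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# One run = (ranking without instruction, ranking with instruction, labels).
Run = Tuple[List[str], List[str], Dict[str, float]]


def instruction_lift(runs: Sequence[Run], k: int = 10) -> float:
    """Mean NDCG@k change from adding the instruction; positive means it helped."""
    deltas = [
        ndcg_at_k(with_instr, labels, k) - ndcg_at_k(without, labels, k)
        for without, with_instr, labels in runs
    ]
    return sum(deltas) / len(deltas)
```

Computing `instruction_lift` separately for instruction-aware queries and for generic queries gives the two numbers the paragraph above describes: a measurable lift on the first group and a value near zero on the second.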