A cross-encoder takes a (query, document) pair as a single joint input and produces one relevance score. It captures token-level interactions between query and document — much more accurate than embedding them separately, at higher cost per pair.
A cross-encoder is a transformer that consumes the query and the document together — usually concatenated as [CLS] query [SEP] document [SEP] — and outputs a single relevance score. The key word is jointly: every layer of the model can attend across query tokens and document tokens, learning which words in the query line up with which words in the document.
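The joint input layout described above can be sketched in a few lines. This is a minimal illustration, not any particular tokenizer's API: `build_joint_input` is a hypothetical helper, the 512-token limit and the 0/1 segment ids are BERT-style assumptions, and the tokens are pre-split words rather than real subword pieces.

```python
def build_joint_input(query_tokens, doc_tokens, max_len=512):
    """Concatenate as [CLS] query [SEP] document [SEP], truncating the document."""
    special = 3  # one [CLS] plus two [SEP]
    room = max_len - special - len(query_tokens)
    if room < 0:
        raise ValueError("query alone exceeds max_len")
    tokens = ["[CLS]", *query_tokens, "[SEP]", *doc_tokens[:room], "[SEP]"]
    # Segment ids (assumed BERT-style): 0 for the query half, 1 for the document half
    first = len(query_tokens) + 2
    segment_ids = [0] * first + [1] * (len(tokens) - first)
    return tokens, segment_ids


tokens, segments = build_joint_input(
    ["jaguar", "speed"],
    ["the", "jaguar", "is", "a", "fast", "cat"],
)
print(tokens)
print(segments)
```

Because both halves sit in one sequence, self-attention in every layer can relate "jaguar" in the query to "jaguar" and "cat" in the document, which is the interaction a bi-encoder never sees.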
The bi-encoder commits to a fixed embedding for each text in isolation; the cross-encoder reads them together. Joint attention is the whole edge.
A bi-encoder encodes query and document separately and only compares them at the final dot-product step — it never gets to “see” the two together inside the model. That asymmetry is where the cross-encoder’s accuracy edge comes from.
The cost picture
Cross-encoders are slower per pair because they require a full forward pass per (query, document) — you can’t precompute and cache like you can with bi-encoder embeddings. That’s why cross-encoders are used as rerankers over a small candidate set (~100 docs per query), not for first-pass retrieval over millions of documents.
The latency math: a typical cross-encoder runs ~100 pairs per query in ~150ms total on GPU. That’s plenty fast for production RAG, but it would be economic suicide to run it across an entire million-document corpus per query.
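The arithmetic behind that claim is short enough to write out. The sketch below uses only the figures quoted above (100 pairs, 150 ms, a million documents) and assumes, as a simplification, that latency scales linearly with the number of pairs at the same batch efficiency.

```python
PAIRS_PER_QUERY = 100      # rerank candidate set, from the text
BATCH_LATENCY_MS = 150.0   # total GPU latency for those pairs, from the text
CORPUS_SIZE = 1_000_000    # documents in the full corpus

# Amortized cost of one (query, document) forward pass
ms_per_pair = BATCH_LATENCY_MS / PAIRS_PER_QUERY

# Assumption: linear scaling if every document in the corpus were scored
full_corpus_seconds = CORPUS_SIZE * ms_per_pair / 1000.0

print(f"{ms_per_pair} ms per pair")
print(f"{full_corpus_seconds / 60:.0f} minutes per query over the full corpus")
```

Under that assumption a single query over the whole corpus takes about 25 minutes of GPU time, against 150 ms for the reranking stage.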
When cross-encoders win
The accuracy gap over bi-encoders is largest on:
Where the joint-attention edge shows up most
Hard negatives — documents that look superficially similar to the query but are actually irrelevant. Bi-encoders embed both into the same dense neighborhood; cross-encoders can detect the semantic mismatch.
Long documents — bi-encoders compress everything into one fixed-size vector; cross-encoders can attend selectively to the relevant span.
Polysemic queries — words with multiple meanings. The cross-encoder uses the query as context to disambiguate the document.
Negation and quantifier queries — “documents about X but not Y” requires reading the candidate against the query at the token level.
Numeric or entity matching — exact identifiers that bi-encoders blur into a generic region of the embedding space.
Production implementations
zerank-2 is a cross-encoder. So are Cohere’s rerank-3.5, Voyage rerank-2.5, and Jina’s reranker m0. The architectural class has converged; the differentiation is in training data, calibration, and per-pair latency.
LLM quantization works because a generative decoder is robust to small per-token output perturbations — a few logit-rank flips don’t visibly change the response. A cross-encoder produces a single scalar relevance score per pair, and the ordering of those scores is the entire output. Small numerical perturbations propagate directly into rank flips at the top of the candidate list, where ranking sensitivity is highest.
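A toy example makes the sensitivity concrete. The scores and the perturbations below are invented for illustration, not measurements from any model; the perturbation is applied directly to the output scores as a stand-in for whatever numerical error the model accumulates.

```python
def rank(scores):
    """Return candidate ids ordered by descending score."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical clean scores: the top two candidates are nearly tied
clean = {"doc_a": 0.912, "doc_b": 0.908, "doc_c": 0.640, "doc_d": 0.310}

# Hypothetical per-pair errors of a few thousandths
noise = {"doc_a": -0.004, "doc_b": 0.003, "doc_c": 0.004, "doc_d": -0.003}
noisy = {doc: clean[doc] + noise[doc] for doc in clean}

print(rank(clean))
print(rank(noisy))
```

The perturbation is the same size everywhere, but only the tightly packed top of the list changes order, and that is the part of the ranking that metrics such as NDCG@10 weight most heavily.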
Empirically, a cross-encoder quantized to int8 loses 1-2 NDCG@10 points on hard benchmarks; the same model at fp16 holds full accuracy. The cost picture also favors staying at fp16: a 4B cross-encoder running 100 pairs in 150ms is a small slice of the total RAG latency budget, so the throughput win from quantization isn’t load-bearing.
For very latency-sensitive deployments, the better lever is cascade reranking (cheap small cross-encoder for top-1000, expensive larger one for top-100) rather than aggressive quantization of a single stage.
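A cascade of this shape can be sketched with two scorer callables. Everything here is a hypothetical stand-in: `cheap_scorer` and `expensive_scorer` are word-overlap toys rather than real cross-encoders, and the final cut of 10 is an assumption; only the 1000-to-100 narrowing comes from the text.

```python
from typing import Callable, List, Sequence

Scorer = Callable[[str, str], float]


def rerank(query: str, docs: Sequence[str], scorer: Scorer, keep: int) -> List[str]:
    """Score every document once and keep the best `keep`."""
    return sorted(docs, key=lambda d: scorer(query, d), reverse=True)[:keep]


def cascade_rerank(query, candidates, cheap, expensive, mid_k=100, final_k=10):
    # Stage 1: the cheap model sees every candidate (e.g. 1000)
    shortlist = rerank(query, candidates, cheap, mid_k)
    # Stage 2: the expensive model sees only the shortlist (e.g. 100)
    return rerank(query, shortlist, expensive, final_k)


calls = {"cheap": 0, "expensive": 0}


def cheap_scorer(query: str, doc: str) -> float:
    calls["cheap"] += 1
    return float(len(set(query.split()) & set(doc.split())))


def expensive_scorer(query: str, doc: str) -> float:
    calls["expensive"] += 1
    overlap = len(set(query.split()) & set(doc.split()))
    return overlap / (1 + len(doc.split()))


docs = [f"doc {i} about topic {i % 7}" for i in range(1000)]
top = cascade_rerank("topic 3", docs, cheap_scorer, expensive_scorer)
print(calls, len(top))
```

The call counter shows the point of the design: the expensive scorer runs on a tenth of the candidates, so its per-pair cost is paid only where it can change the final ordering.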
A pointwise cross-encoder scores each (query, doc) pair independently and orders by score. A listwise LLM reranker takes the entire candidate list at once and emits an ordering directly — typically by prompting a frontier LLM with the query plus all candidates and asking for a re-ordered list.
The listwise model has access to comparative information the pointwise model doesn’t: knowing what other candidates are in the list lets it apply tie-breaking and diversity reasoning. On benchmarks where ordering quality matters more than raw relevance, listwise LLM rerankers can beat the best cross-encoders by a small margin.
The cost picture is brutal, though. A listwise rerank over 100 candidates is a single LLM call with all 100 documents in the prompt: easily 30K-50K input tokens, sometimes more, plus the output generation. At frontier-LLM rates that costs far more per query than the ~0.001 a cross-encoder costs. For most production workloads, the cross-encoder is the right operating point; the listwise stage shows up only at the very top of the latency-quality Pareto frontier.
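The token count follows from simple multiplication. In the sketch below, the 300 and 500 tokens per document are assumed values chosen because they reproduce the 30K-50K range quoted above; `listwise_prompt_tokens` is a hypothetical helper, and no pricing is modeled.

```python
def listwise_prompt_tokens(num_candidates, tokens_per_doc, query_tokens=0):
    """Input tokens for one listwise call that carries every candidate."""
    return query_tokens + num_candidates * tokens_per_doc


for tokens_per_doc in (300, 500):  # assumed average document lengths
    total = listwise_prompt_tokens(100, tokens_per_doc)
    print(f"{tokens_per_doc} tokens/doc -> {total} input tokens")
```

Longer documents or a larger candidate set push the prompt past that range quickly, since the total grows with the product of the two.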
Go further
Why can't I just use a cross-encoder for first-pass retrieval?
Cost. A cross-encoder is one full forward pass per (query, document) pair — at a million-document corpus that's a million forwards per query. Bi-encoders precompute document embeddings once and reduce query-time cost to a dot product. Cross-encoders are economical only over the small candidate set a bi-encoder feeds them.
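The resulting two-stage pipeline can be sketched as follows. The embeddings, documents, and the word-overlap `cross_score` are toy stand-ins invented for illustration; in a real system the first stage would be an approximate nearest-neighbor index and the second a trained cross-encoder.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def retrieve_then_rerank(query, query_vec, doc_vecs, docs, cross_score, k=100, final_k=10):
    # Stage 1: dot products against precomputed document embeddings
    first_pass = sorted(
        doc_vecs, key=lambda doc_id: dot(query_vec, doc_vecs[doc_id]), reverse=True
    )[:k]
    # Stage 2: one joint (query, document) scoring call per surviving candidate
    reranked = sorted(
        first_pass, key=lambda doc_id: cross_score(query, docs[doc_id]), reverse=True
    )
    return reranked[:final_k]


docs = {
    "d1": "jaguar top speed in the wild",
    "d2": "jaguar car dealership prices",
    "d3": "history of the bicycle",
    "d4": "cheetah top speed",
}
# Toy 2-d embeddings, computed once offline in a real system
doc_vecs = {"d1": [0.9, 0.1], "d2": [0.8, 0.2], "d3": [0.0, 1.0], "d4": [0.7, 0.3]}

cross_calls = 0


def cross_score(query, doc):
    """Toy stand-in for a cross-encoder: count shared words."""
    global cross_calls
    cross_calls += 1
    return len(set(query.split()) & set(doc.split()))


result = retrieve_then_rerank(
    "jaguar top speed", [1.0, 0.0], doc_vecs, docs, cross_score, k=3, final_k=2
)
print(result, cross_calls)
```

The joint scorer runs three times here, not four: the first stage has already discarded the document it would have been wasted on, and the same ratio is what makes the approach viable at a million documents.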
Can a cross-encoder do listwise reasoning over the candidate set?
Standard cross-encoders are pointwise — they see one (query, doc) pair at a time. To reason across candidates you need a [listwise reranker](/concepts/listwise-reranking/) (typically LLM-based), or a downstream MMR/diversity step composed on top of pointwise scores.
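A diversity step of that kind can be sketched with maximal marginal relevance (MMR), which greedily picks the candidate with the best trade-off between its pointwise score and its similarity to what is already selected. The relevance values, the similarity lookup, and the `lam` weight of 0.7 are assumptions made up for the example.

```python
def mmr(relevance, similarity, k, lam=0.7):
    """Greedy MMR: lam * relevance - (1 - lam) * max similarity to selected."""
    selected = []
    remaining = set(relevance)
    while remaining and len(selected) < k:

        def mmr_score(doc):
            redundancy = max((similarity(doc, s) for s in selected), default=0.0)
            return lam * relevance[doc] - (1 - lam) * redundancy

        # sorted() makes tie-breaking deterministic
        best = max(sorted(remaining), key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical pointwise cross-encoder scores; "a_dup" is a near-duplicate of "a"
relevance = {"a": 0.95, "a_dup": 0.94, "b": 0.80}
pair_sim = {frozenset(("a", "a_dup")): 0.98}


def similarity(x, y):
    return pair_sim.get(frozenset((x, y)), 0.1)


print(mmr(relevance, similarity, k=2))           # diversity-aware selection
print(mmr(relevance, similarity, k=2, lam=1.0))  # pure relevance ordering
```

With `lam=1.0` the near-duplicate takes second place on score alone; with the diversity term active it is displaced by the less relevant but distinct candidate.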
The state of the art uses graded relevance scores rather than binary labels. [zELO](/concepts/zelo/) generates those scores via pairwise LLM votes plus a Thurstone fit; the cross-encoder then regresses on them. The result is a calibrated pointwise model trained without human annotation.
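To show how pairwise votes can become graded scores, here is a sketch of the classical Thurstone Case V solution: convert each win rate to a z-score through the inverse normal CDF and average per item. This is the textbook method, not necessarily the procedure zELO uses; the vote counts are invented, and the `eps` clipping is an assumption that keeps unanimous votes finite.

```python
from statistics import NormalDist


def thurstone_case_v(wins, items, eps=0.01):
    """wins[(i, j)] is the number of votes preferring item i over item j."""
    inv_cdf = NormalDist().inv_cdf
    scores = {}
    for i in items:
        z_values = []
        for j in items:
            if i == j:
                continue
            w_ij = wins.get((i, j), 0)
            w_ji = wins.get((j, i), 0)
            total = w_ij + w_ji
            if total == 0:
                continue  # this pair was never compared
            # Clip so that 0% or 100% win rates do not map to infinity
            p = min(max(w_ij / total, eps), 1 - eps)
            z_values.append(inv_cdf(p))
        scores[i] = sum(z_values) / len(z_values) if z_values else 0.0
    return scores


# Hypothetical LLM votes over three documents for one query
votes = {
    ("A", "B"): 8, ("B", "A"): 2,
    ("A", "C"): 9, ("C", "A"): 1,
    ("B", "C"): 7, ("C", "B"): 3,
}
scores = thurstone_case_v(votes, ["A", "B", "C"])
for doc, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(doc, round(score, 3))
```

The output is a continuous score per document on a shared scale, which is the kind of graded target a pointwise model can regress on, where binary labels would only say relevant or not.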