Cross-Lingual Retrieval

Also known as: multilingual retrieval, CLIR, cross-language search

TL;DR

Cross-lingual retrieval is finding documents in one language that answer a query in another. A multilingual embedding or reranker maps text from any language into the same vector space, so a French query can retrieve English documents.

Cross-lingual retrieval (CLIR) is the task of retrieving documents whose language differs from the query’s. The user types in Japanese; the relevant document is in English. A monolingual model fails on this because the surface tokens don’t overlap. A multilingual model trained with cross-lingual alignment maps both languages into a shared embedding space where translation-equivalent text sits close together.
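As a minimal sketch of what "a shared embedding space" buys you, the snippet below ranks English documents against a Japanese query by cosine similarity. The three-dimensional vectors are hand-written stand-ins, not the output of any real model; in practice they would come from a multilingual embedder.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, written by hand for illustration only.
doc_vectors = {
    "en: How to charge an electric car": [0.9, 0.1, 0.0],
    "en: Recipe for onion soup": [0.0, 0.2, 0.9],
}
# Stands in for the embedding of a Japanese query about EV charging.
query_vector = [0.8, 0.2, 0.1]

def rank(query_vec, docs):
    """Return document keys sorted by descending similarity to the query."""
    return sorted(docs, key=lambda d: cosine(query_vec, docs[d]), reverse=True)

ranking = rank(query_vector, doc_vectors)
```

Nothing in the ranking step looks at the language of either side. If alignment worked during training, the nearest neighbours are the translation-equivalent documents.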

Closing the modality gap is the central challenge in cross-lingual retrieval. Identical content in different languages should land in the same place in vector space — and usually doesn’t.

How it’s trained

Training signals for cross-lingual alignment include:

Parallel corpora — sentences with their translations. Push the embeddings of (source, target) pairs together.
Document-aligned multilingual data — Wikipedia articles in multiple languages, news stories with translated counterparts.
LLM-synthesized pairs — modern embedding models often use frontier LLMs to generate cross-lingual training data at scale.
Code-switching examples — queries that mix languages (Spanglish, Hinglish) need explicit examples to handle gracefully.

A model trained on a healthy mix of all four can approach monolingual quality across major languages while staying robust on code-switched queries.
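"Push the embeddings of (source, target) pairs together" is commonly implemented as an in-batch contrastive loss. The sketch below is one such formulation (InfoNCE, source-to-target direction only), assuming unit-normalized vectors and a hypothetical temperature of 0.1. It illustrates the idea and is not the training recipe of any particular model.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(src, tgt, temperature=0.1):
    """Mean in-batch contrastive loss: src[i] should match tgt[i].

    Every other tgt[j] in the batch acts as a negative for src[i].
    """
    n = len(src)
    total = 0.0
    for i in range(n):
        logits = [dot(src[i], tgt[j]) / temperature for j in range(n)]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum - logits[i]  # equals -log softmax(logits)[i]
    return total / n

# Toy batch: two source sentences and their translations (hand-written vectors).
source = [[1.0, 0.0], [0.0, 1.0]]
aligned_targets = [[1.0, 0.0], [0.0, 1.0]]     # translations sit on their sources
misaligned_targets = [[0.0, 1.0], [1.0, 0.0]]  # translations sit on the wrong source

aligned_loss = info_nce(source, aligned_targets)
misaligned_loss = info_nce(source, misaligned_targets)
```

The loss is near zero when each translation is the closest target to its source and large when it is not, which is the gradient signal that pulls parallel pairs together.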

The “modality gap” problem

Most multilingual embedding models exhibit a strong modality gap: identical content translated across languages doesn’t quite land in the same place in vector space. A Spanish query run against a mixed Spanish and English corpus tends to score the Spanish documents systematically higher than translation-equivalent English ones, even when the English document is more relevant.

Mitigations include explicit alignment losses, multilingual contrastive training, and per-language calibration. zembed-1 was trained on majority non-English data with cross-lingual contrastive objectives specifically to flatten this gap.
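Per-language calibration can take several forms. The sketch below shows one simple heuristic, assumed here for illustration: z-score the raw similarity scores within each document language before merging the lists. The scores and document IDs are made up, and the approach assumes each language holds a comparable share of relevant documents, which will not always be true.

```python
from statistics import mean, pstdev

def calibrate_by_language(scored_docs):
    """scored_docs: list of (doc_id, doc_lang, raw_score) for one query.

    Returns (doc_id, z_score) pairs sorted by descending calibrated score.
    """
    by_lang = {}
    for _, lang, score in scored_docs:
        by_lang.setdefault(lang, []).append(score)
    # Fall back to a std of 1.0 when a language has a single doc or identical scores.
    stats = {lang: (mean(v), pstdev(v) or 1.0) for lang, v in by_lang.items()}
    calibrated = [
        (doc_id, (score - stats[lang][0]) / stats[lang][1])
        for doc_id, lang, score in scored_docs
    ]
    return sorted(calibrated, key=lambda pair: pair[1], reverse=True)

# Hypothetical scores for a Spanish query: Spanish docs get a same-language boost.
scored = [
    ("es1", "es", 0.80), ("es2", "es", 0.78), ("es3", "es", 0.76),
    ("en1", "en", 0.74), ("en2", "en", 0.60), ("en3", "en", 0.58),
]

raw_top = max(scored, key=lambda t: t[2])[0]
calibrated_top = calibrate_by_language(scored)[0][0]
```

In this toy data the English document `en1` stands out far more within its own language than any Spanish document does within Spanish, so it moves to the top once the same-language offset is removed. A production setup would more likely estimate the per-language statistics on a held-out set than per query.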

When it matters

CLIR is critical for:

Enterprise teams whose corpus has mixed-language documents.
Knowledge bases that exist primarily in English but are queried by global users.
Legal/medical/scientific search where authoritative documents may be in another language.
Any vertical where translation memory and source documents live together.

Performance on CLIR is usually a separate column on retrieval benchmarks — don’t assume strong English benchmarks predict strong cross-lingual behavior.

BM25 scores documents by token overlap with the query, weighted by inverse document frequency. The mechanism is purely lexical: a query token has to literally appear in a document for that document to score nonzero on the term.

Across languages, the surface tokens never overlap (modulo the rare loanword or named entity that survives transliteration). A French query for “voiture électrique” sees zero BM25 score against any English document, regardless of how perfectly it matches semantically. Hybrid search architectures that combine BM25 with dense retrieval lose the BM25 leg entirely in CLIR — it contributes no signal.

The exception is named-entity-heavy queries. Proper nouns (“BMW”, “Tokyo”, “GPT-4”) often survive across languages, and BM25 can recover them; for entity-anchored CLIR a hybrid still helps. For free-text semantic queries, dense retrieval carries essentially all of the weight.
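Both behaviours can be seen with a small BM25 implementation. This is a minimal sketch that assumes whitespace tokenization, lowercasing, and the common defaults k1 = 1.5 and b = 0.75; the three documents are invented for the example.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a basic BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # purely lexical: an absent term contributes nothing
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    "electric car charging guide",
    "bmw electric car review",
    "onion soup recipe",
]

free_text = bm25_scores("voiture électrique", docs)  # no shared tokens
entity = bm25_scores("essai voiture BMW", docs)      # "bmw" survives the language change
```

The free-text French query scores zero against every English document, while the entity query scores only the document that contains the shared proper noun.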

The wrong way to evaluate CLIR is to take an English benchmark like BEIR, translate the queries, and report numbers. That measures translation quality plus retrieval quality without separating them, and the resulting score doesn’t correspond to any real workload.

The right way is to construct a small held-out set with native-language queries against your own multilingual corpus. Aim for 50-200 queries per language pair you care about, with relevance judgments at the doc level. Pair each query with both same-language relevant docs (the monolingual baseline) and cross-language relevant docs (the CLIR test).

Report NDCG@10 and recall@100 separately for each language pair. The two directions of a pair (French query → English doc vs. English query → French doc) often show different numbers; both matter. The numbers will be noisy at small N, so bootstrap a confidence interval; an improvement smaller than the CI width can’t be distinguished from noise.
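The metrics and the confidence interval are short enough to write by hand. The sketch below assumes relevance judgments stored as a dict from document ID to graded gain, and uses a percentile bootstrap over per-query scores. The function names and the 2000-resample default are choices made for this example.

```python
import math
import random

def ndcg_at_k(ranked_ids, relevance, k=10):
    """relevance maps doc_id -> graded gain; unjudged docs count as 0."""
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
    )
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0

def recall_at_k(ranked_ids, relevance, k=100):
    """Fraction of relevant docs that appear in the top k."""
    relevant = {doc_id for doc_id, gain in relevance.items() if gain > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)

def bootstrap_ci(per_query_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-query scores."""
    rng = random.Random(seed)
    n = len(per_query_scores)
    means = sorted(
        sum(rng.choices(per_query_scores, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Toy per-query NDCG@10 values for one direction, e.g. French query -> English doc.
fr_to_en_scores = [0.2, 0.4, 0.6, 0.8]
ci_low, ci_high = bootstrap_ci(fr_to_en_scores)
```

Run the same three functions once per (query language, document language) pair and keep the results in separate rows, so a regression in one direction is not averaged away by the other.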

Go further

How do you evaluate a cross-lingual reranker on your own corpus?

You can't trust English benchmarks here — build a small held-out set with queries in each language paired against your own documents, then measure NDCG@10 and recall per language pair. The asymmetric pairs (French query vs English doc) are where the modality gap shows up.

Related: Evaluating a reranker on your own data (playbook); NDCG@k; MTEB

Is hybrid search (BM25 + dense) still useful across languages?

BM25 falls apart cross-lingually because it depends on token overlap, and the tokens don't overlap across languages. The dense leg of hybrid search carries almost the entire weight in true CLIR; BM25 only helps when query and corpus share a language.

Related: Hybrid search end-to-end (playbook); BM25; First-pass retrieval

Can a reranker fix a weak first-pass multilingual embedder?

Partially — a strong multilingual cross-encoder can rescue many cases where the first-pass embedder ranked the right doc in the top 100 but not the top 10. It can't help if the relevant doc never made it into the candidate set at all.
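The ceiling is easy to see in a two-stage pipeline sketch. The scoring functions here are lookup tables with invented numbers standing in for a real embedder and cross-encoder, and `retrieve_then_rerank` is a hypothetical helper written for this example.

```python
def retrieve_then_rerank(query, corpus, first_pass_score, rerank_score,
                         n_candidates=100, top_k=10):
    """First pass selects candidates; the reranker only reorders those candidates."""
    candidates = sorted(
        corpus, key=lambda doc: first_pass_score(query, doc), reverse=True
    )[:n_candidates]
    return sorted(
        candidates, key=lambda doc: rerank_score(query, doc), reverse=True
    )[:top_k]

# Hypothetical scores: the embedder under-ranks the relevant doc, the reranker does not.
embedder_scores = {"d1": 0.9, "d2": 0.8, "relevant": 0.7, "d4": 0.1}
reranker_scores = {"d1": 0.4, "d2": 0.3, "relevant": 0.95, "d4": 0.1}
corpus = list(embedder_scores)

def first_pass(query, doc):
    return embedder_scores[doc]

def rerank(query, doc):
    return reranker_scores[doc]

# Relevant doc is third in the first pass: it is rescued with 3 candidates, lost with 2.
rescued = retrieve_then_rerank("q", corpus, first_pass, rerank, n_candidates=3, top_k=1)
missed = retrieve_then_rerank("q", corpus, first_pass, rerank, n_candidates=2, top_k=1)
```

First-pass recall at the candidate cutoff is therefore an upper bound on what the reranked list can contain, which is why recall@100 is worth tracking alongside NDCG@10.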

Related: Reranker; Cross-encoder; First-pass retrieval
