- zembed-1 retains the #1 overall embedding model position, outperforming harrier-27b on average NDCG@10 (0.701 vs 0.699) and Recall@100 (0.750 vs 0.728)
- On a per-dataset basis, zembed-1 wins 14 out of 24 datasets against harrier-27b on NDCG@10
- voyage-4 and harrier-27b are neck-and-neck for the #2 spot — voyage-4 edges it out 12–11 on dataset wins
- The Harrier family scales well internally (270M → 0.6B → 27B), but even the largest variant doesn’t close the gap to zembed-1
- Explore the full interactive dashboard →
zembed-1 vs Harrier
A New Challenger, Evaluated Properly
Harrier is a recently released family of open-weight embedding models from Microsoft (finetuned Gemma and Qwen models), spanning three sizes: 270M, 0.6B, and 27B parameters. The largest variant — harrier-27b — has generated well-deserved attention: on (binary) MTEB, it ranked first among embedding models at the time of its release.
But as we explored in Beyond Binary, MTEB has a discrimination problem: given its (overwhelmingly) binary annotations, it can’t tell the difference between a document which perfectly answers a query and one which may only tangentially address it. So we ran all three Harrier models through the same graded evaluation pipeline we use for our evals dashboard — 24 diverse datasets, three independent LLM judges, continuous relevance scores from 0 to 10.
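Graded labels change what the metric rewards. Below is a minimal sketch of graded NDCG@10 in the standard formulation; the three-judge averaging step is an illustrative simplification, not our pipeline's exact aggregation logic, and the scores are made up:

```python
import math

def dcg_at_k(relevances, k):
    # Graded relevance discounted by log2 of the rank position (rank 1 -> log2(2))
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    # Normalize by the DCG of the ideal (descending) ordering
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal else 0.0

# Hypothetical scores from three independent judges for the top 3 retrieved docs,
# averaged into one continuous 0-10 label per document
judge_scores = [[9, 8, 10], [2, 3, 1], [7, 6, 8]]
labels = [sum(s) / len(s) for s in judge_scores]
print(ndcg_at_k(labels, k=10))  # < 1.0: a mildly relevant doc is ranked too high
```

With binary labels, documents scored 7/10 and 10/10 would both collapse to "relevant" and the metric could not prefer one ordering over the other; continuous labels preserve that distinction.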
The question is not whether harrier-27b is a good model. It is. (And at 27 billion parameters and a whopping 5,376-dimensional output vector, we would certainly hope so.) But is it the best?
The Three-Model Problem: MTEB Evals
On the global average across all 24 evaluation datasets, there are three embedding models which markedly outperform the rest:
| Model | NDCG@10 | Recall@10 | Recall@100 |
|---|---|---|---|
| zembed-1 | 0.701 | 0.454 | 0.750 |
| voyage-4 | 0.699 | 0.457 | 0.731 |
| harrier-27b | 0.699 | 0.456 | 0.728 |
Below those, qwen3-4b, cohere-embed-v4, jina-v5-text-small, and openai-v3-large (in that order) form a second-tier cluster. But if you need top-tier accuracy, the choice comes down to that trio.
So what separates them? On NDCG@10, very little — 0.2 points across the trio (though it’s worth noting that harrier-27b still comes out last). But NDCG@10 is not the whole story.
On Recall@100 — the metric that determines whether a relevant document even makes it to your reranker — zembed-1 leads by +1.9 points over voyage-4, and +2.2 over harrier-27b.
That is where the separation becomes real. A reranker or other downstream system can reorder or rework whatever the embedding model surfaces, but it cannot conjure up a document the embedder failed to retrieve. zembed-1’s recall advantage compounds downstream: fewer relevant documents lost at the first stage means a strictly better candidate set for everything that follows.
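The point is mechanical, not rhetorical: a reranker can only permute the candidate set it is given, so first-stage recall is a hard ceiling on the whole pipeline. A toy illustration with hypothetical document IDs:

```python
def recall_at_k(ranking, relevant, k):
    # Fraction of relevant documents that appear in the top k of the ranking
    return len(set(ranking[:k]) & relevant) / len(relevant)

relevant = {"d1", "d2", "d3", "d4"}           # hypothetical relevant docs
first_stage = ["d1", "d9", "d3", "d7", "d2"]  # embedder's candidate set ("d4" missed)
reranked = ["d2", "d1", "d3", "d9", "d7"]     # any permutation a reranker could emit

# "d4" was never retrieved, so no downstream reordering can recover it:
assert recall_at_k(first_stage, relevant, 5) == 0.75
assert recall_at_k(reranked, relevant, 5) == 0.75
```

Precision-oriented metrics can improve after reranking; recall against the full relevant set cannot.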
Head-to-Head: zembed-1 vs harrier-27b
Averages, of course, can obscure as much as they reveal. So let us go dataset by dataset. Our evals dashboard covers 24 datasets drawn from three MTEB task categories — retrieval, reranking, and instruction retrieval — spanning legal (AILAStatutes, LegalBench), medical (CovidRetrieval, TRECCOVID), multilingual (MIRACL, MLQA, Belebele, WikipediaRetrieval), and technical domains (StackOverflowQA, SCIDOCS), among others.
Across 24 evaluation datasets, zembed-1 outperforms harrier-27b on NDCG@10 on 14 of them.
The pattern of where each model wins is telling. zembed-1 dominates on instruction retrieval (Core17, News21, Robust04 — tasks which require parsing nuanced query intent, not merely matching keywords), medical and legal domains (CovidRetrieval, LegalBench, TRECCOVID), and technology (StackOverflowQA). harrier-27b, for its part, shows strength on scientific document similarity (SCIDOCS), multilingual reranking (RuBQReranking, a Russian reranking benchmark), and a handful of niche datasets (TwitterHjerneRetrieval, Danish Twitter retrieval).
| Dataset | zembed-1 NDCG@10 | harrier-27b NDCG@10 | Delta (points) |
|---|---|---|---|
| Core17InstructionRetrieval | 0.899 | 0.837 | +6.2 |
| Robust04InstructionRetrieval | 0.857 | 0.788 | +6.9 |
| TRECCOVID | 0.922 | 0.871 | +5.1 |
| News21InstructionRetrieval | 0.919 | 0.910 | +0.8 |
| LEMBPasskeyRetrieval | 0.891 | 0.825 | +6.6 |
| CovidRetrieval | 0.820 | 0.796 | +2.3 |
| AlloprofReranking | 0.851 | 0.832 | +1.9 |
| LegalBenchCorporateLobbying | 0.875 | 0.860 | +1.5 |
| StackOverflowQA | 0.695 | 0.651 | +4.4 |
| T2Reranking | 0.804 | 0.794 | +1.0 |
| MIRACLRetrievalHardNegatives | 0.531 | 0.526 | +0.5 |
| MLQARetrieval | 0.034 | 0.029 | +0.5 |
| WikipediaRetrievalMultilingual | 0.778 | 0.774 | +0.5 |
| VoyageMMarcoReranking | 0.732 | 0.739 | -0.7 |
| StatcanDialogueDatasetRetrieval | 0.723 | 0.742 | -1.9 |
| TwitterHjerneRetrieval | 0.694 | 0.775 | -8.1 |
| SCIDOCS | 0.540 | 0.623 | -8.3 |
| RuBQReranking | 0.736 | 0.801 | -6.5 |
| AILAStatutes | 0.700 | 0.740 | -4.0 |
| WikipediaRerankingMultilingual | 0.596 | 0.626 | -3.0 |
| ArguAna | 0.564 | 0.566 | -0.3 |
| BelebeleRetrieval | 0.073 | 0.073 | 0.0 |
| HagridRetrieval | 0.897 | 0.899 | -0.2 |
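For the skeptical reader, the tally can be recomputed directly from the rows above (the table lists 23 of the 24 datasets):

```python
# (dataset, zembed-1, harrier-27b) NDCG@10, copied from the table above
rows = [
    ("Core17InstructionRetrieval", 0.899, 0.837),
    ("Robust04InstructionRetrieval", 0.857, 0.788),
    ("TRECCOVID", 0.922, 0.871),
    ("News21InstructionRetrieval", 0.919, 0.910),
    ("LEMBPasskeyRetrieval", 0.891, 0.825),
    ("CovidRetrieval", 0.820, 0.796),
    ("AlloprofReranking", 0.851, 0.832),
    ("LegalBenchCorporateLobbying", 0.875, 0.860),
    ("StackOverflowQA", 0.695, 0.651),
    ("T2Reranking", 0.804, 0.794),
    ("MIRACLRetrievalHardNegatives", 0.531, 0.526),
    ("MLQARetrieval", 0.034, 0.029),
    ("WikipediaRetrievalMultilingual", 0.778, 0.774),
    ("VoyageMMarcoReranking", 0.732, 0.739),
    ("StatcanDialogueDatasetRetrieval", 0.723, 0.742),
    ("TwitterHjerneRetrieval", 0.694, 0.775),
    ("SCIDOCS", 0.540, 0.623),
    ("RuBQReranking", 0.736, 0.801),
    ("AILAStatutes", 0.700, 0.740),
    ("WikipediaRerankingMultilingual", 0.596, 0.626),
    ("ArguAna", 0.564, 0.566),
    ("BelebeleRetrieval", 0.073, 0.073),
    ("HagridRetrieval", 0.897, 0.899),
]

wins = sum(z > h for _, z, h in rows)
losses = sum(z < h for _, z, h in rows)
ties = sum(z == h for _, z, h in rows)
print(wins, losses, ties)  # 13 wins, 9 losses, 1 tie among the listed rows
```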
The Race for Second Place
As we established in our previous head-to-head, voyage-4 has been the reigning #2 embedding model. With harrier-27b now in the picture, that position is genuinely contested:
| | voyage-4 | harrier-27b |
|---|---|---|
| Average NDCG@10 | 0.699 | 0.699 |
| Average Recall@100 | 0.731 | 0.728 |
| Dataset wins (head-to-head) | 12 | 11 |
It is… remarkably close. voyage-4 holds its edge by a single dataset win and a slight recall advantage. The two models trade blows across domains, and depending on your vertical, either could be the better runner-up. (Neither, however, threatens first place.)
Harrier’s Scaling Story
One genuinely interesting aspect of the Harrier family is its range of sizes. The scaling is clean — and instructive:
| Model | Params | Avg NDCG@10 | Avg Recall@100 |
|---|---|---|---|
| harrier-270m | 270M | 0.619 | 0.658 |
| harrier-0.6b | 600M | 0.650 | 0.691 |
| harrier-27b | 27B | 0.699 | 0.728 |
A +3.1-point NDCG jump from 270M to 0.6B, then +4.9 points from 0.6B to 27B. Returns to scale are not completely diminishing — the largest absolute improvement comes from the largest model. Credit where credit is due: this is a well-executed scaling curve, particularly for harrier-0.6b, which is competitive with Cohere’s flagship embed-v4.
- harrier-270m (0.619) outperforms bge-m3 (0.580) and openai-v3-small (0.588) — entirely respectable for a 270M-parameter model
- harrier-0.6b (0.650) is competitive with cohere-embed-v4 (0.652)
- harrier-27b (0.699) enters the top three — but requires 27 billion parameters and 5,376-dimensional output vectors to get there, compared to zembed-1’s 4 billion parameters and 2,560 dimensions
The contrast in size between harrier-27b and all other models bears emphasis: 27 billion parameters is absolutely massive for an embedding model, and that’s not a compliment.
zembed-1 achieves its #1 ranking with 4 billion parameters and a 2,560-dimensional output. harrier-27b needs nearly 7x the parameter count and 2x the vector dimensionality to land 0.2% behind on NDCG@10. In a production setting — where embedding compute, storage costs, and index size are real constraints — the efficiency gap is hardly academic. Would you pay for a model with 7x higher inference cost, probably 7x as much latency, which outputs an embedding that’s twice as costly to store, just to get worse results?
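To put the storage half of that in numbers, here is a back-of-envelope index sizing, assuming float32 vectors and a hypothetical 10M-document corpus (quantization and index overhead change the constants, but not the 2.1x ratio):

```python
def index_size_gb(dims: int, n_docs: int, bytes_per_dim: int = 4) -> float:
    # float32 => 4 bytes per dimension; raw vectors only, no index overhead
    return dims * bytes_per_dim * n_docs / 1e9

N_DOCS = 10_000_000  # hypothetical corpus size
print(index_size_gb(2560, N_DOCS))  # zembed-1:    102.4 GB
print(index_size_gb(5376, N_DOCS))  # harrier-27b: 215.04 GB, 2.1x larger
```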
We wouldn’t.
What This Means
harrier-27b is a legitimate top-three embedding model — quite possibly the strongest new entrant we have seen since voyage-4. It is genuinely competitive, especially on multilingual reranking tasks, and we expect Microsoft will continue to iterate on the family.
But the leaderboard has not changed:
zembed-1 leads on average NDCG@10, wins 14 of 24 datasets head-to-head against harrier-27b, and holds the highest Recall@100 of any embedding model — at 1/7th the parameter count and half the vector dimensionality.
For the full interactive breakdown across all models, datasets, metrics, and reranker combinations, explore the evaluation dashboard.
Get Started
zembed-1 is available today through multiple deployment options:
```python
from zeroentropy import ZeroEntropy

zclient = ZeroEntropy()

response = zclient.models.embed(
    model="zembed-1",
    input_type="query",  # "query" or "document"
    input="What is retrieval augmented generation?",  # string or list[str]
    dimensions=2560,  # optional: must be one of [2560, 1280, 640, 320, 160, 80, 40]
    encoding_format="float",  # "float" or "base64"
    latency="fast",  # "fast" or "slow"; omit for auto
)
```

Documentation: docs.zeroentropy.dev
HuggingFace: huggingface.co/zeroentropy
Get in touch: Discord community or contact@zeroentropy.dev
Talk to us if you need a custom deployment, volume pricing, or want to see how zembed-1 + zerank-2 performs on your data.
