Approximate Nearest Neighbor (ANN)

Also known as: ANN, approximate nearest neighbor search, vector index, HNSW, IVF

TL;DR

ANN algorithms — HNSW, IVF, ScaNN — find the closest vectors to a query without scanning all of them. They give up a small slice of recall in exchange for orders-of-magnitude speedup.

Exact nearest-neighbor search over a vector corpus requires comparing the query to every document — work per query, billions of dot products at scale. Approximate nearest neighbor (ANN) algorithms trade a small amount of recall for orders-of-magnitude speedup by indexing the vectors into structures that prune the search space.

The three families that dominate production:

HNSW (Hierarchical Navigable Small World)

A multi-layer graph where each node is a vector and edges connect similar vectors. The top layer is sparse (long-range hops); each lower layer is denser. A query enters at the top, greedily walks toward the nearest neighbor, drops a layer, walks again, and so on. The result is a path of hops to land in the right neighborhood, then a local search of a few hundred candidates.

Key knobs:

M — neighbors per node (graph degree). More = better recall, more memory.
efConstruction — build-time search width. More = better graph quality.
efSearch — query-time search width. The recall/latency dial.

HNSW is the default for hnswlib, FAISS-HNSW, pgvector, Qdrant, Weaviate, and most vector DBs.

IVF (Inverted File)

K-means cluster the vectors into a few thousand cells (centroids). At query time, compare the query to each centroid, pick the top few cells, and exhaustively scan only those. Recall depends on nprobe — how many cells you visit.

IVF is much more memory-efficient than HNSW (no graph) and pairs naturally with product quantization (IVF-PQ). At billion-scale where RAM is the binding constraint, IVF-PQ is still the workhorse.

ScaNN

Google’s library, closer to IVF + asymmetric quantization with anisotropic loss tuning. Strong on TPU/GPU; less common in self-hosted stacks.

What “approximate” actually costs you

The curve is sharp: a small bump in efSearch (HNSW) or nprobe (IVF) recovers most of the lost recall at modest latency cost. Recall floors below 0.95 usually mean the index is misconfigured rather than fundamentally inadequate.

Where ANN gets harder

High dimensionality — see curse of dimensionality . Beyond ~2000 dims, distance concentration starts hurting graph navigation. MRL truncation helps.
Skewed corpora — if one cluster contains 80% of vectors, IVF degenerates. K-means can be replaced with hierarchical or balanced clustering.
Frequent updates — HNSW handles inserts well but deletes leave tombstones; periodic rebuilds matter at scale.
Filtered search — combining “near this query” with “where tenant = X” is non-trivial for graph indexes; pre-filtering, post-filtering, and partition strategies all have failure modes.

HNSW’s recall-vs-latency curve is dramatically better at moderate scale. For a corpus of 10M-100M vectors, HNSW with default parameters gives 0.99+ Recall@10 at sub-millisecond latency without tuning ritual. IVF requires picking centroids, picking nprobe, and re-tuning when the corpus shifts — and even tuned, its recall plateau is lower at the same latency budget.

Where the curves cross is memory. HNSW stores 32-64 neighbors per vector at the bottom layer. At a billion vectors with 768-dim float32 embeddings, the graph is several hundred GB and exceeds single-machine RAM. IVF’s only per-vector overhead is the cell assignment, so IVF-PQ scales to billions on commodity boxes where HNSW would need a sharded cluster.

The third option — flat search — is underrated under a few million vectors. Modern CPUs scan millions of 768-dim vectors per second; the engineering simplicity is worth a lot. The “you must use ANN” reflex is wrong below ~5M vectors for most workloads.

“Find the nearest 10 vectors where tenant_id = ‘acme’” is the bread-and-butter production query, and where naive ANN falls over. The graph or cluster structure was built without knowledge of the filter, so two failure modes appear.

Pre-filtering — apply the filter first, then search the subset — works only if the filter cuts the corpus to a size where exact search is fast, or if you maintain one HNSW per filter value. The latter explodes index count; the former silently degrades to brute force.

Post-filtering — search ANN, drop results that fail the filter — works only if the filter is satisfied by most vectors near the query. When the filter is rare, the ANN may scan thousands of neighbors before finding ten that match, blowing the latency budget.

The production answer is labeled HNSW (or filterable IVF) — bake filter awareness into the index structure. Vespa, Qdrant, Weaviate, and pgvector all implement variants. The catch: build-time filter selection has to align with query-time filter shape; index for tenant_id and query by user_id, you’re back to brute force.

ANN is what makes dense retrieval over a real corpus tractable. Without it you’re stuck at O(N) cosine and the only viable retrieval is BM25.

Go further

How much recall am I really giving up?

On a well-tuned HNSW index, Recall@10 vs exact search is typically 0.98-0.995. Below 0.95 something is misconfigured (M too low, efSearch too low, or the corpus is genuinely high-dimensional and concentrated). The recall/latency curve is sharp — you usually get the last few points of recall almost for free.

Recall@K Dense retrieval First-pass retrieval

HNSW vs IVF vs ScaNN — which one?

HNSW is the modern default: best recall/latency curve, easy to tune, the basis of FAISS-HNSW, hnswlib, pgvector. IVF (inverted file) wins on memory because it doesn't store a graph — useful at billion-scale where RAM is the constraint. ScaNN (Google) is closer to IVF + asymmetric quantization, with strong throughput on TPU/GPU.

Embedding quantization Curse of dimensionality

Does ANN compose with embedding quantization and Matryoshka?

Yes — and you should. Quantize the vectors to int8 or 1-bit to shrink the index 4-32×, optionally truncate to a lower MRL dimension to shrink it further, then build the ANN graph over the compact representation. A second-stage rescore with full-precision vectors recovers most of the lost recall.

Embedding quantization MRL / Matryoshka Embedding

← All concepts

The best AI teams build with ZeroEntropy models

Book Demo View docs