Vector Databases & ANN Indexes
How HNSW, IVF, and ScaNN trade recall for speed, why exact KNN doesn't scale, and how to choose between pgvector, Qdrant, and Pinecone for production use.
A flat array of embeddings scales to about 50k–100k vectors before linear-scan latency becomes a real problem. Above that, you need an index structure that trades a bounded amount of recall for O(log n) or better query time. That is what a vector database is: an ANN index wrapped in a storage and serving layer.
In the previous article we established how embedding models map text to high-dimensional vectors and why cosine similarity measures semantic proximity. But knowing how to produce a query embedding is half the problem. Efficiently finding the 5 closest vectors in a corpus of 10 million is the other half. This article covers the index structures that make that search practical — HNSW, IVF, and ScaNN — and the systems trade-offs of the major databases that host them.
Why exact KNN doesn’t scale
Brute-force k-nearest-neighbor — compute cosine similarity between the query vector and every stored vector, sort, return top-k — is exact and requires no preprocessing. It is also O(n·d): for 10M vectors at 1536 dimensions, that is 15 billion multiply-accumulate operations per query. On a modern CPU this takes hundreds of milliseconds; on a GPU, 10–50ms after batching overhead. For interactive search that is too slow, and for batch pipelines it burns compute budget fast.
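To make the baseline concrete, the entire exact search is a few lines of numpy. A minimal sketch, assuming the corpus rows are already L2-normalized so a dot product equals cosine similarity:

```python
import numpy as np

def exact_knn(query: np.ndarray, corpus: np.ndarray, k: int = 5) -> np.ndarray:
    """Exact top-k by cosine similarity; assumes all rows are L2-normalized.

    O(n*d) per query: every call touches every vector in the corpus.
    """
    scores = corpus @ query                 # (n,) dot products
    topk = np.argpartition(-scores, k)[:k]  # O(n) partial selection of k best
    return topk[np.argsort(-scores[topk])]  # sort only the k winners
```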
The trade-off ANN algorithms accept is “approximate” — they do not guarantee finding the exact nearest neighbor on every query. In practice, a well-tuned index returns 95–99% of the true top-k neighbors at 1/100th the compute cost. For retrieval in AI systems, that is almost always the right trade.
The distributed systems parallel
HNSW is a hierarchical skip list for vector space. A skip list maintains multiple levels of linked lists with exponentially decreasing density, enabling O(log n) traversal by jumping over large portions of the structure at higher levels. HNSW does the same in high-dimensional space: the top layers are a coarse graph connecting far-apart landmark nodes; the bottom layer is a dense proximity graph over the full corpus. A query enters at the top layer, greedily descends toward closer nodes at each level, and emerges at the bottom layer already near the target region.
IVF is a partitioned index — the k-means analogue of a hash-partitioned database table. Pre-cluster the corpus into nlist Voronoi cells. At query time, find the nearest centroids, then scan only those cells’ inverted lists. It is fast to build but recall depends entirely on how many cells you probe.
The filtered ANN problem is the composite-index problem from relational databases. If you need the 5 nearest embeddings for a given user, you cannot trivially layer a WHERE user_id = 42 filter on top of a graph-traversal index. Pre-filtering (restrict to matching rows, then do ANN) destroys graph connectivity when the filtered set is small. Post-filtering (run ANN for top-1000, then filter down) wastes work when most results fail the predicate. Databases with native filtered ANN thread the filter into the graph traversal itself — which ones do this well is one of the main differentiators below.
| Systems concept | Vector analogue |
|---|---|
| Skip list | HNSW layered graph — O(log n) traversal |
| Hash partition | IVF coarse quantizer — cluster, then scan |
| B-tree composite index | Filtered ANN — attribute + vector, jointly optimized |
| Full table scan | Flat/exact KNN — correct but O(n·d) per query |
HNSW
Hierarchical Navigable Small World (Malkov & Yashunin, 2016) is the dominant algorithm in production vector databases. Three parameters govern its behavior:
- `m` — bidirectional edges per node in the proximity graph. Higher `m` → better recall, more memory, slower inserts. Typical range: 16–64.
- `ef_construction` — size of the dynamic candidate list during index build. Higher → better graph quality, slower build. Typical: 64–200.
- `ef_search` — candidate list size during query traversal. Higher → better recall, slower queries. Crucially, you tune this at query time without rebuilding the index, giving you a runtime recall/latency knob.
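The three knobs map directly onto, for example, the hnswlib library's API. A sketch with illustrative sizes, not production settings:

```python
import hnswlib
import numpy as np

dim, n = 1536, 100_000
data = np.random.rand(n, dim).astype(np.float32)  # stand-in for real embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, M=16, ef_construction=200)  # build-time quality knobs
index.add_items(data, np.arange(n))

index.set_ef(100)  # ef_search: the runtime recall/latency knob
labels, distances = index.knn_query(data[:1], k=5)
```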
HNSW inserts are O(log n) and updates are supported (though expensive). The index is memory-resident by default: 10M vectors at float32, 1536 dimensions, with m=16 requires roughly 60 GB for raw vectors plus ~20% overhead for graph edges. Most databases now offer disk-backed variants (DiskANN-style indexes such as the pgvectorscale extension for Postgres, Qdrant's on-disk segments) for corpora that do not fit in RAM.
IVF and IVF-PQ
Inverted File Index trains a k-means quantizer to partition the space into nlist Voronoi cells, then stores an inverted list of which vectors belong to each cell. At query time, find the nearest nprobe centroids and scan only those lists.
- `nlist` — a common starting heuristic is `sqrt(n)` for a corpus of `n` vectors.
- `nprobe` — `nprobe=1` gives maximum speed but poor recall; `nprobe=nlist` degrades to a full scan. Typical production values: 10–50.
IVF is often combined with Product Quantization (IVF-PQ), which compresses each vector into a compact code by independently quantizing sub-vectors. A 1536-dim float32 vector (6 KB) compresses to 64–128 bytes — a 50–100× reduction — at a modest recall cost. IVF-PQ is the workhorse algorithm when the index must fit in a fraction of the raw vector memory.
The build process requires all vectors upfront for the k-means clustering step. IVF does not support incremental inserts well; HNSW does.
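A minimal IVF-PQ sketch using the faiss library (not covered above, but the reference implementation); sizes and parameters are illustrative, and note that `train` consumes the corpus before any vector can be added:

```python
import faiss
import numpy as np

d, n = 1536, 100_000
xb = np.random.rand(n, d).astype(np.float32)  # stand-in corpus
xb /= np.linalg.norm(xb, axis=1, keepdims=True)  # L2-normalize: L2 ranking == cosine ranking

nlist = 256                            # ~sqrt(n) Voronoi cells
m_sub = 96                             # 96 sub-vectors x 8 bits -> 96-byte codes
quantizer = faiss.IndexFlatL2(d)       # coarse quantizer over the centroids
index = faiss.IndexIVFPQ(quantizer, d, nlist, m_sub, 8)

index.train(xb)                        # k-means clustering needs the data upfront
index.add(xb)

index.nprobe = 16                      # cells scanned per query: the recall knob
dists, ids = index.search(xb[:1], 5)
```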
ScaNN
Google’s ScaNN (Scalable Nearest Neighbors) introduces anisotropic vector quantization: rather than minimizing uniform reconstruction error, it optimizes specifically for maximizing inner product between query and document vectors. For asymmetric retrieval — where query and document embeddings come from different distributions — this produces better recall per compression ratio than IVF-PQ. ScaNN is available as a Python library and powers several Google production search systems.
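A sketch following the builder pattern from ScaNN's documentation; leaf counts, thresholds, and the reorder depth are illustrative values, not tuned settings:

```python
import numpy as np
import scann

dataset = np.random.rand(100_000, 1536).astype(np.float32)
dataset /= np.linalg.norm(dataset, axis=1, keepdims=True)  # normalize for dot product

searcher = (
    scann.scann_ops_pybind.builder(dataset, 10, "dot_product")
    .tree(num_leaves=1000, num_leaves_to_search=100, training_sample_size=50_000)
    .score_ah(2, anisotropic_quantization_threshold=0.2)  # anisotropic quantization
    .reorder(100)  # exact re-scoring of the top candidates
    .build()
)
neighbors, distances = searcher.search(dataset[0])
```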
Code examples
pgvector — Postgres-native ANN
pgvector adds a vector type and HNSW/IVF indexes to Postgres. If you already run Postgres, this is the zero-new-infrastructure path.
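A minimal end-to-end sketch in Python via psycopg; the table name, dimensionality, and parameter values are illustrative:

```python
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("dbname=app", autocommit=True)
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)  # teaches psycopg to adapt the vector type

conn.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id bigserial PRIMARY KEY,
        content text,
        embedding vector(1536)
    )
""")
# HNSW index on cosine distance, with the build-time knobs discussed above
conn.execute("""
    CREATE INDEX IF NOT EXISTS documents_embedding_idx
    ON documents USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 200)
""")

query_vec = np.random.rand(1536).astype(np.float32)  # stand-in for a real query embedding
conn.execute("SET hnsw.ef_search = 100")             # runtime recall/latency knob
rows = conn.execute(
    "SELECT id, content FROM documents ORDER BY embedding <=> %s LIMIT 5",
    (query_vec,),  # <=> is cosine distance for vector_cosine_ops
).fetchall()
```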
Qdrant — native filtered ANN
Qdrant integrates payload filters directly into HNSW graph traversal, avoiding the post-filter recall degradation that plagues bolt-on approaches.
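A minimal sketch with the qdrant-client library; the collection name and payload fields are illustrative:

```python
import numpy as np
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, FieldCondition, Filter, MatchValue, PointStruct, VectorParams,
)

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)
client.upsert(
    collection_name="docs",
    points=[
        PointStruct(
            id=1,
            vector=np.random.rand(1536).tolist(),  # stand-in embedding
            payload={"user_id": 42, "text": "example document"},
        ),
    ],
)

# The filter is evaluated inside the HNSW traversal, not applied afterwards
hits = client.search(
    collection_name="docs",
    query_vector=np.random.rand(1536).tolist(),
    query_filter=Filter(must=[FieldCondition(key="user_id", match=MatchValue(value=42))]),
    limit=5,
)
```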
Choosing a vector database
| Database | Index | Hosting | Best for |
|---|---|---|---|
| pgvector | HNSW, IVF | Self-host / RDS / Supabase | Already on Postgres; corpora < 5M vectors |
| Qdrant | HNSW + native filters | Self-host / Qdrant Cloud | Filtered ANN, multi-tenant workloads, large collections |
| Pinecone | Proprietary ANN | Fully managed | Serverless, zero-ops, latency-sensitive at scale |
| Weaviate | HNSW | Self-host / Cloud | GraphQL interface, hybrid BM25+vector search built-in |
| Chroma | HNSW (hnswlib) | Self-host | Local dev, prototyping, < 1M vectors |
The decision is mostly an operational one. If your team already operates Postgres at the scale of your corpus, pgvector keeps the operational surface minimal and transaction semantics simple. If filtered search across large multi-tenant corpora is a first-class requirement, Qdrant’s native filter integration pays for the added service. If you want to eliminate index operations entirely from your on-call rotation, Pinecone is a fully managed API.
Trade-offs, failure modes, gotchas
Recall degrades silently. A misconfigured ef_search or nprobe can drop recall from 98% to 70% with no error and no warning — queries just start returning worse results. Measure recall@k against a held-out ground-truth set when you tune index parameters, and re-run that measurement whenever you change index config or corpus composition.
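A minimal recall@k measurement, assuming you keep a held-out set of queries whose true neighbors were computed once by brute force:

```python
import numpy as np

def recall_at_k(true_ids: np.ndarray, ann_ids: np.ndarray) -> float:
    """Average fraction of the true top-k that the ANN index returned.

    Both arrays have shape (n_queries, k); true_ids comes from an exact scan.
    """
    hits = sum(len(set(t) & set(a)) for t, a in zip(true_ids, ann_ids))
    return hits / true_ids.size
```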
Index build cost for HNSW is significant. Building HNSW over 10M vectors with m=16, ef_construction=200 can take 30–60 minutes on a single machine. For streaming ingestion pipelines, this matters: you cannot rebuild the index after every insert. Plan for background index builds and understand whether your database handles incremental HNSW inserts (most do, with per-insert cost proportional to m).
Filtered ANN is harder than it looks. Naive post-filtering: run ANN for top-500, then filter. If only 0.1% of the corpus belongs to the target user, top-500 may contain zero matching results. Pre-filtering restricts graph traversal to matching nodes — effective when the filter is selective, but requires the database to integrate filter logic into the traversal. pgvector 0.7+ supports iterative index scans that approximate this; Qdrant does it natively. Know which mode your database uses before designing a multi-tenant retrieval system.
Embedding drift invalidates every stored vector. Upgrading the embedding model — from text-embedding-3-small to text-embedding-3-large, or any other switch — makes all stored vectors incompatible with new query vectors. Cosine similarity across different model versions is meaningless. Treat a model change as a full corpus re-embed plus index rebuild: it is a breaking schema migration, not a rolling update. Version-lock the embedding model identifier in your application config.
Memory requirements are large and non-obvious. A 10M-vector HNSW index at float32, 1536 dims, m=16 requires ~60 GB for raw vectors plus ~10 GB for graph edges. Most teams underestimate this when sizing. Use dimension reduction (MRL truncation to 512 dims from text-embedding-3-small), IVF-PQ compression, or on-disk index variants when memory is constrained.
Don’t over-architect at small scale. For corpora under 100k vectors, a flat numpy scan or an unindexed pgvector column in Postgres is fast enough at < 5ms query latency. The operational complexity of a dedicated vector database is not justified until you hit meaningful scale or require filtered ANN with selective predicates.
Further reading
- ANN Benchmarks — the canonical benchmark for ANN algorithms: recall vs. queries-per-second across HNSW, IVF, ScaNN, Annoy, and others on standardized datasets. The first stop when choosing or tuning an index algorithm.
- Pinecone — What is a Vector Database? — a thorough walkthrough of how vector databases differ from traditional databases, covering HNSW, IVF-PQ, and the indexing trade-off space with useful diagrams.
- Eugene Yan — Applied ML — sustained writing on applied retrieval, two-stage ranking architectures, and lessons from running recommendation and search systems in production. The two-stage retrieval pattern (ANN recall → re-rank) is covered in depth.
- Chip Huyen — Building LLM Applications for Production — covers the full production retrieval stack with honest accounting of where latency and cost accumulate.
What to read next
- Chunking Strategies for Retrieval — the policy that decides what each row of your index actually contains: fixed-size, recursive, semantic, structural, and parent-document chunking, with the trade-offs spelled out.
- Text Embeddings: Turning Meaning into Geometry — the prerequisite: how embeddings are produced, what cosine similarity measures, and why L2-normalization matters before indexing.
- LLM Inference: Tokens, Context, and Sampling — the context window model that motivates retrieval in the first place: what fits in working memory and what must be fetched on demand.