jatin.blog ~ $
$ cat ai-engineering/vector-databases-ann.md

Vector Databases & ANN Indexes

How HNSW, IVF, and ScaNN trade recall for speed, why exact KNN doesn't scale, and how to choose between pgvector, Qdrant, and Pinecone for production use.

Jatin Bansal@blog:~/ai-engineering$ open vector-databases-ann

A flat array of embeddings scales to about 50k–100k vectors before linear-scan latency becomes a real problem. Above that, you need an index structure that trades a bounded amount of recall for O(log n) or better query time. That is what a vector database is: an ANN index wrapped in a storage and serving layer.

In the previous article we established how embedding models map text to high-dimensional vectors and why cosine similarity measures semantic proximity. But knowing how to produce a query embedding is half the problem. Efficiently finding the 5 closest vectors in a corpus of 10 million is the other half. This article covers the index structures that make that search practical — HNSW, IVF, and ScaNN — and the systems trade-offs of the major databases that host them.

Why exact KNN doesn’t scale

Brute-force k-nearest-neighbor — compute cosine similarity between the query vector and every stored vector, sort, return top-k — is exact and requires no preprocessing. It is also O(n·d): for 10M vectors at 1536 dimensions, that is 15 billion multiply-accumulate operations per query. On a modern CPU this takes hundreds of milliseconds; on a GPU, 10–50ms after batching overhead. For interactive search that is too slow, and for batch pipelines it burns compute budget fast.
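For intuition, here is what that exact baseline looks like: a minimal NumPy brute-force top-k over L2-normalized embeddings. The corpus and query below are random stand-ins, not real data.

python
# Exact KNN: one matrix-vector product, O(n·d), no index needed
import numpy as np

def exact_topk(corpus: np.ndarray, query: np.ndarray, k: int = 5) -> np.ndarray:
    # With L2-normalized rows, the dot product equals cosine similarity
    scores = corpus @ query
    # argpartition avoids a full O(n log n) sort; only the top-k gets ordered
    top = np.argpartition(-scores, k)[:k]
    return top[np.argsort(-scores[top])]

corpus = np.random.randn(100_000, 1536).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
query = corpus[0]  # stand-in query vector
print(exact_topk(corpus, query, k=5))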

The trade-off ANN algorithms accept is “approximate” — they do not guarantee finding the exact nearest neighbor on every query. In practice, a well-tuned index returns 95–99% of the true top-k neighbors at 1/100th the compute cost. For retrieval in AI systems, that is almost always the right trade.

The distributed systems parallel

HNSW is a hierarchical skip list for vector space. A skip list maintains multiple levels of linked lists with exponentially decreasing density, enabling O(log n) traversal by jumping over large portions of the structure at higher levels. HNSW does the same in high-dimensional space: the top layers are a coarse graph connecting far-apart landmark nodes; the bottom layer is a dense proximity graph over the full corpus. A query enters at the top layer, greedily descends toward closer nodes at each level, and emerges at the bottom layer already near the target region.

IVF is a partitioned index — the k-means analogue of a hash-partitioned database table. Pre-cluster the corpus into nlist Voronoi cells. At query time, find the nearest centroids, then scan only those cells’ inverted lists. It is fast to build but recall depends entirely on how many cells you probe.

The filtered ANN problem is the composite-index problem from relational databases. If you need the 5 nearest embeddings for a given user, you cannot trivially layer a WHERE user_id = 42 filter on top of a graph-traversal index. Pre-filtering (restrict to matching rows, then do ANN) destroys graph connectivity when the filtered set is small. Post-filtering (run ANN for top-1000, then filter down) wastes work when most results fail the predicate. Databases with native filtered ANN thread the filter into the graph traversal itself — which ones do this well is one of the main differentiators below.

Systems concept           Vector analogue
Skip list                 HNSW layered graph — O(log n) traversal
Hash partition            IVF coarse quantizer — cluster, then scan
B-tree composite index    Filtered ANN — attribute + vector, jointly optimized
Full table scan           Flat/exact KNN — correct but O(n·d) per query

HNSW

Hierarchical Navigable Small World (original paper) is the dominant algorithm in production vector databases. Three parameters govern its behavior:

  • m — bidirectional edges per node in the proximity graph. Higher m → better recall, more memory, slower inserts. Typical range: 16–64.
  • ef_construction — size of the dynamic candidate list during index build. Higher → better graph quality, slower build. Typical: 64–200.
  • ef_search — candidate list size during query traversal. Higher → better recall, slower queries. Crucially, you tune this at query time without rebuilding the index, giving you a runtime recall/latency knob.

HNSW inserts are O(log n) and updates are supported (though expensive). The index is memory-resident by default: 10M vectors at float32, 1536 dimensions, with m=16 requires roughly 60 GB for raw vectors plus ~20% overhead for graph edges. Most systems now offer on-disk variants (DiskANN-style indexes such as the pgvectorscale extension for Postgres, Qdrant's on-disk segments) for corpora that do not fit in RAM.
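To make the three knobs concrete, here is a minimal sketch using the standalone hnswlib library. The data is a random stand-in and the parameter values simply follow the typical ranges above.

python
# pip install hnswlib numpy
import hnswlib
import numpy as np

dim, n = 1536, 100_000
data = np.random.randn(n, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, M=16, ef_construction=100)  # build-time knobs
index.add_items(data, np.arange(n))  # incremental inserts are supported

index.set_ef(100)  # ef_search: the runtime recall/latency knob, no rebuild needed
labels, distances = index.knn_query(data[:1], k=5)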

IVF and IVF-PQ

Inverted File Index trains a k-means quantizer to partition the space into nlist Voronoi cells, then stores an inverted list of which vectors belong to each cell. At query time, find the nearest nprobe centroids and scan only those lists.

  • nlist: a common starting heuristic is sqrt(n) for a corpus of n vectors.
  • nprobe: nprobe=1 gives maximum speed but poor recall; nprobe=nlist degrades to a full scan. Typical production values: 10–50.

IVF is often combined with Product Quantization (IVF-PQ), which compresses each vector into a compact code by independently quantizing sub-vectors. A 1536-dim float32 vector (6 KB) compresses to 64–128 bytes — a 50–100× reduction — at a modest recall cost. IVF-PQ is the workhorse algorithm when the index must fit in a fraction of the raw vector memory.

The build process requires the vectors (or a large representative sample) upfront for the k-means clustering step. IVF does not support incremental inserts well: new vectors can be appended to existing lists, but the centroids go stale as the distribution shifts. HNSW handles incremental inserts natively.
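A minimal FAISS sketch of the IVF-PQ pipeline just described: train, add, then tune nprobe at query time. Random vectors stand in for real embeddings, and 128 dims keeps the demo small.

python
# pip install faiss-cpu numpy
import faiss
import numpy as np

d, n = 128, 1_000_000
xb = np.random.randn(n, d).astype(np.float32)

nlist = int(np.sqrt(n))            # ~1000 cells, the sqrt(n) heuristic
quantizer = faiss.IndexFlatL2(d)   # coarse quantizer assigns vectors to cells
# 16 sub-quantizers x 8 bits each -> 16-byte codes vs. 512-byte raw vectors
index = faiss.IndexIVFPQ(quantizer, d, nlist, 16, 8)

index.train(xb)   # k-means needs the vectors (or a large sample) upfront
index.add(xb)

index.nprobe = 20 # cells scanned per query: the recall/speed knob
D, I = index.search(xb[:1], 5)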

ScaNN

Google’s ScaNN (Scalable Nearest Neighbors) introduces anisotropic vector quantization: rather than minimizing uniform reconstruction error, it optimizes specifically for maximizing inner product between query and document vectors. For asymmetric retrieval — where query and document embeddings come from different distributions — this produces better recall per compression ratio than IVF-PQ. ScaNN is available as a Python library and powers several Google production search systems.
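The library exposes its tree/quantize/rescore pipeline through a builder-style API. A minimal sketch along the lines of the project's README, using random L2-normalized stand-in data and illustrative parameters:

python
# pip install scann numpy
import numpy as np
import scann

dataset = np.random.randn(100_000, 128).astype(np.float32)
dataset /= np.linalg.norm(dataset, axis=1, keepdims=True)

searcher = (
    scann.scann_ops_pybind.builder(dataset, 10, "dot_product")
    .tree(num_leaves=1000, num_leaves_to_search=100, training_sample_size=50_000)
    .score_ah(2, anisotropic_quantization_threshold=0.2)  # anisotropic hashing
    .reorder(100)  # exact re-scoring of the top candidates
    .build()
)
neighbors, distances = searcher.search(dataset[0])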

Code examples

pgvector — Postgres-native ANN

pgvector adds a vector type and HNSW/IVF indexes to Postgres. If you already run Postgres, this is the zero-new-infrastructure path.

python
# pip install psycopg2-binary pgvector numpy openai
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector
from openai import OpenAI

conn = psycopg2.connect("postgresql://localhost/mydb")
cur = conn.cursor()
# The extension must exist before register_vector can look up the vector type
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
conn.commit()
register_vector(conn)

cur.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id        serial PRIMARY KEY,
        chunk     text,
        embedding vector(1536)
    )
""")
# Build HNSW with cosine distance — ef_search is tunable per-query without rebuilding
cur.execute("""
    CREATE INDEX IF NOT EXISTS docs_hnsw
    ON documents USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64)
""")
conn.commit()

openai_client = OpenAI()

def upsert(chunk: str) -> None:
    resp = openai_client.embeddings.create(model="text-embedding-3-small", input=chunk)
    vec = np.array(resp.data[0].embedding)  # register_vector adapts numpy arrays
    cur.execute(
        "INSERT INTO documents (chunk, embedding) VALUES (%s, %s)",
        (chunk, vec),
    )
    conn.commit()

def search(query: str, top_k: int = 5) -> list[tuple[str, float]]:
    resp = openai_client.embeddings.create(model="text-embedding-3-small", input=query)
    qvec = np.array(resp.data[0].embedding)
    # ef_search > ef_construction is valid and improves recall at query time
    cur.execute("SET hnsw.ef_search = 100")
    cur.execute(
        """
        SELECT chunk, 1 - (embedding <=> %s::vector) AS score
        FROM documents
        ORDER BY embedding <=> %s::vector
        LIMIT %s
        """,
        (qvec, qvec, top_k),
    )
    return cur.fetchall()
typescript
// npm install pg openai
import pg from "pg";
import OpenAI from "openai";

const pool = new pg.Pool({ connectionString: "postgresql://localhost/mydb" });
const openai = new OpenAI();

// pgvector accepts vectors as '[f1, f2, ...]' string literals
const vecToSql = (v: number[]): string => `[${v.join(",")}]`;

async function setup(): Promise<void> {
  const client = await pool.connect();
  try {
    await client.query("CREATE EXTENSION IF NOT EXISTS vector");
    await client.query(`
      CREATE TABLE IF NOT EXISTS documents (
        id        serial PRIMARY KEY,
        chunk     text,
        embedding vector(1536)
      )
    `);
    await client.query(`
      CREATE INDEX IF NOT EXISTS docs_hnsw
      ON documents USING hnsw (embedding vector_cosine_ops)
      WITH (m = 16, ef_construction = 64)
    `);
  } finally {
    client.release();
  }
}

async function search(
  query: string,
  topK = 5
): Promise<Array<{ chunk: string; score: number }>> {
  const resp = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query,
  });
  const qvec = vecToSql(resp.data[0].embedding);
  const client = await pool.connect();
  try {
    await client.query("SET hnsw.ef_search = 100");
    const result = await client.query(
      `SELECT chunk, 1 - (embedding <=> $1::vector) AS score
       FROM documents ORDER BY embedding <=> $1::vector LIMIT $2`,
      [qvec, topK]
    );
    return result.rows.map((r) => ({ chunk: r.chunk, score: parseFloat(r.score) }));
  } finally {
    client.release();
  }
}

Qdrant — native filtered ANN

Qdrant integrates payload filters directly into HNSW graph traversal, avoiding the post-filter recall degradation that plagues bolt-on approaches.

python
# pip install qdrant-client openai
import uuid
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct,
    Filter, FieldCondition, MatchValue,
)
from openai import OpenAI

qdrant = QdrantClient(url="http://localhost:6333")
openai_client = OpenAI()
COLLECTION = "documents"

if not qdrant.collection_exists(COLLECTION):
    qdrant.create_collection(
        collection_name=COLLECTION,
        vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    )
    # Index the filter field so filtered HNSW traversal stays efficient
    qdrant.create_payload_index(
        collection_name=COLLECTION,
        field_name="user_id",
        field_schema="keyword",
    )

def upsert(chunk: str, user_id: str) -> None:
    resp = openai_client.embeddings.create(model="text-embedding-3-small", input=chunk)
    qdrant.upsert(
        collection_name=COLLECTION,
        points=[PointStruct(
            id=str(uuid.uuid4()),
            vector=resp.data[0].embedding,
            payload={"chunk": chunk, "user_id": user_id},
        )],
    )

def search(query: str, user_id: str, top_k: int = 5):
    resp = openai_client.embeddings.create(model="text-embedding-3-small", input=query)
    return qdrant.search(
        collection_name=COLLECTION,
        query_vector=resp.data[0].embedding,
        query_filter=Filter(
            must=[FieldCondition(key="user_id", match=MatchValue(value=user_id))]
        ),
        limit=top_k,
    )
typescript
// npm install @qdrant/js-client-rest openai
import { QdrantClient } from "@qdrant/js-client-rest";
import OpenAI from "openai";
import { randomUUID } from "crypto";

const qdrant = new QdrantClient({ url: "http://localhost:6333" });
const openai = new OpenAI();
const COLLECTION = "documents";

async function upsert(chunk: string, userId: string): Promise<void> {
  const resp = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: chunk,
  });
  await qdrant.upsert(COLLECTION, {
    points: [{
      id: randomUUID(),
      vector: resp.data[0].embedding,
      payload: { chunk, user_id: userId },
    }],
  });
}

async function search(query: string, userId: string, topK = 5) {
  const resp = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query,
  });
  return qdrant.search(COLLECTION, {
    vector: resp.data[0].embedding,
    filter: { must: [{ key: "user_id", match: { value: userId } }] },
    limit: topK,
  });
}

Choosing a vector database

Database    Index                   Hosting                       Best for
pgvector    HNSW, IVF               Self-host / RDS / Supabase    Already on Postgres; corpora < 5M vectors
Qdrant      HNSW + native filters   Self-host / Qdrant Cloud      Filtered ANN, multi-tenant workloads, large collections
Pinecone    Proprietary ANN         Fully managed                 Serverless, zero-ops, latency-sensitive at scale
Weaviate    HNSW                    Self-host / Cloud             GraphQL interface, hybrid BM25+vector search built-in
Chroma      HNSW (hnswlib)          Self-host                     Local dev, prototyping, < 1M vectors

The decision is mostly an operational one. If your team already operates Postgres at the scale of your corpus, pgvector keeps the operational surface minimal and transaction semantics simple. If filtered search across large multi-tenant corpora is a first-class requirement, Qdrant’s native filter integration pays for the added service. If you want to eliminate index operations entirely from your on-call rotation, Pinecone is a fully managed API.

Trade-offs, failure modes, gotchas

Recall degrades silently. A misconfigured ef_search or nprobe can drop recall from 98% to 70% with no error and no warning — queries just start returning worse results. Measure recall@k against a held-out ground-truth set when you tune index parameters, and re-run that measurement whenever you change index config or corpus composition.
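A minimal sketch of that measurement: recall@k as the overlap between the index's results and exact brute-force results, where both arguments are assumed to be arrays of neighbor IDs with one row per query.

python
import numpy as np

def recall_at_k(approx_ids: np.ndarray, exact_ids: np.ndarray, k: int) -> float:
    # Fraction of true top-k neighbors the ANN index returned, averaged over queries
    hits = [
        len(set(a[:k]) & set(e[:k])) / k
        for a, e in zip(approx_ids, exact_ids)
    ]
    return float(np.mean(hits))

# recall = recall_at_k(ann_results, brute_force_results, k=10)
# Alert if recall drops below your floor (e.g. 0.95) after any config change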

Index build cost for HNSW is significant. Building HNSW over 10M vectors with m=16, ef_construction=200 can take 30–60 minutes on a single machine. For streaming ingestion pipelines, this matters: you cannot rebuild the index after every insert. Plan for background index builds and understand whether your database handles incremental HNSW inserts (most do, with per-insert cost proportional to m).

Filtered ANN is harder than it looks. Naive post-filtering: run ANN for top-500, then filter. If only 0.1% of the corpus belongs to the target user, top-500 may contain zero matching results. Pre-filtering restricts graph traversal to matching nodes — effective when the filter is selective, but requires the database to integrate filter logic into the traversal. pgvector 0.8+ supports iterative index scans that approximate this; Qdrant does it natively. Know which mode your database uses before designing a multi-tenant retrieval system.
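In pgvector the iterative-scan behavior is opt-in per session. A sketch of the relevant settings (GUC names per the 0.8 release; verify against your installed version, and note the user_id column here is illustrative since the earlier schema did not include one):

python
# Iterative scans keep pulling from the HNSW index until LIMIT rows pass the filter
cur.execute("SET hnsw.iterative_scan = relaxed_order")  # off | strict_order | relaxed_order
cur.execute("SET hnsw.max_scan_tuples = 20000")         # cap on extra work per query
cur.execute(
    """
    SELECT chunk FROM documents
    WHERE user_id = %s
    ORDER BY embedding <=> %s::vector
    LIMIT 5
    """,
    (user_id, qvec),
)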

Embedding drift invalidates every stored vector. Upgrading the embedding model — from text-embedding-3-small to text-embedding-3-large, or any other switch — makes all stored vectors incompatible with new query vectors. Cosine similarity across different model versions is meaningless. Treat a model change as a full corpus re-embed plus index rebuild: it is a breaking schema migration, not a rolling update. Version-lock the embedding model identifier in your application config.
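One cheap guard, sketched here: persist the model identifier next to the index and refuse to serve queries when it disagrees with the application config. The metadata table and check are illustrative, not a standard.

python
EMBEDDING_MODEL = "text-embedding-3-small"  # version-locked in application config

def assert_embedding_model(cur) -> None:
    # Hypothetical one-row metadata table written at ingest time
    cur.execute("SELECT value FROM index_metadata WHERE key = 'embedding_model'")
    stored = cur.fetchone()[0]
    if stored != EMBEDDING_MODEL:
        raise RuntimeError(
            f"Index built with {stored!r} but app configured for {EMBEDDING_MODEL!r}; "
            "re-embed the corpus before serving queries"
        )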

Memory requirements are large and non-obvious. A 10M-vector HNSW index at float32, 1536 dims, m=16 requires ~60 GB for raw vectors plus ~10 GB for graph edges. Most teams underestimate this when sizing. Use dimensionality reduction (Matryoshka/MRL truncation of text-embedding-3-small down to 512 dims), IVF-PQ compression, or on-disk index variants when memory is constrained.

Don’t over-architect at small scale. For corpora under 100k vectors, a flat NumPy scan (like the exact_topk sketch earlier) or SQLite with a vector extension such as sqlite-vec is fast enough at < 5ms query latency. The operational complexity of a dedicated vector database is not justified until you hit meaningful scale or require filtered ANN with selective predicates.

Further reading

  • ANN Benchmarks — the canonical benchmark for ANN algorithms: recall vs. queries-per-second across HNSW, IVF, ScaNN, Annoy, and others on standardized datasets. The first stop when choosing or tuning an index algorithm.
  • Pinecone — What is a Vector Database? — a thorough walkthrough of how vector databases differ from traditional databases, covering HNSW, IVF-PQ, and the indexing trade-off space with useful diagrams.
  • Eugene Yan — Applied ML — sustained writing on applied retrieval, two-stage ranking architectures, and lessons from running recommendation and search systems in production. The two-stage retrieval pattern (ANN recall → re-rank) is covered in depth.
  • Chip Huyen — Building LLM Applications for Production — covers the full production retrieval stack with honest accounting of where latency and cost accumulate.