Hybrid Search: BM25 Meets Dense Vectors
Why dense retrieval misses rare terms and exact matches, how BM25 and embeddings fuse via RRF, and the hybrid patterns that ship in production.
A user types CVE-2024-3094 into your support search and gets back articles about supply-chain attacks in general — none of which mention that specific CVE. The dense retrieval index, tuned beautifully on semantic similarity, has done exactly what it was designed to do: returned the documents most like the query in meaning. The problem is the user did not want similar documents. They wanted the literal token. Hybrid search exists because the queries you actually receive in production are a mix of semantic intent and literal lookup, and no single retrieval model is good at both.
In the previous article we focused on the unit of retrieval — what each row of the vector index actually contains. Today’s piece sits one level up. Even with perfectly chunked content and a perfectly tuned ANN index, dense retrieval has a known failure mode: it underperforms on rare terms, exact identifiers, and out-of-distribution vocabulary. Sparse lexical retrieval has the mirror problem. Hybrid search is the standard production answer, and the interesting question is not whether to combine the two but how to combine them.
What hybrid search actually is
Hybrid search runs two retrievers in parallel against the same corpus and fuses their results into one ranked list.
- Sparse retrieval scores documents by lexical overlap with the query. BM25 is the dominant scoring function. The representation is a sparse vector over the vocabulary — most entries are zero, a few are weighted by term frequency and inverse document frequency.
- Dense retrieval scores by cosine similarity between query and document embeddings. The representation is a dense low-dimensional vector encoding meaning.
The signals are nearly orthogonal. A query for “how do I revoke an API key” hits dense retrieval well — synonyms like “rotate”, “deauthorize”, and “delete credential” cluster nearby in embedding space. A query for `revoke-key-v2 --org=acme-prod` hits BM25 well — the literal tokens are rare and discriminative, but their embedding is whatever the model makes of an opaque identifier, which is rarely anything useful.
The two retrievers do not need to agree on rank. They need to disagree productively, and the fusion step has to combine their outputs into something better than either alone.
The distributed systems parallel
The mental model that makes hybrid search click is the query planner with multiple access paths. A relational database faced with WHERE last_name = 'Bansal' AND created_at > '2024-01-01' has two candidate indexes — one on last_name, one on created_at — and a choice of strategies: pick the more selective index and post-filter, intersect the two index scans, or use a composite covering index that integrates both predicates.
Hybrid retrieval makes exactly that choice for the retrieval layer:
| Retrieval pattern | Database analogue |
|---|---|
| Dense only | Single secondary index, semantic axis |
| Sparse only (BM25) | Full-text inverted index, lexical axis |
| RRF / score fusion | Bitmap intersection across two indexes |
| Weighted score fusion | Cost-based ranking across heterogeneous indexes |
| Cross-encoder rerank on union | Re-evaluate predicates after a coarse fetch |
The fusion step is the merge join, and reciprocal rank fusion (RRF) is the variant that does not need calibrated scores — it works purely on rank position. That property is the whole reason it is the default in production: BM25 scores are unbounded sums of IDF-weighted, saturated term frequencies, cosine scores are bounded in [-1, 1], and the two live in different units. Anything that fuses them by raw value is doing dimensional analysis on numbers that are not commensurable. RRF sidesteps the problem entirely.
Mechanics: BM25 in one minute
BM25 scores a document D against a query Q by summing, over every term t in Q:
$$\mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot \frac{f(t, D) \cdot (k_1 + 1)}{f(t, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}$$
Three pieces are doing all the work.
- `IDF(t)` — inverse document frequency. Rare terms count for more. This is why `CVE-2024-3094` dominates the score: it appears in two documents and is therefore worth almost any number of “the”s.
- `f(t, D)` with `k1` saturation — term frequency contributes, but with diminishing returns. `k1` (typically 1.2–2.0) bends the curve so that the tenth occurrence of a term adds far less than the first.
- Length normalization with `b` — `b` (typically 0.75) penalizes long documents. Without it, longer documents win simply by accumulating term hits.
BM25 has no understanding of synonyms, paraphrase, or word order. It is exactly token-match scoring with the volume turned to “production-grade.” That is the strength and the weakness.
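To make the saturation and IDF effects concrete, here is a toy single-term scorer — a sketch of the formula above using a Lucene-style smoothed IDF, not any particular library's implementation:

```python
import math

def bm25_term_score(tf: int, df: int, n_docs: int,
                    doc_len: int, avg_len: float,
                    k1: float = 1.2, b: float = 0.75) -> float:
    """Contribution of one query term to one document's BM25 score."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))  # smoothed IDF
    norm = k1 * (1 - b + b * doc_len / avg_len)           # length penalty
    return idf * tf * (k1 + 1) / (tf + norm)

# A rare identifier (in 2 of 1M docs) vs a near-stop-word (in 900k of 1M):
print(bm25_term_score(tf=1,  df=2,       n_docs=1_000_000, doc_len=300, avg_len=300))
print(bm25_term_score(tf=10, df=900_000, n_docs=1_000_000, doc_len=300, avg_len=300))
# One hit on the rare term outscores ten hits on the common one by ~60x.
```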
Reciprocal rank fusion
RRF combines m ranked lists by scoring every document d as:
$$\mathrm{RRF}(d) = \sum_{i=1}^{m} \frac{1}{k + \mathrm{rank}_i(d)}$$
rank_i(d) is the document’s rank in list i (1-indexed), and k is a constant (60 is the standard default from the original paper). Documents that do not appear in a given list contribute zero from that list.
Three properties matter:
- Score-free. Only ranks are used, so heterogeneous scoring functions fuse without calibration.
- Top-rank dominance. With `k=60`, the document at rank 1 contributes 1/61, at rank 2 contributes 1/62, at rank 10 contributes 1/70. The curve is gentle — a top-5 / top-5 pair beats a top-1 in one list paired with a top-50 in the other, so consistent agreement across lists outranks a single spike. This is usually what you want.
- Tunable. Lower `k` (e.g. 10) sharpens top-rank dominance; higher `k` (e.g. 200) flattens it and lets mid-rank hits matter more (a quick numeric check follows this list).
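A toy check of the `k` effect, in plain Python (nothing library-specific):

```python
# RRF contribution of ranks 1, 5, and 50 under different k values.
for k in (10, 60, 200):
    print(k, [round(1 / (k + r), 4) for r in (1, 5, 50)])
# k=10:  [0.0909, 0.0667, 0.0167]  -> rank 1 is worth ~5x rank 50
# k=60:  [0.0164, 0.0154, 0.0091]  -> gentle slope
# k=200: [0.005, 0.0049, 0.004]    -> nearly flat
```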
When you genuinely have calibrated scores on both sides — for example, normalized cosine similarity and a min-max-rescaled BM25 — weighted score fusion can outperform RRF by 1–3% on well-instrumented benchmarks. The cost is that the weights need re-tuning every time you change the embedding model, the BM25 parameters, the corpus distribution, or the query mix. RRF is the safer default; weighted fusion is the squeeze when you have an eval set and want the last few points.
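For comparison, a minimal weighted-fusion sketch: min-max rescale each leg so the units match, then interpolate with a tuned weight (the `alpha` default here is purely illustrative):

```python
def minmax(scores: dict[str, float]) -> dict[str, float]:
    """Rescale raw scores to [0, 1] so the two legs are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in scores.items()}

def weighted_fusion(dense: dict[str, float], sparse: dict[str, float],
                    alpha: float = 0.7) -> list[str]:
    """alpha weights the dense leg; re-tune whenever either leg changes."""
    dn, sp = minmax(dense), minmax(sparse)
    ids = dn.keys() | sp.keys()
    return sorted(ids,
                  key=lambda d: alpha * dn.get(d, 0.0) + (1 - alpha) * sp.get(d, 0.0),
                  reverse=True)
```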
When sparse beats dense
Empirically, BM25 wins on:
- Named entities and rare proper nouns — `Llama-3.1-405B-Instruct`, `pgvector 0.7`, customer names.
- Identifiers and codes — order IDs, CVEs, SKUs, error codes, commit hashes.
- Domain jargon the embedding model hasn’t seen — internal product names, freshly coined terms.
- Exact-quote retrieval — `"the deployment failed at 03:14 UTC"`.
- Out-of-distribution queries — anything the embedding model wasn’t trained on (legal text, biomedical, multilingual gaps).
Dense wins on:
- Paraphrase — “how do I cancel” vs “I want to terminate my subscription”.
- Cross-lingual retrieval when the embedding model is multilingual.
- Conceptual queries — “what does idempotency mean for retries”.
- Short queries with high ambiguity that need semantic disambiguation.
The two failure modes are nearly complementary, which is why hybrid wins on virtually every general-purpose retrieval benchmark with a non-trivial test set (BEIR is the canonical reference).
Code: RRF over BM25 and dense
The shortest convincing implementation runs pgvector for dense retrieval and Postgres’s built-in tsvector/ts_rank for the lexical leg, then fuses with RRF in application code. ts_rank is not BM25 — it is a simpler lexeme-frequency score with no corpus-level IDF — but the fusion shape is identical to what you’d write against Elasticsearch.
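A minimal sketch, assuming a `chunks` table with a pgvector `embedding` column and a precomputed `tsv` tsvector column, psycopg for database access, and the OpenAI embeddings API (all names are illustrative):

```python
import psycopg
from openai import OpenAI

client = OpenAI()

DENSE_SQL = """
    SELECT id FROM chunks
    ORDER BY embedding <=> %s::vector
    LIMIT %s
"""

SPARSE_SQL = """
    SELECT id FROM chunks
    WHERE tsv @@ plainto_tsquery('english', %s)
    ORDER BY ts_rank(tsv, plainto_tsquery('english', %s)) DESC
    LIMIT %s
"""

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60, top_k: int = 10) -> list[str]:
    """Reciprocal rank fusion: rank-only, no score calibration needed."""
    scores: dict[str, float] = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

def hybrid_search(conn: psycopg.Connection, query: str,
                  candidates: int = 100, top_k: int = 10) -> list[str]:
    # Fetch a deep candidate pool from each leg, then fuse and truncate.
    emb = client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding
    vec = "[" + ",".join(map(str, emb)) + "]"   # pgvector text format
    with conn.cursor() as cur:
        cur.execute(DENSE_SQL, (vec, candidates))
        dense_ids = [row[0] for row in cur.fetchall()]
        cur.execute(SPARSE_SQL, (query, query, candidates))
        sparse_ids = [row[0] for row in cur.fetchall()]
    return rrf_fuse([dense_ids, sparse_ids], top_k=top_k)
```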
The TypeScript shape mirrors this against the pg driver and the OpenAI Node SDK:
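Again a sketch, under the same illustrative schema and candidate counts:

```typescript
import { Pool } from "pg";
import OpenAI from "openai";

const pool = new Pool();      // connection from PG* environment variables
const openai = new OpenAI();  // key from OPENAI_API_KEY

// Rank-only reciprocal rank fusion over any number of ranked id lists.
function rrfFuse(rankedLists: string[][], k = 60, topK = 10): string[] {
  const scores = new Map<string, number>();
  for (const ranked of rankedLists) {
    ranked.forEach((id, i) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, topK)
    .map(([id]) => id);
}

async function hybridSearch(query: string, candidates = 100, topK = 10) {
  const emb = (await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query,
  })).data[0].embedding;

  const dense = await pool.query(
    "SELECT id FROM chunks ORDER BY embedding <=> $1::vector LIMIT $2",
    [`[${emb.join(",")}]`, candidates],
  );
  const sparse = await pool.query(
    `SELECT id FROM chunks
     WHERE tsv @@ plainto_tsquery('english', $1)
     ORDER BY ts_rank(tsv, plainto_tsquery('english', $1)) DESC
     LIMIT $2`,
    [query, candidates],
  );
  return rrfFuse([dense.rows.map((r) => r.id), sparse.rows.map((r) => r.id)], 60, topK);
}
```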
In production, the database often does fusion natively. Elasticsearch 8.8+ exposes RRF as a first-class query stage; OpenSearch ships RRF and weighted score fusion via search pipelines; Weaviate’s hybrid query takes an `alpha` parameter that interpolates between BM25 (`alpha=0`) and dense (`alpha=1`); Qdrant’s Query API supports server-side fusion (RRF or DBSF) over multiple prefetch stages. Prefer the native primitive when one exists — it pushes the fusion into the engine, avoids extra round trips, and lets the query planner reason about candidate counts.
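As one concrete shape, a sketch against the Qdrant Python client’s Query API (v1.10+); collection setup, the dense embedding, and the sparse encoding are assumed to exist elsewhere, and `chunks`, `dense`, and `sparse` are illustrative names:

```python
from qdrant_client import QdrantClient, models

qdrant = QdrantClient(url="http://localhost:6333")

def hybrid_query(dense_vec: list[float],
                 sparse_vec: models.SparseVector,
                 top_k: int = 10):
    # Two prefetch legs (deep candidate pools) feed a server-side RRF stage.
    return qdrant.query_points(
        collection_name="chunks",
        prefetch=[
            models.Prefetch(query=dense_vec, using="dense", limit=100),
            models.Prefetch(query=sparse_vec, using="sparse", limit=100),
        ],
        query=models.FusionQuery(fusion=models.Fusion.RRF),
        limit=top_k,
    )
```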
Trade-offs, failure modes, gotchas
Candidate counts matter more than weights. The single most common bug in hybrid search is fusing a top-10 dense list with a top-10 sparse list and wondering why results are unstable. The intersection of two top-10 lists drawn from a 10M corpus is often empty. Fetch 50–200 candidates per retriever, fuse, then truncate to the final top_k. The cost is one extra database round trip’s worth of rows; the win is dramatically better fusion quality.
BM25 needs the tokenizer to make sense. Postgres `to_tsvector('english', ...)` lowercases, strips stop words, and stems. That destroys exact-identifier matching — the parser splits `CVE-2024-3094` into `cve`, `2024`, and `3094`, and those pieces may or may not survive stemming and stop-word removal in a form users can query. For identifier-heavy corpora, use a tokenizer that preserves token boundaries (Elasticsearch’s keyword analyzer, the `simple` text-search configuration in Postgres, or a dedicated BM25 library like Tantivy).
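You can inspect the tokenizer’s behavior directly; the `simple` configuration lowercases but does not stem or strip stop words:

```sql
-- Compare what each configuration does to an identifier-heavy string:
SELECT to_tsvector('english', 'CVE-2024-3094 backdoor in xz');
SELECT to_tsvector('simple',  'CVE-2024-3094 backdoor in xz');
-- If the resulting lexemes don't match how users type the identifier,
-- exact-lookup queries will silently miss.
```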
Stop words bite both retrievers. “The CEO” and “CEO” produce very different BM25 scores once “the” is stripped; the dense embeddings of “the CEO” and “CEO” are nearly identical. Inconsistent stop-word handling across the two retrievers shows up as random instability in results.
RRF is not free of the score problem when the two retrievers return very different list lengths. If your sparse retriever returns only 3 candidates because the query has 3 rare matching tokens, while your dense retriever returns 50, the fusion is dominated by the dense list. Pad the shorter list or cap the longer one — either way, make it a deliberate decision rather than an accident of defaults.
Evaluate hybrid on the queries it should help, not on average. On conversational queries, hybrid can actually lose to pure dense, because the BM25 leg adds noise without adding signal. Stratify your eval set: conversational, entity/identifier, exact-quote, and out-of-distribution. Hybrid should win or tie on every stratum; if it loses badly on one, your fusion weights or candidate counts are wrong.
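A stratified harness needs very little code. A sketch, assuming labeled `(query, stratum, relevant_ids)` triples and a `search_fn` that returns ranked ids:

```python
from collections import defaultdict

def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    return len(set(retrieved[:k]) & relevant) / max(len(relevant), 1)

def stratified_eval(labeled_queries, search_fn, k: int = 10) -> dict[str, float]:
    """Mean recall@k per stratum; a hybrid system should win or tie on all."""
    per_stratum = defaultdict(list)
    for text, stratum, relevant in labeled_queries:
        per_stratum[stratum].append(recall_at_k(search_fn(text), set(relevant), k))
    return {s: sum(vals) / len(vals) for s, vals in per_stratum.items()}
```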
Reranking changes the equation. When a cross-encoder reranker sits after retrieval, the role of fusion changes: you no longer need a great ordering, just a good union. In that regime, union-then-rerank often beats RRF-then-rerank, because the reranker is the final arbiter and just needs the right candidates in the pool. This is the cascade pattern; an upcoming article will cover it in depth.
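A sketch of union-then-rerank, using the sentence-transformers cross-encoder API (the model name is one common public choice; hit shapes are illustrative):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def union_then_rerank(query: str,
                      dense_hits: list[tuple[str, str]],
                      sparse_hits: list[tuple[str, str]],
                      top_k: int = 10) -> list[str]:
    """Fuse by union only; the cross-encoder produces the final order."""
    pool = dict(dense_hits + sparse_hits)          # dedupe on id, keep text
    ids, texts = list(pool.keys()), list(pool.values())
    scores = reranker.predict([(query, t) for t in texts])
    ranked = sorted(zip(ids, scores), key=lambda p: p[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_k]]
```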
Two indexes, two failure modes. Hybrid systems double the operational surface. Index builds, reindex cadence, schema migrations, and monitoring all double. For corpora under ~100k documents, the operational cost may exceed the quality gain — single-leg dense retrieval with a strong embedding model is often good enough.
Further reading
- Cormack et al. — Reciprocal Rank Fusion (SIGIR 2009) — the original RRF paper. Short, readable, and worth reading directly to internalize why a rank-only formula outperforms calibrated score fusion in practice.
- Pinecone — Hybrid Search — a thorough walkthrough of sparse-dense fusion, alpha tuning, and the embedding-side options for sparse-dense joint models like SPLADE.
- Elastic — Improving information retrieval in the Elastic Stack: hybrid retrieval — Elastic’s engineering blog on tuning RRF in production search, with eval numbers and a useful discussion of candidate-pool sizing.
- Weaviate — Hybrid Search Explained — a clear treatment of the `alpha` weighting model and when it makes sense versus RRF, from a vendor that exposes both.
What to read next
- Chunking Strategies for Retrieval — the policy that decides what each row of both the dense and sparse indexes contains. Hybrid quality is bounded above by chunk quality.
- Vector Databases & ANN Indexes — the dense leg’s implementation: HNSW, IVF, and the operational trade-offs of pgvector versus Qdrant versus Pinecone.
- Text Embeddings: Turning Meaning into Geometry — why dense retrieval works on paraphrase and fails on identifiers, grounded in the geometry of embedding space.
- LLM Inference: Tokens, Context, and Sampling — the context window model that determines how many fused candidates can actually fit in the generator prompt.