Hybrid Search: BM25 Meets Dense Vectors
Why dense retrieval misses rare terms and exact matches, how BM25 and embeddings fuse via RRF, and the hybrid patterns that ship in production.
A user types CVE-2024-3094 into your support search and gets back articles about supply-chain attacks in general — none of which mention that specific CVE. The dense retrieval index, tuned beautifully on semantic similarity, has done exactly what it was designed to do: returned the documents most like the query in meaning. The problem is the user did not want similar documents. They wanted the literal token. Hybrid search exists because the queries you actually receive in production are a mix of semantic intent and literal lookup, and no single retrieval model is good at both.
In the previous article we focused on the unit of retrieval — what each row of the vector index actually contains. Today’s piece sits one level up. Even with perfectly chunked content and a perfectly tuned ANN index, dense retrieval has a known failure mode: it underperforms on rare terms, exact identifiers, and out-of-distribution vocabulary. Sparse lexical retrieval has the mirror problem. Hybrid search is the standard production answer, and the interesting question is not whether to combine the two but how to combine them.
What hybrid search actually is
Hybrid search runs two retrievers in parallel against the same corpus and fuses their results into one ranked list.
- Sparse retrieval scores documents by lexical overlap with the query. BM25 is the dominant scoring function. The representation is a sparse vector over the vocabulary — most entries are zero, a few are weighted by term frequency and inverse document frequency.
- Dense retrieval scores by cosine similarity between query and document embeddings. The representation is a dense low-dimensional vector encoding meaning.
The signals are nearly orthogonal. A query for “how do I revoke an API key” hits dense retrieval well — synonyms like “rotate”, “deauthorize”, and “delete credential” cluster nearby in embedding space. A query for `revoke-key-v2 --org=acme-prod` hits BM25 well — the literal tokens are rare and discriminative, but their embedding is whatever the model makes of an opaque identifier, which is rarely anything useful.
The two retrievers do not need to agree on rank. They need to disagree productively, and the fusion step has to combine their outputs into something better than either alone.
The distributed systems parallel
The mental model that makes hybrid search click is the query planner with multiple access paths. A relational database faced with WHERE last_name = 'Bansal' AND created_at > '2024-01-01' has two candidate indexes — one on last_name, one on created_at — and a choice of strategies: pick the more selective index and post-filter, intersect the two index scans, or use a composite covering index that integrates both predicates.
Hybrid retrieval makes exactly that choice for the retrieval layer:
| Retrieval pattern | Database analogue |
|---|---|
| Dense only | Single secondary index, semantic axis |
| Sparse only (BM25) | Full-text inverted index, lexical axis |
| RRF / score fusion | Bitmap intersection across two indexes |
| Weighted score fusion | Cost-based ranking across heterogeneous indexes |
| Cross-encoder rerank on union | Re-evaluate predicates after a coarse fetch |
The fusion step is the merge join, and reciprocal rank fusion (RRF) is the variant that does not need calibrated scores — it works purely on rank position. That property is the whole reason it is the default in production: BM25 scores are unbounded sums of IDF-weighted, saturated term frequencies, cosine scores are bounded in [-1, 1], and the two live in different units. Anything that fuses them by raw value is doing dimensional analysis on numbers that are not commensurable. RRF sidesteps the problem entirely.
Mechanics: BM25 in one minute
BM25 scores a document D against a query Q by summing, over every term t in Q:
$$\mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot \frac{f(t, D) \cdot (k_1 + 1)}{f(t, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}$$
Three pieces are doing all the work.
- `IDF(t)` — inverse document frequency. Rare terms count for more. This is why `CVE-2024-3094` dominates the score: it appears in two documents and is therefore worth almost any number of “the”s.
- `f(t, D)` with `k1` saturation — term frequency contributes, but with diminishing returns. `k1` (typically 1.2–2.0) bends the curve so that the tenth occurrence of a term adds far less than the first.
- Length normalization with `b` — `b` (typically 0.75) penalizes long documents. Without it, longer documents win simply by accumulating term hits.
BM25 has no understanding of synonyms, paraphrase, or word order. It is exactly token-match scoring with the volume turned to “production-grade.” That is the strength and the weakness.
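To make the saturation and IDF effects concrete, here is a toy single-term scorer — a sketch of the formula above using a Lucene-style smoothed IDF, not any particular library's implementation:

```python
import math

def bm25_term_score(tf: int, df: int, n_docs: int,
                    doc_len: int, avg_len: float,
                    k1: float = 1.2, b: float = 0.75) -> float:
    """Contribution of one query term to one document's BM25 score."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))  # smoothed IDF
    norm = k1 * (1 - b + b * doc_len / avg_len)           # length penalty
    return idf * tf * (k1 + 1) / (tf + norm)

# A rare identifier (in 2 of 1M docs) vs a near-stop-word (in 900k of 1M):
print(bm25_term_score(tf=1,  df=2,       n_docs=1_000_000, doc_len=300, avg_len=300))
print(bm25_term_score(tf=10, df=900_000, n_docs=1_000_000, doc_len=300, avg_len=300))
# One hit on the rare term outscores ten hits on the common one by ~60x.
```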
Reciprocal rank fusion
RRF combines m ranked lists by scoring every document d as:
$$\mathrm{RRF}(d) = \sum_{i=1}^{m} \frac{1}{k + \mathrm{rank}_i(d)}$$
rank_i(d) is the document’s rank in list i (1-indexed), and k is a constant (60 is the standard default from the original paper). Documents that do not appear in a given list contribute zero from that list.
Three properties matter:
- Score-free. Only ranks are used, so heterogeneous scoring functions fuse without calibration.
- Top-rank dominance. With `k=60`, the document at rank 1 contributes 1/61, at rank 2 contributes 1/62, at rank 10 contributes 1/70. The curve is gentle — a top-5 / top-5 pair beats a top-1 in one list paired with a top-50 in the other, so consistent agreement across lists outranks a single spike. This is usually what you want.
- Tunable. Lower `k` (e.g. 10) sharpens top-rank dominance; higher `k` (e.g. 200) flattens it and lets mid-rank hits matter more (a quick numeric check follows this list).
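A toy check of the `k` effect, in plain Python (nothing library-specific):

```python
# RRF contribution of ranks 1, 5, and 50 under different k values.
for k in (10, 60, 200):
    print(k, [round(1 / (k + r), 4) for r in (1, 5, 50)])
# k=10:  [0.0909, 0.0667, 0.0167]  -> rank 1 is worth ~5x rank 50
# k=60:  [0.0164, 0.0154, 0.0091]  -> gentle slope
# k=200: [0.005, 0.0049, 0.004]    -> nearly flat
```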
When you genuinely have calibrated scores on both sides — for example, normalized cosine similarity and a min-max-rescaled BM25 — weighted score fusion can outperform RRF by 1–3% on well-instrumented benchmarks. The cost is that the weights need re-tuning every time you change the embedding model, the BM25 parameters, the corpus distribution, or the query mix. RRF is the safer default; weighted fusion is the squeeze when you have an eval set and want the last few points.
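For comparison, a minimal weighted-fusion sketch: min-max rescale each leg so the units match, then interpolate with a tuned weight (the `alpha` default here is purely illustrative):

```python
def minmax(scores: dict[str, float]) -> dict[str, float]:
    """Rescale raw scores to [0, 1] so the two legs are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in scores.items()}

def weighted_fusion(dense: dict[str, float], sparse: dict[str, float],
                    alpha: float = 0.7) -> list[str]:
    """alpha weights the dense leg; re-tune whenever either leg changes."""
    dn, sp = minmax(dense), minmax(sparse)
    ids = dn.keys() | sp.keys()
    return sorted(ids,
                  key=lambda d: alpha * dn.get(d, 0.0) + (1 - alpha) * sp.get(d, 0.0),
                  reverse=True)
```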
When sparse beats dense
Empirically, BM25 wins on:
- Named entities and rare proper nouns — `Llama-3.1-405B-Instruct`, `pgvector 0.7`, customer names.
- Identifiers and codes — order IDs, CVEs, SKUs, error codes, commit hashes.
- Domain jargon the embedding model hasn’t seen — internal product names, freshly coined terms.
- Exact-quote retrieval — `"the deployment failed at 03:14 UTC"`.
- Out-of-distribution queries — anything the embedding model wasn’t trained on (legal text, biomedical, multilingual gaps).
Dense wins on:
- Paraphrase — “how do I cancel” vs “I want to terminate my subscription”.
- Cross-lingual retrieval when the embedding model is multilingual.
- Conceptual queries — “what does idempotency mean for retries”.
- Short queries with high ambiguity that need semantic disambiguation.
The two failure modes are nearly complementary, which is why hybrid wins on virtually every general-purpose retrieval benchmark with a non-trivial test set (BEIR is the canonical reference).
Code: RRF over BM25 and dense
The shortest convincing implementation runs pgvector for dense retrieval and Postgres’s built-in tsvector/ts_rank for the lexical leg, then fuses with RRF in application code. ts_rank is not BM25 — it is a simpler lexeme-frequency score with no corpus-level IDF — but the fusion shape is identical to what you’d write against Elasticsearch.
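A minimal sketch, assuming a `chunks` table with a pgvector `embedding` column and a precomputed `tsv` tsvector column, psycopg for database access, and the OpenAI embeddings API (all names are illustrative):

```python
import psycopg
from openai import OpenAI

client = OpenAI()

DENSE_SQL = """
    SELECT id FROM chunks
    ORDER BY embedding <=> %s::vector
    LIMIT %s
"""

SPARSE_SQL = """
    SELECT id FROM chunks
    WHERE tsv @@ plainto_tsquery('english', %s)
    ORDER BY ts_rank(tsv, plainto_tsquery('english', %s)) DESC
    LIMIT %s
"""

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60, top_k: int = 10) -> list[str]:
    """Reciprocal rank fusion: rank-only, no score calibration needed."""
    scores: dict[str, float] = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

def hybrid_search(conn: psycopg.Connection, query: str,
                  candidates: int = 100, top_k: int = 10) -> list[str]:
    # Fetch a deep candidate pool from each leg, then fuse and truncate.
    emb = client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding
    vec = "[" + ",".join(map(str, emb)) + "]"   # pgvector text format
    with conn.cursor() as cur:
        cur.execute(DENSE_SQL, (vec, candidates))
        dense_ids = [row[0] for row in cur.fetchall()]
        cur.execute(SPARSE_SQL, (query, query, candidates))
        sparse_ids = [row[0] for row in cur.fetchall()]
    return rrf_fuse([dense_ids, sparse_ids], top_k=top_k)
```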
The TypeScript shape mirrors this against the pg driver and the OpenAI Node SDK:
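Again a sketch, under the same illustrative schema and candidate counts:

```typescript
import { Pool } from "pg";
import OpenAI from "openai";

const pool = new Pool();      // connection from PG* environment variables
const openai = new OpenAI();  // key from OPENAI_API_KEY

// Rank-only reciprocal rank fusion over any number of ranked id lists.
function rrfFuse(rankedLists: string[][], k = 60, topK = 10): string[] {
  const scores = new Map<string, number>();
  for (const ranked of rankedLists) {
    ranked.forEach((id, i) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, topK)
    .map(([id]) => id);
}

async function hybridSearch(query: string, candidates = 100, topK = 10) {
  const emb = (await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query,
  })).data[0].embedding;

  const dense = await pool.query(
    "SELECT id FROM chunks ORDER BY embedding <=> $1::vector LIMIT $2",
    [`[${emb.join(",")}]`, candidates],
  );
  const sparse = await pool.query(
    `SELECT id FROM chunks
     WHERE tsv @@ plainto_tsquery('english', $1)
     ORDER BY ts_rank(tsv, plainto_tsquery('english', $1)) DESC
     LIMIT $2`,
    [query, candidates],
  );
  return rrfFuse([dense.rows.map((r) => r.id), sparse.rows.map((r) => r.id)], 60, topK);
}
```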
In production, the database often does fusion natively. Elasticsearch 8.8+ exposes RRF as a first-class query stage; OpenSearch ships RRF and weighted score fusion via search pipelines; Weaviate’s hybrid query takes an `alpha` parameter that interpolates between BM25 (`alpha=0`) and dense (`alpha=1`); Qdrant’s Query API supports server-side fusion (RRF or DBSF) over multiple prefetch stages. Prefer the native primitive when one exists — it pushes the fusion into the engine, avoids extra round trips, and lets the query planner reason about candidate counts.
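As one concrete shape, a sketch against the Qdrant Python client’s Query API (v1.10+); collection setup, the dense embedding, and the sparse encoding are assumed to exist elsewhere, and `chunks`, `dense`, and `sparse` are illustrative names:

```python
from qdrant_client import QdrantClient, models

qdrant = QdrantClient(url="http://localhost:6333")

def hybrid_query(dense_vec: list[float],
                 sparse_vec: models.SparseVector,
                 top_k: int = 10):
    # Two prefetch legs (deep candidate pools) feed a server-side RRF stage.
    return qdrant.query_points(
        collection_name="chunks",
        prefetch=[
            models.Prefetch(query=dense_vec, using="dense", limit=100),
            models.Prefetch(query=sparse_vec, using="sparse", limit=100),
        ],
        query=models.FusionQuery(fusion=models.Fusion.RRF),
        limit=top_k,
    )
```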
Trade-offs, failure modes, gotchas
Candidate counts matter more than weights. The single most common bug in hybrid search is fusing a top-10 dense list with a top-10 sparse list and wondering why results are unstable. The intersection of two top-10 lists drawn from a 10M corpus is often empty. Fetch 50–200 candidates per retriever, fuse, then truncate to the final top_k. The cost is one extra database round trip’s worth of rows; the win is dramatically better fusion quality.
BM25 needs the tokenizer to make sense. Postgres `to_tsvector('english', ...)` lowercases, strips stop words, and stems. That destroys exact-identifier matching — the parser splits `CVE-2024-3094` into `cve`, `2024`, and `3094`, and those pieces may or may not survive stemming and stop-word removal in a form users can query. For identifier-heavy corpora, use a tokenizer that preserves token boundaries (Elasticsearch’s keyword analyzer, the `simple` text-search configuration in Postgres, or a dedicated BM25 library like Tantivy).
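You can inspect the tokenizer’s behavior directly; the `simple` configuration lowercases but does not stem or strip stop words:

```sql
-- Compare what each configuration does to an identifier-heavy string:
SELECT to_tsvector('english', 'CVE-2024-3094 backdoor in xz');
SELECT to_tsvector('simple',  'CVE-2024-3094 backdoor in xz');
-- If the resulting lexemes don't match how users type the identifier,
-- exact-lookup queries will silently miss.
```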
Stop words bite both retrievers. “The CEO” and “CEO” produce very different BM25 scores once “the” is stripped; the dense embeddings of “the CEO” and “CEO” are nearly identical. Inconsistent stop-word handling across the two retrievers shows up as random instability in results.
RRF is not free of the score problem when the two retrievers return very different list lengths. If your sparse retriever returns only 3 candidates because the query has 3 rare matching tokens, while your dense retriever returns 50, the fusion is dominated by the dense list. Pad the shorter list or cap the longer one — either way, make it a deliberate decision rather than an accident of defaults.
Evaluate hybrid on the queries it should help, not on average. On conversational queries, hybrid can actually lose to pure dense, because the BM25 leg adds noise without adding signal. Stratify your eval set: conversational, entity/identifier, exact-quote, and out-of-distribution. Hybrid should win or tie on every stratum; if it loses badly on one, your fusion weights or candidate counts are wrong.
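A stratified harness needs very little code. A sketch, assuming labeled `(query, stratum, relevant_ids)` triples and a `search_fn` that returns ranked ids:

```python
from collections import defaultdict

def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    return len(set(retrieved[:k]) & relevant) / max(len(relevant), 1)

def stratified_eval(labeled_queries, search_fn, k: int = 10) -> dict[str, float]:
    """Mean recall@k per stratum; a hybrid system should win or tie on all."""
    per_stratum = defaultdict(list)
    for text, stratum, relevant in labeled_queries:
        per_stratum[stratum].append(recall_at_k(search_fn(text), set(relevant), k))
    return {s: sum(vals) / len(vals) for s, vals in per_stratum.items()}
```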
Reranking changes the equation. When a cross-encoder reranker sits after retrieval, the role of fusion changes: you no longer need a great ordering, just a good union. In that regime, union-then-rerank often beats RRF-then-rerank, because the reranker is the final arbiter and just needs the right candidates in the pool. This is the cascade pattern; an upcoming article will cover it in depth.
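A sketch of union-then-rerank, using the sentence-transformers cross-encoder API (the model name is one common public choice; hit shapes are illustrative):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def union_then_rerank(query: str,
                      dense_hits: list[tuple[str, str]],
                      sparse_hits: list[tuple[str, str]],
                      top_k: int = 10) -> list[str]:
    """Fuse by union only; the cross-encoder produces the final order."""
    pool = dict(dense_hits + sparse_hits)          # dedupe on id, keep text
    ids, texts = list(pool.keys()), list(pool.values())
    scores = reranker.predict([(query, t) for t in texts])
    ranked = sorted(zip(ids, scores), key=lambda p: p[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_k]]
```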
Two indexes, two failure modes. Hybrid systems double the operational surface. Index builds, reindex cadence, schema migrations, and monitoring all double. For corpora under ~100k documents, the operational cost may exceed the quality gain — single-leg dense retrieval with a strong embedding model is often good enough.
Further reading
- Cormack et al. — Reciprocal Rank Fusion (SIGIR 2009) — the original RRF paper. Short, readable, and worth reading directly to internalize why a rank-only formula outperforms calibrated score fusion in practice.
- Pinecone — Hybrid Search — a thorough walkthrough of sparse-dense fusion, alpha tuning, and the embedding-side options for sparse-dense joint models like SPLADE.
- Elastic — Improving information retrieval in the Elastic Stack: hybrid retrieval — Elastic’s engineering blog on tuning RRF in production search, with eval numbers and a useful discussion of candidate-pool sizing.
- Weaviate — Hybrid Search Explained — a clear treatment of the `alpha` weighting model and when it makes sense versus RRF, from a vendor that exposes both.
What to read next
- Chunking Strategies for Retrieval — the policy that decides what each row of both the dense and sparse indexes contains. Hybrid quality is bounded above by chunk quality.
- Vector Databases & ANN Indexes — the dense leg’s implementation: HNSW, IVF, and the operational trade-offs of pgvector versus Qdrant versus Pinecone.
- Text Embeddings: Turning Meaning into Geometry — why dense retrieval works on paraphrase and fails on identifiers, grounded in the geometry of embedding space.
- LLM Inference: Tokens, Context, and Sampling — the context window model that determines how many fused candidates can actually fit in the generator prompt.