Reranking: Cross-Encoders and Cascades
Why cross-encoders dominate the precision stage of retrieval, when a reranker pays off, and how to compose cascades that respect the latency budget.
Your retriever returns ten candidates. Eyeballed, the top three look right. Then you run an eval set and the gold passage is at rank 7 — outside the slice you pack into the LLM prompt. The retriever wasn’t broken; it was working exactly as designed. Bi-encoder retrieval is a coarse score over a corpus of millions, optimized for recall, run in milliseconds. Asking it to also order the top-K perfectly is asking a hash lookup to be a careful read. The fix is a second stage: a smaller, slower, more careful model that sees query and candidate together and reorders them. That second stage is what people call a reranker, and on most production RAG benchmarks it is the single highest-leverage component you can add.
In the previous article we wrapped up the retrieval side: BM25 and dense vectors, fused via reciprocal rank fusion. That stage’s job is recall — pull a good pool of candidates from a large corpus cheaply. Today’s piece is the immediate next step: take that pool of 50–200 candidates and reorder it with a model that can afford to look much more carefully at each one. The hybrid article noted in passing that “union-then-rerank often beats RRF-then-rerank” because the reranker is the final arbiter. This is the article that earns that claim.
What a reranker is, precisely
A reranker is a function score(query, candidate) → relevance applied per-candidate at query time. It does not build an index; it does not enumerate the corpus. It receives a small list of candidates (already shortlisted by retrieval) and produces a per-pair relevance score that you sort on.
The defining property is the joint input: the model attends over query and candidate tokens together in the same forward pass. That is the entire architectural difference from the embedding-based retriever you put in front of it.
Bi-encoder vs cross-encoder, in one table
The retrieval models discussed in text embeddings are bi-encoders: two parallel encoders, one for queries, one for documents, with cosine similarity glued on top. Documents are embedded ahead of time. The query is embedded once at runtime. Comparison is a dot product. This factorization is what lets ANN over vector databases scale to billions of vectors: the document side is fully precomputable.
A cross-encoder breaks the factorization on purpose. The query and document are concatenated into a single input, fed through one transformer, and a regression head produces a single scalar relevance score. Every query token can attend to every document token. The model can see “the device fails at 03:14 UTC” in the document and notice that the question asked specifically about 3:14 UTC, not just device failure. That cross-attention is the entire reason cross-encoders outperform bi-encoders on fine-grained ranking — and it is also the entire reason they cannot be indexed.
| Property | Bi-encoder (retriever) | Cross-encoder (reranker) |
|---|---|---|
| Forward passes per query | 1 (the query) | N (one per candidate) |
| Document embeddings cached | Yes | No — joint input |
| Typical params | 100M–7B | 100M–9B |
| Typical use | Recall, fan-out over corpus | Precision, reorder top-K |
| Index? | Yes (HNSW, IVF, etc.) | No |
| Latency dominated by | ANN structure | Model size × candidate count |
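To make the factorization difference concrete, here is a minimal side-by-side sketch with sentence-transformers. The two model ids are small public stand-ins (all-MiniLM-L6-v2 and the MS MARCO MiniLM cross-encoder), not a recommendation for production:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "When does the device fail?"
docs = [
    "The device fails at 03:14 UTC when the heartbeat times out.",
    "Device enrollment requires a signed certificate.",
]

# Bi-encoder: documents embedded independently of the query; comparison is a cosine.
bi = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
doc_vecs = bi.encode(docs)        # precomputable offline, indexable
query_vec = bi.encode(query)      # one forward pass at query time
bi_scores = util.cos_sim(query_vec, doc_vecs)[0]

# Cross-encoder: one joint forward pass per (query, doc) pair; nothing precomputable.
cross = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
cross_scores = cross.predict([(query, d) for d in docs])

print(bi_scores.tolist(), cross_scores.tolist())
```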
This is not a strict dichotomy in practice. Late-interaction models like ColBERT and ColPali split the difference by keeping per-token embeddings on both sides and computing a MaxSim score at query time — more expressive than a single cosine, far cheaper than a full cross-encoder. They show up in the same pipeline slot a reranker would, with different cost characteristics.
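The late-interaction score itself is small enough to write down. A toy sketch of MaxSim over per-token embeddings that are assumed to be precomputed and normalized; this is the scoring rule, not a ColBERT implementation:

```python
import numpy as np

def maxsim(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """ColBERT-style MaxSim over (n_query_tokens, d) and (n_doc_tokens, d) embeddings."""
    sim = query_tokens @ doc_tokens.T      # all pairwise token similarities
    return float(sim.max(axis=1).sum())    # best doc token per query token, summed
```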
The distributed systems parallel
Reranking is the textbook two-stage query plan. A relational database faced with SELECT * FROM orders WHERE customer_id = ? AND status = 'open' ORDER BY created_at DESC LIMIT 10 does not scan the table; it uses an index on customer_id to fetch a candidate set, then re-evaluates the status predicate and sorts on created_at. The index gets the neighborhood right; the post-filter gets the answer right.
Retrieval is the index scan. Reranking is the post-filter. The bi-encoder is allowed to be approximate because the cross-encoder will re-check.
Another lens is cache hierarchy. A CPU’s L1 cache is fast and small; L2 is slower and larger; main memory is slowest and largest. A retrieval pipeline composes the same way in reverse — the largest, cheapest scan comes first; each successive stage is more expensive per item but operates on a smaller set. The cost-per-item climbs as the set shrinks, and the budget stays roughly constant per stage. That is the cascade pattern.
A typical four-stage cascade looks like:
| Stage | Function | Candidates | Per-item cost |
|---|---|---|---|
| 1. Retrieval | Hybrid (BM25 + dense ANN) | 10M → 100–200 | µs–ms |
| 2. Rerank | Cross-encoder | 100 → 25 | ms |
| 3. LLM-as-judge (optional) | Small LLM with rubric | 25 → 8 | 10s–100s of ms |
| 4. Generation | Main model with packed context | 8 → answer | seconds |
The numbers vary; the structure rarely does. Every stage cuts the set by roughly an order of magnitude and pays roughly an order of magnitude more per item. If two adjacent stages have the same cost-per-item, one of them is wasted work.
The 2026 reranker landscape
A few options dominate, split between hosted APIs and open-weights models you can self-host.
Hosted, closed weights:
- Cohere Rerank v3.5 — 4096-token context, multilingual (100+ languages), strong on BEIR and enterprise domains (finance, e-commerce). Priced at $2.00 per 1,000 searches at the time of writing. The default “just works” choice if you don’t want to think about it.
- Voyage rerank-2.5 and rerank-2.5-lite — 32K-token context (8× Cohere’s), instruction-following (you can steer ranking with natural-language criteria), and competitive accuracy on Voyage’s internal 93-dataset benchmark. The long context is the differentiator if your candidates are full pages rather than chunks.
Open weights, run yourself:
- BAAI bge-reranker-v2-m3 — 0.6B-parameter multilingual cross-encoder built on bge-m3. The standard self-hosted baseline. Easy to deploy, fast inference, well-understood.
- BAAI bge-reranker-v2.5-gemma2-lightweight — heavier 9B model on gemma2-9b with token-compression and layerwise pruning. Better quality, much higher cost.
- mixedbread mxbai-rerank-large-v2 — 1.5B-parameter model trained with a three-stage RL recipe (GRPO + contrastive + preference). Open weights, multilingual, 8K context.
- Jina reranker v2 base multilingual — 278M parameters, 1024-token context. Smaller and faster than the open-weights alternatives; good when latency is tight and corpora are short.
I won’t claim a single winner — the right choice depends on your domain, your latency budget, and whether you are willing to operate GPU inference yourself. The honest answer is: start with Cohere Rerank v3.5 if you are an API shop; start with bge-reranker-v2-m3 if you can self-host; run your own eval before betting the production system on either.
Latency math: where reranking fits in the budget
Per-pair cross-encoder cost depends on model size, sequence length, and hardware. Rough orders of magnitude on a modern GPU:
- Small (~22M MiniLM-class) cross-encoder: a few ms per pair; sub-second for 100 pairs even on CPU.
- Mid-size (300M–1.5B) cross-encoder: 5–20 ms per pair on GPU unbatched; with batching, 50–200 ms total for a top-50 rerank.
- Large (7B+ LLM-as-reranker): 50–200 ms per pair; reranking >20 candidates breaks an interactive budget.
A useful reference benchmark: published numbers for ms-marco-MiniLM-L6-v2 show roughly 12 ms for 1 document and ~740 ms for 100 documents on CPU; on GPU, a ~30-candidate rerank lands around 30–50 ms, versus 100–200 ms on CPU for a pool that size. The shape is linear in candidate count, which is why every cascade caps candidate count explicitly — the reranker has no “approximate” mode.
For an interactive RAG endpoint with a ~1.5 s total budget, you can typically afford to rerank 50–100 candidates with a mid-size open-weights model on GPU, or 100 candidates with a hosted API on the order of 100–300 ms. Beyond that, latency dominates and you need a smaller reranker, a smaller candidate pool, or batch processing.
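A back-of-envelope check makes the arithmetic concrete. The per-stage numbers below are assumptions picked from the ranges above, not benchmarks:

```python
BUDGET_MS = 1500  # interactive end-to-end target

stage_ms = {
    "hybrid retrieval (fetch 200)": 30,
    "cross-encoder rerank (100 pairs, GPU, batched)": 150,
    "generation (packed context)": 1200,
}

total = sum(stage_ms.values())
for name, ms in stage_ms.items():
    print(f"{name:48s} {ms:5d} ms")
print(f"{'total':48s} {total:5d} ms  (headroom: {BUDGET_MS - total} ms)")
```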
Code: hosted rerank with Cohere
The hosted path is the lowest-friction option. The Cohere Python SDK v2 exposes a single rerank call. Install: pip install cohere.
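A minimal sketch, assuming the v2 client (cohere.ClientV2) and the rerank-v3.5 model id; verify both against the current SDK and model list before relying on them:

```python
import os

import cohere

co = cohere.ClientV2(api_key=os.environ["CO_API_KEY"])

query = "When does the device fail?"
candidates = [
    "The device fails at 03:14 UTC when the heartbeat times out.",
    "Device enrollment requires a signed certificate.",
    "Failure rates by region are summarized in appendix B.",
]

response = co.rerank(
    model="rerank-v3.5",
    query=query,
    documents=candidates,
    top_n=3,  # how many reranked results to return
)

for result in response.results:
    # result.index points back into the candidates list you passed in
    print(f"{result.relevance_score:.3f}  {candidates[result.index]}")
```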
The TypeScript shape is symmetric via the Cohere TypeScript SDK. Install: npm install cohere-ai.
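A matching sketch, assuming the v2 TypeScript client (CohereClientV2) and camelCase parameter names that mirror the Python call:

```typescript
import { CohereClientV2 } from "cohere-ai";

const cohere = new CohereClientV2({ token: process.env.CO_API_KEY ?? "" });

const candidates = [
  "The device fails at 03:14 UTC when the heartbeat times out.",
  "Device enrollment requires a signed certificate.",
  "Failure rates by region are summarized in appendix B.",
];

async function main() {
  const response = await cohere.rerank({
    model: "rerank-v3.5",
    query: "When does the device fail?",
    documents: candidates,
    topN: 3,
  });

  for (const result of response.results) {
    // result.index points back into the candidates array
    console.log(result.relevanceScore.toFixed(3), candidates[result.index]);
  }
}

main();
```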
relevance_score is a value in [0, 1] that Cohere calibrates as a probability. You can threshold on it — drop candidates below 0.3 — though calibration in your domain may differ from the model’s training distribution and a thresholding cut typically wants its own held-out tuning.
Code: self-hosted rerank with sentence-transformers
When you want to run reranking locally — for cost, privacy, or because you are already on GPUs — the Sentence Transformers CrossEncoder wrapper is the standard. Install: pip install sentence-transformers torch.
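A minimal sketch around the CrossEncoder wrapper, using bge-reranker-v2-m3 as the model; the query and candidate strings are whatever your retriever returned:

```python
from sentence_transformers import CrossEncoder

# Load once at process start; model load is the multi-second part.
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", max_length=512)

def rerank(query: str, candidates: list[str], top_k: int = 25) -> list[tuple[str, float]]:
    # One (query, candidate) pair per scored item; cost is linear in len(candidates).
    pairs = [(query, passage) for passage in candidates]
    scores = reranker.predict(pairs, batch_size=32)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]
```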
The model is initialized once and reused across requests: loading is the slow part (multi-second), while per-batch inference is fast. batch_size=32 is a reasonable starting point on a single GPU; tune for memory and throughput. The max_length=512 cap matters: candidates longer than that get truncated, which is fine for short chunks and a problem for long passages (see gotchas below).
Cascade composition: a sketch
The whole pipeline ties together as:
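A sketch of the wiring. The four stage callables are hypothetical signatures standing in for your own retriever, reranker, judge, and generator, with the 200 → 25 → 8 pool sizes hard-coded for illustration:

```python
from typing import Callable, Sequence

# Hypothetical stage signatures; wire in your own implementations.
Retrieve = Callable[[str, int], list[str]]                 # (query, k) -> candidates
Rerank = Callable[[str, Sequence[str], int], list[str]]    # (query, candidates, top_k) -> shortlist
Judge = Callable[[str, Sequence[str], int], list[str]]     # (query, shortlist, keep) -> finalists
Generate = Callable[[str, Sequence[str]], str]             # (query, context) -> answer

def answer(query: str, retrieve: Retrieve, rerank: Rerank,
           judge: Judge, generate: Generate) -> str:
    candidates = retrieve(query, 200)            # stage 1: cheap recall, over-fetched
    shortlist = rerank(query, candidates, 25)    # stage 2: cross-encoder precision
    finalists = judge(query, shortlist, 8)       # stage 3 (optional): LLM-as-judge
    return generate(query, finalists)            # stage 4: generation over packed context
```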
The numbers 200 → 25 → 8 are not magic; they are the answer to “how much can I afford at each stage?” given your reranker latency and your end-to-end budget. Two principles set them:
- Stage 1 over-fetches by 5–10×. Hybrid retrieval is cheap; cross-encoder reranking is not. You want stage 1’s recall@retrieve_k to be very high (say 0.95) on your eval set, even if precision@k is poor. The reranker will sort it out.
- Stage 2 cuts to what the generator can plausibly use. If your generator packs 5–10 chunks into the prompt, the reranker output of 25 is a small over-fetch that lets a final lightweight dedup or diversity step have room to work.
Trade-offs, failure modes, gotchas
A reranker cannot fix recall. If the gold passage is not in the top-200 from retrieval, the reranker never sees it. Reranking amplifies retrieval quality; it does not substitute for it. Always measure recall@retrieve_k first.
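Measuring it takes a few lines. A minimal sketch assuming one gold passage id per eval query and retrieval runs stored as ordered id lists:

```python
def recall_at_k(runs: list[list[str]], golds: list[str], k: int) -> float:
    """Fraction of eval queries whose gold passage id appears in the top-k retrieved ids."""
    hits = sum(gold in retrieved[:k] for retrieved, gold in zip(runs, golds))
    return hits / len(golds)

# Sweep pool sizes to pick retrieve_k (and, later, the rerank pool size):
# for k in (10, 25, 50, 100, 200):
#     print(k, round(recall_at_k(runs, golds, k), 3))
```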
Token truncation is silent and lossy. Most hosted rerankers have 4–32K-token windows; most open-weights models default to 512 tokens. A 1,500-token document being reranked by a 512-token model is silently truncated to the first 512 tokens. If the relevant span is at the end, the reranker will score it as if the relevant span did not exist. The same chunking discipline from the chunking strategies article applies: keep candidates inside the reranker’s window or use a model with a longer one.
Scores are not always probabilities. Cohere calibrates [0, 1]. Open-weights cross-encoders trained on MS MARCO output raw logits — you can apply a sigmoid yourself, but the resulting numbers are pseudo-probabilities at best. Threshold cuts (“drop everything below 0.5”) need to be tuned per model and per domain. Sorting on the score is robust; absolute cutoffs are not.
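If you do want pseudo-probabilities from a logit-emitting model, the squash is one line; it is monotonic, so the ranking is unchanged and only absolute cutoffs are affected:

```python
import numpy as np

def to_pseudo_prob(logits: np.ndarray) -> np.ndarray:
    # Sigmoid over raw cross-encoder logits. Treat the output as uncalibrated:
    # tune any cutoff per model and per domain on held-out labels.
    return 1.0 / (1.0 + np.exp(-logits))
```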
Domain shift hits rerankers hard. A reranker trained primarily on web search and news will underperform on legal contracts, biomedical text, or your internal log lines. Voyage’s instruction-following models partially address this by letting you steer with natural-language criteria; for the rest, fine-tune on a domain-labeled set or accept the gap. The standard signal: rerank improves nDCG by 10+ points on MS MARCO and 0–2 points on your domain. That’s the gap.
LLM-as-reranker is a real option, with real cost. A frontier-model judge with a custom rubric is the strongest reranker available — it follows arbitrary criteria, can do multi-hop reasoning over candidates, and adapts instantly to new domains. It is also 100×–1000× the cost of a dedicated reranker per call. The pragmatic pattern is to use it in stage 3 over a small set (5–10 candidates) after a cheap cross-encoder has done stage 2’s 200 → 25 cut.
Top-K matters more than you think. Reranking the top 10 and reranking the top 100 are two different operations. The first is “polish what’s already good”; the second is “find the gem you missed.” If your retriever’s recall@10 is already 0.9, reranking top 10 barely moves the needle. If recall@100 is 0.95 but recall@10 is 0.6, reranking top 100 → 10 is the entire game. Pick the rerank pool size against retrieval’s recall curve, not by feel.
Cost dominates at scale. A hosted rerank at $2/1k searches is cheap per query and meaningful at 100M queries. A self-hosted GPU reranker has fixed amortized cost — at sufficient throughput, it is cheaper; below it, the GPU is idle and expensive. Run the math at your projected QPS before committing to either side.
Don’t double-rerank by accident. It’s easy to add a Cohere rerank and forget that LlamaIndex or LangChain wrapped the same call inside a higher-level abstraction. Two cross-encoder passes in a single query are silently expensive and almost never improve quality.
Further reading from the field
- Anthropic — Contextual Retrieval — the engineering post that quantified reranking’s contribution: a 49% retrieval-failure reduction from contextual retrieval alone, 67% with a Cohere reranker added. The clearest production case study you can read in 15 minutes.
- Jo Bergum (Vespa) — Minimizing LLM Distraction with Cross-Encoder Re-Ranking — Vespa’s chief scientist on why cross-encoders sit between retrieval and the LLM, and the declarative multi-phase ranking pattern Vespa exposes. Worth reading anything Bergum has written about ranking.
- Pinecone — Rerankers and Two-Stage Retrieval — a thorough walkthrough of cross-encoder reranking, including a worked example with sentence-transformers and an honest discussion of when reranking adds noise.
- Cohere — An Overview of Rerank — the canonical product reference; reads as documentation but contains useful evaluation patterns and integration sketches that generalize beyond Cohere’s specific model.
What to read next
- Hybrid Search: BM25 Meets Dense Vectors — the recall stage that feeds the rerank pool. Reranking quality is bounded above by the candidate set the retriever surfaces.
- Chunking Strategies for Retrieval — the chunking decision interacts with the reranker’s context window; the wrong chunk size silently truncates inside the rerank stage.
- Vector Databases & ANN Indexes — the bi-encoder infrastructure that reranking sits on top of, and the place where the over-fetch (retrieve_k) actually happens.
- Text Embeddings: Turning Meaning into Geometry — the bi-encoder side of the bi-encoder/cross-encoder split; reranking only makes sense as the second half of that pair.