jatin.blog ~ $
$ cat ai-engineering/query-transformations.md

Query Transformations: Rewriting, HyDE, and Multi-Query

The query-side preprocessing layer for RAG: how rewriting, HyDE, multi-query, decomposition, and step-back prompting trade cost for recall.

Jatin Bansal@blog:~/ai-engineering$ open query-transformations

A user asks: “why’s the dashboard broken?” Your retriever returns ten candidates, none of which contain the word dashboard. The corpus has the answer — six pages on “Grafana panel rendering errors when the Prometheus datasource times out” — but the query embedding landed in the wrong neighborhood of the vector space, and the right documents never made it into the candidate pool. No reranker recovers what retrieval never surfaced. The fix is not a better index; it’s a better query. Query transformations are the preprocessing layer that decides what you actually search for, given what the user typed.

The bridge from reranking

Yesterday’s piece on reranking ended with a hard truth: a reranker amplifies retrieval quality, it does not substitute for it. If the gold passage is not in the top-200, the cross-encoder never sees it. Query transformations are the complementary lever. Rather than improving the model that scores candidates, reshape the query so the candidate pool you fetch in the first place contains the right documents. Reranking is the precision stage; query transformation is the recall stage before retrieval, sitting one layer above the embedding model.

What a query transformation is

A query transformation is any preprocessing step that converts the user’s query into one or more new queries used at retrieval time. The user-facing query becomes a planning input; the actual ANN lookups run on derived queries. The transformation is almost always an LLM call — a small, fast model is enough — and its output is either text (a rewritten query, a hypothetical answer) or structured (a list of sub-questions, a decomposition tree).

The motivating asymmetry is vocabulary mismatch, the same problem classical IR has worried about since the 1990s, now expressed in vector form. Embeddings encode the lexical and structural patterns the model saw at training time. Questions tend to be short, terse, and missing context; documents tend to be long, declarative, and rich with surrounding context. Their embeddings land in different regions of the same space. Cross-encoder retraining helps; so does asymmetric embedding (separate query and document encoders, like the ones discussed in text embeddings). Query transformations attack the same problem from the prompt side: move the query closer to where the documents live.

The distributed-systems parallel

This is the rewrite phase of a SQL query optimizer. Before any physical plan is chosen, Postgres or CockroachDB rewrites the query — predicate pushdown, subquery flattening, view inlining, materialized-view substitution, constant folding. The semantics are preserved; the shape is changed so the execution engine can answer the request efficiently. Query transformations do the same to natural-language input: the user’s intent is preserved, the surface form is reshaped so the bi-encoder + ANN index can answer it.

Multi-query retrieval is also a textbook scatter-gather fan-out: issue N retrieval requests in parallel, merge the result sets, deduplicate, hand the union to the next stage. Sub-question decomposition is DAG execution with intermediate results — closer to a workflow than a single query, and it pays for that with latency.

The five families

I’ll walk the five patterns from cheapest to most expensive. They compose; production systems often run two or three in sequence.

1. Paraphrastic rewriting

Send the user query to a small LLM, ask for a cleaner or more search-friendly version. One query in, one query out. Useful when users phrase things conversationally — “hey, why’s the prod cluster acting weird?” → “production cluster degraded performance symptoms” — or when the query carries chat history that needs collapsing into a standalone question. Cheap, single extra hop, no downstream complexity. The pattern shows up under names like “query rewriting,” “query reformulation,” or “standalone question generation” depending on the framework.
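
A minimal sketch of the single-rewrite hop, reusing the same Anthropic SDK and small model the HyDE example below relies on; the system prompt wording here is illustrative, not a canonical recipe.

python
import anthropic

claude = anthropic.Anthropic()

REWRITE_SYSTEM = (
    "Rewrite the user's message as one standalone, search-friendly query. "
    "Resolve pronouns and fold any chat history into the question itself. "
    "Return only the rewritten query, nothing else."
)

def rewrite_query(query: str, chat_history: str = "") -> str:
    msg = claude.messages.create(
        model="claude-haiku-4-5",
        max_tokens=100,
        temperature=0,  # deterministic, so the rewrite is cacheable
        system=REWRITE_SYSTEM,
        messages=[{"role": "user", "content": f"{chat_history}\n\n{query}".strip()}],
    )
    return msg.content[0].text.strip()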

2. Multi-query

Ask the LLM for K paraphrases or angles on the query. Issue all K queries against the retriever in parallel. Fuse the result sets — usually by deduplicating on document ID, optionally by reciprocal rank fusion over the K rank lists. LangChain’s MultiQueryRetriever generates three variants by default. The win is recall when the user’s vocabulary is far from the corpus’s, or when the question has multiple latent interpretations. The cost is K× retrieval traffic and one extra LLM hop.

3. HyDE — Hypothetical Document Embeddings

The clever one. Ask the LLM to hallucinate a fake but plausible answer document for the query, then embed the hallucinated document and use that embedding for ANN lookup. The original Gao et al. paper, “Precise Zero-Shot Dense Retrieval without Relevance Labels”, shows HyDE matching fine-tuned retrievers in zero-shot settings across web search, QA, and fact verification, in multiple languages.

The insight is geometric: a fake answer lives in roughly the same region of embedding space as the real answer, while the question lives somewhere else. You are not asking the LLM to be right; you are asking it to sound like a document on the right topic. Even a half-hallucinated paragraph about Grafana panel rendering errors will embed much closer to the real ops post than the four-word user query does. The trade is one extra LLM hop and the risk that on niche domains (legal, biomedical, internal jargon) the hallucinated text drifts into a different region of the space than the real corpus occupies.

4. Sub-question decomposition

Multi-hop questions break the single-query assumption. “Which framework with under 10 GitHub stars per week of release age has the lowest p99 import time?” cannot be answered by retrieving one passage; it needs joins. Decomposition asks the LLM to split the query into atomic sub-questions, retrieves for each, runs a small per-sub-question generation pass, and synthesizes a final answer over the intermediate ones. LlamaIndex’s SubQuestionQueryEngine is the canonical implementation; the cost grows linearly in the number of sub-questions, both in latency and dollars, but multi-hop accuracy moves from near-zero to actually working. Conceptually related to least-to-most prompting — solve the easier sub-problems first, then compose.
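
A rough sketch of that loop, assuming a retrieve_and_answer(sub_question) callable you supply that wraps retrieval plus a per-sub-question generation pass; this is the shape of the pattern, not the LlamaIndex implementation.

python
import anthropic

claude = anthropic.Anthropic()

def _ask(system: str, user: str, max_tokens: int = 500) -> str:
    msg = claude.messages.create(
        model="claude-haiku-4-5",
        max_tokens=max_tokens,
        temperature=0,
        system=system,
        messages=[{"role": "user", "content": user}],
    )
    return msg.content[0].text

def decompose_and_answer(query: str, retrieve_and_answer) -> str:
    # 1. Split the question into atomic sub-questions, one per line.
    raw = _ask(
        "Split the question into the minimal set of atomic sub-questions needed "
        "to answer it. One per line, no numbering, no preamble.",
        query,
    )
    subs = [s.strip() for s in raw.splitlines() if s.strip()]
    # 2. Retrieve and answer each sub-question independently (caller-supplied).
    intermediate = [(s, retrieve_and_answer(s)) for s in subs]
    # 3. Synthesize the final answer over the intermediate results.
    context = "\n\n".join(f"Q: {s}\nA: {a}" for s, a in intermediate)
    return _ask(
        "Answer the original question using only the sub-question answers provided.",
        f"Original question: {query}\n\n{context}",
    )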

5. Step-back prompting

Zheng et al., “Take a Step Back”, proposes generating a more abstract version of the question alongside the original. Retrieve for both. The abstract version pulls in higher-level principles needed to interpret the concrete one. On PaLM-2L, the technique improved MMLU physics/chemistry by 7%/11% and TimeQA by 27%. The mechanism is recall: the abstract query surfaces the textbook explanation that the specific query missed, and the LLM gets to compose both at generation time.
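
The retrieval side is small. A sketch, assuming a retrieve(query, top_k) callable that returns dicts with an "id" field; the step-back prompt paraphrases the paper's idea rather than quoting it.

python
import anthropic

claude = anthropic.Anthropic()

STEP_BACK_SYSTEM = (
    "Rephrase the question as a more generic, higher-level question about the "
    "underlying concept or principle. Return only the rephrased question."
)

def step_back_retrieve(query: str, retrieve, top_k: int = 50) -> list[dict]:
    msg = claude.messages.create(
        model="claude-haiku-4-5",
        max_tokens=100,
        temperature=0,
        system=STEP_BACK_SYSTEM,
        messages=[{"role": "user", "content": query}],
    )
    abstract = msg.content[0].text.strip()
    # Keep the original query: the abstract one is a recall boost, not a replacement.
    seen, merged = set(), []
    for hit in retrieve(query, top_k) + retrieve(abstract, top_k):
        if hit["id"] not in seen:
            seen.add(hit["id"])
            merged.append(hit)
    return merged[:top_k]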

Code: HyDE in raw Python

One extra LLM call per query is too expensive for some pipelines and obviously worth it for others. The simplest implementation uses the Anthropic Python SDK for the hypothetical-document step and any embedding/ANN backend you already have. Install: pip install anthropic openai.

python
import anthropic
from openai import OpenAI

claude = anthropic.Anthropic()
openai = OpenAI()

HYDE_SYSTEM = (
    "Write a short, factual-sounding passage that would plausibly answer the "
    "user's question. It does not need to be true; it must sound like a real "
    "document on the same topic. 3-5 sentences. No preamble."
)

def hyde_embedding(query: str) -> list[float]:
    msg = claude.messages.create(
        model="claude-haiku-4-5",
        max_tokens=400,
        temperature=0,  # deterministic output, so the hypothetical doc is cacheable
        system=HYDE_SYSTEM,
        messages=[{"role": "user", "content": query}],
    )
    hypothetical = msg.content[0].text
    emb = openai.embeddings.create(
        model="text-embedding-3-small",
        input=hypothetical,
    )
    return emb.data[0].embedding

def hyde_search(query: str, index, top_k: int = 50):
    vec = hyde_embedding(query)
    # `index` is your ANN store; signature varies by provider.
    return index.query(vector=vec, top_k=top_k)

The small fast model (claude-haiku-4-5 here) is deliberate — the hypothetical only needs to land in the right neighborhood, not be correct. Latency budget: ~200–400 ms for the Haiku call plus your usual embedding + ANN time. Cache the hypothetical-doc embedding at temperature 0 to skip the LLM hop on repeat queries.

Code: multi-query retrieval in TypeScript

The TypeScript shape is symmetric. Using the Anthropic TypeScript SDK for the rewrite, fanning out queries, and merging by reciprocal rank fusion. Install: npm install @anthropic-ai/sdk.

typescript
import Anthropic from "@anthropic-ai/sdk";

const claude = new Anthropic();

const REWRITE_SYSTEM = `You generate exactly 4 alternative phrasings of the
user's question, optimized for search. Cover different angles: a more
specific phrasing, a more general phrasing, a synonym-swapped phrasing,
and a "what would a document about this look like" phrasing. Return one
per line, no numbering, no preamble.`;

async function generateQueries(query: string): Promise<string[]> {
  const msg = await claude.messages.create({
    model: "claude-haiku-4-5",
    max_tokens: 300,
    system: REWRITE_SYSTEM,
    messages: [{ role: "user", content: query }],
  });
  const text = msg.content[0].type === "text" ? msg.content[0].text : "";
  const variants = text.split("\n").map((s) => s.trim()).filter(Boolean);
  return [query, ...variants];
}

type Hit = { id: string; score: number };

function reciprocalRankFusion(rankLists: Hit[][], k = 60): Hit[] {
  // Each list contributes 1 / (k + rank + 1) per document; k = 60 damps the
  // advantage of a single rank-1 hit over consistent mid-rank appearances.
  const scores = new Map<string, number>();
  for (const list of rankLists) {
    list.forEach((hit, rank) => {
      scores.set(hit.id, (scores.get(hit.id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

async function multiQueryRetrieve(
  query: string,
  retrieve: (q: string) => Promise<Hit[]>,
  topK = 50,
): Promise<Hit[]> {
  const queries = await generateQueries(query);
  const rankLists = await Promise.all(queries.map(retrieve));
  return reciprocalRankFusion(rankLists).slice(0, topK);
}

Two things to notice. First, the fan-out is parallel: five queries cost roughly the latency of one (assuming your retriever isn’t the bottleneck). Second, the fusion is RRF on rank position — the same primitive we already used to combine BM25 and dense in hybrid search. RRF doesn’t care whether the rank lists came from different retrievers or from the same retriever run on different queries.

Composition with reranking

The dominant production pattern is transform → retrieve → rerank. The transformation widens the candidate pool (better recall at the retrieval step); the cross-encoder reranker reorders the wider pool (better precision at generation). The two stages are complementary because they fail in different directions: transformations risk over-fetching irrelevant neighborhoods, and reranking is exactly the mechanism that filters them back out.

text
user query
   ▼  (LLM, small + fast, temperature 0)
[ multi-query / HyDE / step-back ]      ← +1 LLM hop, parallelizable
   ▼  (hybrid: BM25 + dense, fused)
[ retrieval, candidate_k = 200-400 ]    ← wider pool than single-query baseline
   ▼  (cross-encoder)
[ rerank, top_k = 25 ]                  ← precision stage
[ generator ]

The candidate_k budget shifts when you add transformations. A baseline pipeline retrieving 200 candidates from one query becomes a pipeline retrieving 100 from each of three queries, deduplicating to ~200 unique. Same downstream cost; meaningfully different recall on the queries that were hurting before.
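
Wired together, the composed pipeline is a thin loop over pieces you already have. In the sketch below, make_variants, retrieve, rerank, and generate are placeholders for your own transformation, ANN store, cross-encoder, and generation stages, not functions defined elsewhere in this post.

python
def transformed_answer(query: str, make_variants, retrieve, rerank, generate,
                       per_query_k: int = 100, rerank_k: int = 25) -> str:
    # Transform: widen the pool with variants (multi-query, HyDE, step-back).
    variants = [query] + make_variants(query)
    # Retrieve: fan out and dedupe on document ID. Three queries at 100 each
    # lands near the old single-query 200-candidate budget after dedup.
    seen, pool = set(), []
    for q in variants:
        for hit in retrieve(q, per_query_k):
            if hit["id"] not in seen:
                seen.add(hit["id"])
                pool.append(hit)
    # Rerank: the cross-encoder reorders the widened pool (precision stage).
    top = rerank(query, pool)[:rerank_k]
    # Generate: the original query plus the reranked passages.
    return generate(query, top)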

Trade-offs, failure modes, gotchas

Every transformation is another LLM hop. Latency adds. A 250 ms Haiku call plus 100 ms of retrieval plus 200 ms of rerank plus 1.5 s of generation is a 2 s endpoint. Pick a small fast model (Haiku, GPT-5-Nano, Gemini Flash-Lite — whichever your stack already pays for) and cache the transformation output. At temperature 0 the rewrite is deterministic; a content-addressed cache keyed on hash(query) pays for itself within hours of traffic.
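
The cache itself can be as small as a dict keyed on a hash of the normalized query, swapped for Redis or similar in production; a sketch, assuming the transformation runs at temperature 0 as described above.

python
import hashlib

_cache: dict[str, str] = {}  # in production: Redis, memcached, or your KV store

def cached_transform(query: str, transform) -> str:
    # At temperature 0 the output is deterministic, so the query text alone is a valid key.
    key = hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = transform(query)
    return _cache[key]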

HyDE fails on niche domains. The hallucinated document only helps if it lands in roughly the same embedding region as the real corpus. On internal documentation full of product-specific jargon, on legal or biomedical text the LLM hasn’t memorized, the hallucination drifts and HyDE can actively hurt recall versus the plain query. Always A/B against the baseline on your eval set before turning it on.

Multi-query over-fetches without bound. K=3 is the common default; K=5 is already a lot. Beyond that, the marginal new documents per extra query approach zero while cost stays linear. The right K is the smallest one that closes your recall@candidate_k gap on the eval set, not the largest one your latency budget tolerates.

Decomposition is brittle on under-specified questions. “What’s the average rainfall in Seattle in Q3?” decomposes into useful sub-questions (“rainfall Seattle by month”, “definition of Q3 fiscal quarter”). “Why’s the dashboard broken?” decomposes into garbage (“what is a dashboard”). The decomposer is itself an LLM call that can be wrong, and its errors propagate to every retrieval downstream. Production systems gate decomposition behind a classifier (“is this multi-hop?”) rather than running it on every query.
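
One cheap way to build that gate, sketched with the same small model; a classifier trained on logged queries is the sturdier production option.

python
import anthropic

claude = anthropic.Anthropic()

def is_multi_hop(query: str) -> bool:
    msg = claude.messages.create(
        model="claude-haiku-4-5",
        max_tokens=5,
        temperature=0,
        system=(
            "Answer YES or NO only: does answering this question require "
            "combining facts from more than one document?"
        ),
        messages=[{"role": "user", "content": query}],
    )
    return msg.content[0].text.strip().upper().startswith("YES")

Run decomposition only when the gate says yes; everything else takes the single-query path.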

Step-back over-generalizes. Abstracting “why’s the prod cluster slow?” into “what causes cluster performance degradation?” might pull in 50 textbook posts on general distributed-systems performance, drowning the one runbook entry that actually answers the operator. The fix is to retain the original query in the retrieval set and treat the step-back result as a recall boost, not a replacement.

Caching is non-obvious with stochastic LLMs. At temperature > 0 the rewrite is non-deterministic and the cache key has to include a sample seed if you want hits. Most teams set temperature 0 for the transformation step exactly because it makes caching tractable; reserve temperature > 0 for the generation step where diversity actually helps.

Don’t run transformations and a query router in opposition. A query router (cheap classifier → “send this to the FAQ index, not the docs index”) and a query transformation (LLM → “here are three variants”) have overlapping responsibilities; the LLM can route a query to a useless index and then helpfully generate three variants for that useless index. Compose them deliberately: route first, transform after.

Evaluate against the plain query. Query transformations are easy to add and hard to validate. Recall@k on a labeled eval set is the only honest measure. The published numbers in framework blog posts are typically on MS MARCO or BEIR; your domain may behave differently, and the gain is often smaller than the marketing suggests. The same eval discipline that the reranking article called for applies one stage earlier.
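
Recall@k itself is a few lines; the hard part is labeling (query, relevant-doc-ids) pairs from your own traffic. A sketch, where retrieve_plain and retrieve_transformed are assumed to return ranked lists of document IDs:

python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)

def compare(eval_set, retrieve_plain, retrieve_transformed, k: int = 200):
    # eval_set: list of (query, set of relevant doc ids) pairs you labeled yourself.
    plain = sum(recall_at_k(retrieve_plain(q), rel, k) for q, rel in eval_set) / len(eval_set)
    xform = sum(recall_at_k(retrieve_transformed(q), rel, k) for q, rel in eval_set) / len(eval_set)
    return plain, xform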

Further reading

  • Reranking: Cross-Encoders and Cascades — the precision stage that pairs with query transformations to form the dominant production retrieval shape. Transformations widen recall; the reranker filters precision.
  • Hybrid Search: BM25 Meets Dense Vectors — the reciprocal rank fusion primitive reused here for multi-query result merging. RRF is one tool; it works across retrievers and across queries against the same retriever.
  • Chunking Strategies for Retrieval — the chunk-size decision interacts with HyDE in a subtle way: hypothetical documents land in chunk-shaped neighborhoods, and the chunk size you indexed at determines what HyDE actually pulls.
  • Text Embeddings: Turning Meaning into Geometry — the geometry of why query and document embeddings live in different regions of the same space, which is the entire reason this preprocessing layer exists.