$ cat ai-engineering/rag-evaluation.md

RAG Evaluation: Recall, Faithfulness, and Answer Quality

Retrieval metrics, generation metrics, and the judge problem: how to evaluate a RAG pipeline end-to-end with recall@k, faithfulness, and Ragas.

Jatin Bansal@blog:~/ai-engineering$ open rag-evaluation

A team ships a RAG system. They eyeball a dozen queries, the answers look right, and they cut the release. Three weeks later the head of support pings them: half the answers cite the wrong runbook, and the other half are confidently wrong about which one is current. The team scrambles. Was it the new embedding model? The chunker change? The reranker pool size? They have no idea, because they never measured anything. Every dial they touched between launch and incident is a suspect, and the only way to pin the blame is to build the evaluation harness they should have built before launch. RAG evaluation is the difference between “the demo worked” and “the system works.”

Opening bridge

The query-transformations article ended on a hard line: recall@k on a labeled eval set is the only honest measure. Every retrieval article in this series has gestured at the same point — the reranking piece called out that “a reranker cannot fix recall” and only an eval set tells you which side of the line you’re on; the vector-database piece warned that ANN recall degrades silently when you tune ef_search. Today’s piece is the one that cashes those promises: what the eval set actually contains, what numbers you compute from it, and how to extend the same discipline past retrieval into the generation step.

What “RAG evaluation” actually means

A RAG pipeline has two services in a trench coat: a retriever and a generator. Each fails in different ways, each is measured with different metrics, and the joint behavior is not the product of the two component scores. Evaluation has to mirror that shape — it is layered, not a single number.

At the retrieval layer you ask: of the documents the system was supposed to surface for this query, how many did it actually return, and where in the ranking did they land? These are classical IR questions, with classical IR metrics — recall@k, precision@k, MRR, nDCG. They are deterministic given a labeled set; two runs on the same index and queries produce the same numbers.

At the generation layer you ask: given the documents the system did surface, is the produced answer faithful to them, does it actually answer the question, and does it leave anything important out? These are reference-comparison questions, often delegated to an LLM judge because the answers are free-form text. They are statistical, not deterministic; the same prompt evaluated twice can move a point or two on faithfulness depending on the judge’s sample.

Both layers matter, and the layers can hide each other’s bugs. A pipeline can have great recall but produce hallucinated answers (generator over-extrapolates beyond the retrieved passages). It can have perfect faithfulness scores but score badly on end-user satisfaction (the right passage was never retrieved, so the generator faithfully answers the wrong question with the closest thing it found). End-to-end metrics catch the joint behavior; component metrics localize the blame.

The distributed-systems parallel

This is the test pyramid plus the SLI/SLO layering familiar from production observability. Component-level metrics — recall@k, latency at the retriever — are SLIs: cheap to compute, reproducible, useful for alerting on regressions. End-to-end metrics — answer correctness, user satisfaction — are SLOs and customer-facing KPIs: more expensive, statistical, what the business actually cares about.

The deeper parallel is closer to EXPLAIN ANALYZE vs client-side response time. EXPLAIN tells you the index was used, the plan was good, the buffers were hit. The client still might experience a slow query because the result was serialized poorly, or the application processed it incorrectly. Retrieval metrics are EXPLAIN; generation metrics are client-side. You need both, and the gap between them is where the interesting bugs live.

One last parallel worth keeping in mind: deterministic component tests are like unit tests, while LLM-judged generation evals are closer to chaos engineering experiments. You are not asserting “this output equals that output”; you are making statistical claims about the system’s behavior over a distribution of inputs, with measurement noise from the judge baked into the signal. The right unit of analysis is the trend over the eval set across many runs, not the score on any individual sample.

Retrieval metrics

Assume a labeled eval set: a list of queries, each annotated with the IDs of the documents (or chunks) that should be retrieved to answer the query. For each query, the retriever returns a ranked list of candidate IDs. The classical metrics fall out:

Recall@k: of the gold documents, what fraction appear in the top k returned? recall@k = |relevant ∩ top_k| / |relevant|. The single most important retrieval number for RAG, because the LLM cannot answer from documents you didn’t retrieve.
Precision@k: of the top k, what fraction are gold? precision@k = |relevant ∩ top_k| / k. Matters less than recall in RAG, because the reranker and the generator both discard irrelevant passages downstream.
MRR (Mean Reciprocal Rank): averaged over the eval set, 1 / rank_of_first_relevant. Rewards getting the first gold passage to rank 1. Useful when there’s typically one “right” passage per query.
nDCG@k (Normalized Discounted Cumulative Gain): weights gains by position with a logarithmic discount; if you have graded relevance (some passages more useful than others), this is the metric that uses the gradations.
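
To make the definitions above concrete, here is a small worked example on one query. The document IDs and gold labels are made up for illustration; only the formulas come from the list above.

```python
def precision_at_k(returned: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k slots occupied by gold documents."""
    if k <= 0:
        return 0.0
    return len(set(returned[:k]) & relevant) / k


# Hypothetical ranked result list and gold labels for a single query.
returned = ["d7", "d2", "d9", "d4", "d1"]
relevant = {"d2", "d4", "d8"}
k = 5

hits = set(returned[:k]) & relevant                 # gold docs that made the top k
recall = len(hits) / len(relevant)                  # 2 of 3 gold docs surfaced
precision = precision_at_k(returned, relevant, k)   # 2 of 5 slots are gold
first_hit_rank = next(i + 1 for i, d in enumerate(returned) if d in relevant)
reciprocal_rank = 1.0 / first_hit_rank              # first gold doc sits at rank 2

print(f"recall@{k}={recall:.3f} precision@{k}={precision:.3f} rr={reciprocal_rank:.3f}")
# prints: recall@5=0.667 precision@5=0.400 rr=0.500
```

Note that "d8" never appears in the list, so no value of k within these five results can push recall above 2/3.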

Recall is the floor: if your retriever’s recall@retrieve_k is 0.6, then 40% of the gold passages never reach the generator, and no downstream prompt work can recover the questions that depend on them. MRR and nDCG matter once recall is acceptable; they tell you whether the right passage landed inside the model’s effective attention budget, which on a long-context model is empirically smaller than its advertised window thanks to the lost-in-the-middle effect (Liu et al., 2023). See the context-engineering article for the AOT vs JIT trade-offs that fall out of that attention-budget constraint.

A practical convention: report recall@retrieve_k (the candidate count handed to the reranker, usually 100–400) and recall@k (the count fed to the LLM, usually 5–25) as two separate numbers. The first measures the bi-encoder; the second measures the full retrieval cascade including reranking and any query transformations.
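
A minimal sketch of that two-number convention, assuming you can capture both the first-stage candidate list and the reranked list for each query. The function name cascade_recall, the dictionary keys, and the sample IDs are illustrative, not part of any library.

```python
def recall_at_k(returned: list[str], relevant: set[str], k: int) -> float:
    if not relevant:
        return 0.0
    return len(set(returned[:k]) & relevant) / len(relevant)


def cascade_recall(
    candidates: list[str],   # ranked output of the first-stage retriever
    reranked: list[str],     # ranked output after reranking
    relevant: set[str],
    retrieve_k: int,
    k: int,
) -> dict[str, float]:
    """Report first-stage and final recall side by side for one query."""
    return {
        "first_stage_recall": recall_at_k(candidates, relevant, retrieve_k),
        "final_recall": recall_at_k(reranked, relevant, k),
    }


# Hypothetical query with three gold documents.
relevant = {"a", "b", "c"}
candidates = ["x", "a", "y", "b", "z", "w"]   # retrieve_k = 6
reranked = ["a", "x", "b", "y"]               # only the top k = 2 reach the LLM

report = cascade_recall(candidates, reranked, relevant, retrieve_k=6, k=2)
print(report)
```

Here "c" is lost at the first stage (an embedding or chunking problem), while "b" survives retrieval and is lost at the rerank cutoff. Reporting only the final number would hide which stage to fix.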

Generation metrics

Once the retriever has surfaced a context set and the generator has produced an answer, four questions matter:

Faithfulness — does every claim in the answer follow from the retrieved context? A 0.6 score means roughly 40% of the answer’s claims are not supported by the passages the model was given. Faithfulness is the operationalization of “no hallucinations against the source” and is the single most important generation-side number for RAG. The reference paper that introduced the Ragas formulation is Es et al., 2023.
Answer relevance — does the answer actually address the question? A model can be perfectly faithful and produce a long, on-topic-adjacent ramble that never directly answers what was asked. Ragas implements this by generating N plausible questions from the answer and computing average cosine similarity to the original question.
Context precision — of the contexts the retriever surfaced, how many were actually useful for answering? Order-sensitive: a useful chunk at rank 1 scores higher than the same chunk at rank 10. Catches “we returned the gold passage but buried it behind irrelevant noise.”
Context recall — does the retrieved context cover everything the reference answer needs? Requires a ground-truth answer to compare against; the judge decomposes the reference into atomic claims and checks each against the retrieved context.

The first two are reference-free (they only need the question, the context, and the model’s answer). Context precision comes in reference-free and reference-based variants, while context recall always needs a reference to compare against. Reference-free metrics are cheaper to maintain (no ground-truth answers to write) but they bound your eval to what the judge can verify from the retrieved passages: they cannot catch missing information that should have been retrieved. That blind spot is exactly what reference-based context recall is designed to plug.
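
Once the judge has emitted its verdicts, the scores themselves are plain arithmetic. The sketch below assumes the judge has already produced one boolean per claim (for faithfulness) or one boolean per retrieved chunk in rank order (for context precision). It uses the standard average-precision form to get the order sensitivity described above; the exact formula a given Ragas version uses may differ in detail.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Share of the answer's claims that the judge marked as supported."""
    if not claim_supported:
        return float("nan")  # no claims extracted: undefined, not zero
    return sum(claim_supported) / len(claim_supported)


def context_precision(chunk_useful: list[bool]) -> float:
    """Average precision over per-chunk usefulness verdicts, in rank order."""
    hits = 0
    running = 0.0
    for rank, useful in enumerate(chunk_useful, start=1):
        if useful:
            hits += 1
            running += hits / rank   # precision at this rank
    return running / hits if hits else 0.0


# The same single useful chunk, at rank 1 versus buried at rank 3.
print(context_precision([True, False, False]))
print(context_precision([False, False, True]))
print(faithfulness_score([True, True, False, True, False]))
```

The first call scores higher than the second even though both lists contain exactly one useful chunk, which is the "buried behind noise" signal.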

The judge problem

Generation metrics are almost universally computed by an LLM-as-judge. The metric prompt asks a model to decompose the answer into claims, decide whether each claim is supported, and emit a structured verdict. The numeric score is a function of those verdicts. This works well in aggregate and is wrong in characteristic ways on individual samples — read the reranking article’s discussion of evaluation discipline and apply the same skepticism here.

Practical things to know about judges:

Position bias. Pairwise judges prefer the first answer shown, sometimes by 5+ points. Mitigate by running each pair twice with positions swapped and averaging. Pointwise judges (the Ragas style) avoid this.
Calibration drift across model versions. Faithfulness scored by Claude Sonnet 4.6 is not the same scale as faithfulness scored by GPT-5.5. Pin the judge model in your eval config and treat a judge upgrade as a metric reset, not a comparable run.
NaN scores on invalid JSON. The biggest reported pain point in Ragas is the judge occasionally returning malformed JSON, which silently produces a NaN for that sample. A few NaNs across a 200-sample eval are noise; a systematic NaN pattern correlated with a query type is a bug in your prompt template. Always log the raw judge outputs alongside the numeric scores so you can debug NaNs without re-running the eval. Schema-constrained structured output eliminates this class of failure at the API boundary — strongly preferred for new judge implementations.
Judge cost is real. A 200-sample eval with four metrics is 800+ judge calls per run, plus the underlying generator calls. Run the cheap deterministic retrieval metrics on every commit; gate the LLM-judged metrics behind a slower CI job or a manual trigger.
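
The position-swap mitigation from the list above can be sketched as follows. The judge signature is an assumption made for illustration (a callable that returns the probability that the first answer shown is better), and biased_stub is a stand-in that simulates a first-position bonus rather than a real model call.

```python
from typing import Callable

# Hypothetical signature: judge(question, first_answer, second_answer)
# returns a preference for the FIRST answer shown, in [0, 1].
PairwiseJudge = Callable[[str, str, str], float]


def debiased_preference(
    judge: PairwiseJudge, question: str, answer_a: str, answer_b: str
) -> float:
    """Preference for answer A, averaged over both presentation orders."""
    a_shown_first = judge(question, answer_a, answer_b)
    b_shown_first = judge(question, answer_b, answer_a)
    # The second run scores B; flip it so both terms express preference for A.
    return (a_shown_first + (1.0 - b_shown_first)) / 2.0


def biased_stub(question: str, first: str, second: str) -> float:
    """Stand-in judge with no real preference and a +0.1 first-position bonus."""
    return 0.5 + 0.1


single_run = biased_stub("q", "answer A", "answer B")
swapped = debiased_preference(biased_stub, "q", "answer A", "answer B")
print(single_run, swapped)
```

A single run reports a spurious preference for whichever answer came first; the swapped average cancels the bonus, at the price of doubling the judge calls for pairwise evals.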

Building the golden set

The hardest part of RAG evaluation is not computing metrics — it is producing the labeled queries to compute them against. Three approaches, in increasing order of effort and quality:

Synthetic. Generate questions from your corpus with an LLM, label the source chunk as the relevant document. Fast to bootstrap, useful as a regression smoke test, systematically biased — the LLM generates the kinds of questions it knows how to answer well, which is not the distribution real users send. Ragas ships a test-set generator that produces this in a few lines; use it as starter fuel, not as your release gate.
Replayed user traffic. Sample real queries from logs, have a human (or a stronger model) label which retrieved documents were actually relevant. Captures the real distribution including the weird tail. Slow to grow. The right backbone for any system that has been in production for more than a few weeks.
Domain-expert authored. A subject expert writes 50–200 questions covering the surface area you care about, with reference answers. Highest quality, lowest variance, most expensive. Worth it for high-stakes verticals (legal, medical, finance) where a query family that gets answered wrong has business consequences.

Production teams run all three: synthetic for fast feedback during development, replayed traffic as the baseline regression set, expert-authored as the release gate. The point of the eval set is to be stable: freeze a snapshot, run each configuration at least twice, and check whether your change moved the metric by more than the run-to-run spread or you’re just looking at LLM-judge noise.
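
One way to make "freeze a snapshot" checkable is to fingerprint the eval set and store the fingerprint next to every metric you report. This is a sketch under assumptions: the row layout (query_id, query, relevant_ids) and the function name are hypothetical, chosen to match the fields discussed in this article.

```python
import hashlib
import json


def eval_set_fingerprint(rows: list[dict]) -> str:
    """Stable short hash of an eval set, independent of row and label order."""
    canonical_rows = [
        {
            "query_id": r["query_id"],
            "query": r["query"],
            "relevant_ids": sorted(r["relevant_ids"]),
        }
        for r in sorted(rows, key=lambda r: r["query_id"])
    ]
    payload = json.dumps(canonical_rows, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Hypothetical two-query eval set.
rows = [
    {"query_id": "q1", "query": "How do I rotate the API key?", "relevant_ids": ["doc-12", "doc-3"]},
    {"query_id": "q2", "query": "Which runbook covers failover?", "relevant_ids": ["doc-40"]},
]
print(eval_set_fingerprint(rows))
```

Two metric runs are only comparable if their fingerprints match; a changed fingerprint means the eval set moved, not necessarily the system.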

Code: a hand-rolled retrieval-metrics harness in Python

Retrieval metrics need no model calls, just set arithmetic on IDs. Install nothing beyond the standard library; you already have the index. The function below computes recall@k, MRR, and nDCG@k for a single query, then aggregates across the eval set.

python

import math
from dataclasses import dataclass

@dataclass
class EvalQuery:
    query_id: str
    query: str
    relevant_ids: set[str]              # gold-labeled documents
    graded: dict[str, int] | None = None  # optional: id -> 0/1/2/3 relevance

def recall_at_k(returned: list[str], relevant: set[str], k: int) -> float:
    if not relevant:
        return 0.0
    return len(set(returned[:k]) & relevant) / len(relevant)

def reciprocal_rank(returned: list[str], relevant: set[str]) -> float:
    for i, doc_id in enumerate(returned):
        if doc_id in relevant:
            return 1.0 / (i + 1)
    return 0.0

def ndcg_at_k(returned: list[str], graded: dict[str, int], k: int) -> float:
    dcg = sum(
        (2 ** graded.get(doc_id, 0) - 1) / math.log2(i + 2)
        for i, doc_id in enumerate(returned[:k])
    )
    ideal = sorted(graded.values(), reverse=True)[:k]
    idcg = sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

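def evaluate_retriever(eval_set: list[EvalQuery], retrieve, k: int = 10) -> dict:
    # `retrieve(query, top_k=...)` is expected to return ranked hits exposing `.id`.
    if not eval_set:
        raise ValueError("eval_set is empty; nothing to evaluate")
    recalls, rrs, ndcgs = [], [], []
    for q in eval_set:
        # Fetch at least 100 candidates so MRR can see ranks beyond k;
        # recall and nDCG are still cut off at k.
        returned = [hit.id for hit in retrieve(q.query, top_k=max(k, 100))]
        recalls.append(recall_at_k(returned, q.relevant_ids, k))
        rrs.append(reciprocal_rank(returned, q.relevant_ids))
        if q.graded:
            ndcgs.append(ndcg_at_k(returned, q.graded, k))
    return {
        f"recall@{k}": sum(recalls) / len(recalls),
        "mrr": sum(rrs) / len(rrs),
        # None when no query in the set carries graded relevance labels.
        f"ndcg@{k}": sum(ndcgs) / len(ndcgs) if ndcgs else None,
    }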

This is what your CI runs on every retriever change. It costs nothing beyond the ANN lookups, and it catches the obvious regressions — recall dropping by 5 points after a chunker change, MRR collapsing after an embedding-model swap. Pair it with a fixed eval set checked into version control so the numbers are comparable across commits.
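
A CI gate on top of those numbers can be as small as the sketch below. It assumes a baseline metrics dictionary stored alongside the eval set and a fixed tolerance; the function name, the 0.02 default, and the sample numbers are illustrative choices, not recommendations.

```python
def regression_failures(
    baseline: dict, current: dict, tolerance: float = 0.02
) -> list[str]:
    """List metrics that dropped by more than `tolerance` against the baseline."""
    failures = []
    for name, base in baseline.items():
        now = current.get(name)
        if base is None or now is None:
            continue  # metric not computed on one side (e.g. no graded labels)
        if now < base - tolerance:
            failures.append(f"{name}: {base:.3f} -> {now:.3f}")
    return failures


# Hypothetical numbers: baseline from main, current from the branch under test.
baseline = {"recall@10": 0.82, "mrr": 0.61, "ndcg@10": None}
current = {"recall@10": 0.76, "mrr": 0.60, "ndcg@10": None}

failures = regression_failures(baseline, current)
for line in failures:
    print("REGRESSION", line)
# In CI you would exit non-zero when `failures` is non-empty.
```

Here the 6-point recall drop trips the gate while the 1-point MRR wobble stays inside the tolerance.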

Code: generation metrics with Ragas in Python

For the generation side, hand-rolling LLM-as-judge prompts is doable but reinvents the wheel. Ragas is the de-facto standard library. The example below also uses the LangChain Anthropic integration for the judge, so install both: pip install ragas langchain-anthropic. The current API uses SingleTurnSample + EvaluationDataset + evaluate():

python

from ragas import EvaluationDataset, evaluate
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import (
    Faithfulness,
    ResponseRelevancy,
    LLMContextPrecisionWithoutReference,
    LLMContextRecall,
)
from ragas.llms import LangchainLLMWrapper
from langchain_anthropic import ChatAnthropic

judge = LangchainLLMWrapper(ChatAnthropic(model="claude-sonnet-4-6"))

samples = [
    SingleTurnSample(
        user_input="When does the dashboard auto-refresh?",
        retrieved_contexts=[
            "Grafana panels refresh on the interval set in the dashboard "
            "settings, default 30 seconds, configurable per panel.",
            "Prometheus scrape interval defaults to 15 seconds.",
        ],
        response="By default the dashboard auto-refreshes every 30 seconds.",
        reference="The dashboard refreshes every 30 seconds by default; "
                  "the interval is configurable per panel.",
    ),
    # ...more samples
]

dataset = EvaluationDataset(samples=samples)

result = evaluate(
    dataset=dataset,
    metrics=[
        Faithfulness(llm=judge),
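        # ResponseRelevancy also needs an embedding model for its cosine-similarity
        # step; pass one via evaluate(embeddings=...) if the library default
        # does not suit your setup.
        ResponseRelevancy(llm=judge),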
        LLMContextPrecisionWithoutReference(llm=judge),
        LLMContextRecall(llm=judge),
    ],
)

print(result.to_pandas())

A few things to flag. The reference field is optional and only needed for the reference-based context recall metric. If you don’t have ground-truth answers, drop the field, remove LLMContextRecall from the metrics list, and keep LLMContextPrecisionWithoutReference, Faithfulness, and ResponseRelevancy, all of which are reference-free. The judge is wrapped via LangchainLLMWrapper; you can swap in any LangChain chat model (OpenAI, Vertex, Bedrock) the same way. Calling to_pandas() on the result gives a DataFrame, one row per sample, columns per metric, which is easy to write into a CSV or SQLite eval store keyed on commit SHA so you can diff metric movements across changes.
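
A minimal sketch of such an eval store, using only the standard library. The table layout, function names, and commit SHAs are hypothetical; rows are plain (sample_id, metric, score) tuples, which you would produce by reshaping the per-sample DataFrame.

```python
import math
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS eval_scores (
    commit_sha TEXT NOT NULL,
    sample_id  TEXT NOT NULL,
    metric     TEXT NOT NULL,
    score      REAL,
    PRIMARY KEY (commit_sha, sample_id, metric)
)
"""


def store_scores(conn: sqlite3.Connection, commit_sha: str, rows) -> None:
    """Persist (sample_id, metric, score) rows; NaN scores are stored as NULL."""
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT OR REPLACE INTO eval_scores VALUES (?, ?, ?, ?)",
        [
            (commit_sha, sample_id, metric, None if math.isnan(score) else score)
            for sample_id, metric, score in rows
        ],
    )
    conn.commit()


def mean_by_commit(conn: sqlite3.Connection, metric: str) -> dict:
    """Mean score per commit for one metric; AVG skips the NULL (NaN) rows."""
    cur = conn.execute(
        "SELECT commit_sha, AVG(score) FROM eval_scores "
        "WHERE metric = ? GROUP BY commit_sha",
        (metric,),
    )
    return dict(cur.fetchall())


conn = sqlite3.connect(":memory:")  # use a file path for a persistent store
store_scores(conn, "abc123", [
    ("q1", "faithfulness", 0.5),
    ("q2", "faithfulness", 1.0),
    ("q3", "faithfulness", float("nan")),  # judge produced no usable score
])
store_scores(conn, "def456", [("q1", "faithfulness", 1.0), ("q2", "faithfulness", 1.0)])
means = mean_by_commit(conn, "faithfulness")
print(means)
```

Keeping one row per sample and metric, rather than only the aggregate, is what lets you later ask which specific queries regressed between two commits.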

Code: a TypeScript faithfulness harness

There is no Ragas-equivalent that is widely adopted on the TypeScript side; teams typically write the judge prompt themselves with the Vercel AI SDK or the Anthropic TypeScript SDK. Below is a minimal faithfulness implementation using the Vercel AI SDK’s generateObject for structured output. Install: npm install ai @ai-sdk/anthropic zod.

typescript

import { anthropic } from "@ai-sdk/anthropic";
import { generateObject } from "ai";
import { z } from "zod";

const ClaimVerdicts = z.object({
  claims: z.array(
    z.object({
      claim: z.string(),
      supported: z.boolean(),
      reasoning: z.string(),
    }),
  ),
});

const FAITHFULNESS_PROMPT = `Decompose the ANSWER into atomic factual
claims. For each claim, decide whether it is directly supported by the
CONTEXT. Return one entry per claim, with the verbatim claim text, a
boolean "supported" verdict, and a one-sentence reasoning.`;

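async function faithfulness(
  question: string,
  context: string[],
  answer: string,
): Promise<number> {
  try {
    const { object } = await generateObject({
      model: anthropic("claude-sonnet-4-6"),
      schema: ClaimVerdicts,
      prompt: `${FAITHFULNESS_PROMPT}\n\nQUESTION: ${question}\n\nCONTEXT:\n${context.join(
        "\n---\n",
      )}\n\nANSWER: ${answer}`,
    });
    const total = object.claims.length;
    // No extractable claims: the score is undefined, not zero.
    if (total === 0) return Number.NaN;
    const supported = object.claims.filter((c) => c.supported).length;
    return supported / total;
  } catch (err) {
    // generateObject rejects when the call fails or the output does not match
    // the schema; record the sample as NaN so one bad judge response does not
    // abort the whole Promise.all batch.
    console.error("faithfulness judge failed:", err);
    return Number.NaN;
  }
}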

async function evaluateBatch(samples: Array<{
  question: string; context: string[]; answer: string;
}>) {
  const scores = await Promise.all(
    samples.map((s) => faithfulness(s.question, s.context, s.answer)),
  );
  const valid = scores.filter((s) => !Number.isNaN(s));
  return {
    faithfulness_mean: valid.reduce((a, b) => a + b, 0) / valid.length,
    nan_count: scores.length - valid.length,
  };
}

The shape is the same as Ragas’s: decompose into claims, judge each, average. Reporting the NaN count alongside the mean is non-optional — it surfaces the structured-output-failure mode that Ragas users complain about, and without it a degrading judge looks like a degrading system. The same pattern extends to answer relevance (generate N questions from the answer, compute cosine similarity to the original) and context precision (judge each retrieved chunk for usefulness, compute average precision over the verdicts).

Trade-offs, failure modes, gotchas

Component metrics are necessary but not sufficient. A pipeline can hit recall@10 = 0.95 and still produce bad answers if the generator hallucinates over the retrieved context or if the right passage is at rank 9 and gets lost-in-the-middle. End-to-end metrics catch what component metrics miss. The discipline is to ladder up: deterministic retrieval metrics on every commit, LLM-judged metrics nightly or pre-release, human review on a sampled subset weekly.

Don’t compare scores across judge models. A faithfulness number from Sonnet 4.6 is not the same number from Haiku 4.5 or from GPT-5.5. Treat the judge as a fixed instrument. If you must upgrade, dual-run for a window and recompute your baselines, the same way you’d handle a sensor recalibration in a monitoring stack.

Synthetic eval sets are systematically wrong in the same direction. They overweight the kinds of questions the generator finds easy to write, which are exactly the questions the generator finds easy to answer. The first time you compare synthetic-eval scores against real-traffic-eval scores, expect a large gap (15 points would not be unusual); that gap is the distribution mismatch, not progress.

Eval-set rot is real. The corpus changes, the queries users ask change, and the questions that mattered six months ago are not the questions that matter today. A frozen eval set is excellent for regression testing and useless for telling you whether the product still solves the user’s problem. Refresh the eval set on a schedule (monthly is reasonable for a fast-moving product) and keep the old snapshots around for trend analysis.

Multi-hop and aggregation questions break component metrics. Recall@k is poorly defined for a question whose answer requires synthesizing across 5 documents, because no single document is “the” gold passage. For multi-hop, recall@k is still a useful lower bound (each hop’s gold doc had better be in there), but the only honest end-to-end measure is the answer-correctness eval, ideally with reference answers.
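
The "each hop’s gold doc had better be in there" check can be written as a strict, all-or-nothing coverage test. This sketch assumes the eval set labels each hop with its own set of acceptable documents; that labeling scheme and the function name are assumptions, not something the metrics above require.

```python
def all_hops_retrieved(returned: list[str], hops: list[set[str]], k: int) -> bool:
    """True only if every hop has at least one acceptable doc in the top k."""
    top = set(returned[:k])
    return all(top & hop for hop in hops)


# Hypothetical two-hop question: hop 1 identifies the service owner,
# hop 2 finds that owner's on-call runbook.
hops = [{"owners-1", "owners-2"}, {"runbook-7"}]

print(all_hops_retrieved(["owners-2", "misc-3", "runbook-7"], hops, k=3))
print(all_hops_retrieved(["owners-1", "owners-2", "misc-3"], hops, k=3))
```

The second list would score 2/3 under plain recall over the union of gold documents, yet the question is unanswerable because the second hop is missing entirely.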

Statistical significance over single-sample scrutiny. A 2-point faithfulness movement on a 50-sample eval is well inside the judge’s noise floor. Either grow the eval set to 200+ samples, or report bootstrapped confidence intervals and only act on changes that clear the interval. Otherwise you’ll be chasing ghosts and rolling back perfectly fine changes.
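
A paired bootstrap over per-sample scores is one simple way to get such an interval. The sketch below uses only the standard library and assumes both runs scored the same samples in the same order; the resample count, the 95% level, and the sample scores are illustrative defaults.

```python
import random


def bootstrap_diff_ci(
    baseline: list[float],
    candidate: list[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean per-sample difference (candidate - baseline)."""
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    diffs = [c - b for b, c in zip(baseline, candidate)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-sample faithfulness scores from two runs on the same 10 samples.
baseline = [0.8, 1.0, 0.6, 0.9, 1.0, 0.7, 0.5, 1.0, 0.8, 0.9]
candidate = [0.9, 1.0, 0.5, 0.9, 1.0, 0.8, 0.6, 1.0, 0.7, 1.0]

lower, upper = bootstrap_diff_ci(baseline, candidate)
print(f"95% CI for the mean difference: [{lower:.3f}, {upper:.3f}]")
# Act on the change only if the interval excludes zero.
```

Pairing the samples removes per-question difficulty from the comparison, which usually gives a much tighter interval than bootstrapping each run independently.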

Track cost and latency as first-class metrics. A pipeline that scores 0.05 higher on faithfulness but doubles per-query cost is a regression in production even though it’s an improvement on the dashboard. Plot cost-per-query and p95 latency on the same dashboard as quality metrics; the right comparison is always the Pareto frontier, not a single axis.
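
Picking out the Pareto frontier from a handful of candidate configurations takes only a few lines. The configuration names, metric keys, and numbers below are hypothetical; the sketch treats higher faithfulness, lower cost per query, and lower p95 latency as the three axes.

```python
def dominates(a: dict, b: dict) -> bool:
    """True if `a` is at least as good as `b` on every axis and better on one."""
    no_worse = (
        a["faithfulness"] >= b["faithfulness"]
        and a["cost_usd"] <= b["cost_usd"]
        and a["p95_ms"] <= b["p95_ms"]
    )
    strictly_better = (
        a["faithfulness"] > b["faithfulness"]
        or a["cost_usd"] < b["cost_usd"]
        or a["p95_ms"] < b["p95_ms"]
    )
    return no_worse and strictly_better


def pareto_frontier(configs: list[dict]) -> list[dict]:
    """Keep only configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Hypothetical candidates.
configs = [
    {"name": "baseline", "faithfulness": 0.85, "cost_usd": 0.004, "p95_ms": 900},
    {"name": "big-reranker", "faithfulness": 0.90, "cost_usd": 0.008, "p95_ms": 1100},
    {"name": "extra-hop", "faithfulness": 0.84, "cost_usd": 0.009, "p95_ms": 1200},
]

frontier = pareto_frontier(configs)
print([c["name"] for c in frontier])
```

"extra-hop" is worse than "baseline" on every axis and drops out; choosing between the two survivors is a product decision about how much a 5-point faithfulness gain is worth.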