jatin.blog ~ $
$ cat ai-engineering/chunking-strategies.md

Chunking Strategies for Retrieval

Why chunk size is the most undertuned variable in RAG, how recursive, semantic, and structural chunking differ, and when parent-document retrieval beats them all.

Jatin Bansal@blog:~/ai-engineering$ open chunking-strategies

Two teams build a RAG pipeline over the same corpus, with the same embedding model, the same vector database, the same reranker, and the same generator. One ships a system that consistently surfaces the right passage in the top three. The other ships one that hallucinates. The thing that differs between them, more often than anything else, is the chunking policy. Chunking is the most undertuned variable in retrieval.

In the previous article we covered the index structures — HNSW, IVF, ScaNN — that make nearest-neighbor search practical, and treated each indexed row as a black box. Today we pry that box open. The row stores a chunk, and the policy that produces chunks decides what your ANN index can possibly retrieve. A well-tuned HNSW over badly chosen chunks underperforms a flat scan over good ones.

What chunking actually is

A chunk is the smallest unit you commit to a vector index. It does double duty: it is the retrieval atom (what gets returned to a query) and the embedding payload (what the model sees when producing the vector). The two roles pull in opposite directions, and that tension is the whole problem.

  • The retrieval atom wants to be small. Small chunks are precise — the returned text contains the answer with little surrounding noise, which keeps the generator’s context window lean.
  • The embedding payload wants to be coherent. An embedding is a single point summarizing the whole chunk; if the chunk spans five unrelated topics, the average lands nowhere in particular, and the chunk fails to be retrieved for any of those topics.

Chunking is the policy that produces chunks from source documents. Get it right and the rest of the pipeline forgives a lot of sins. Get it wrong and no amount of reranking, query rewriting, or fine-tuning will recover.

The distributed systems parallel

Chunk size is the page size question from storage engines. Pick a Postgres BLCKSZ or InnoDB page size and you trade the same way: small pages give fine-grained random access and tight cache utilization at the cost of more index overhead per byte stored; large pages amortize index overhead but incur read amplification on point queries. The numbers are different, the shape is identical.

Several other parallels are worth naming explicitly, because they predict where common chunking tactics break:

  Chunking concept                     Systems analogue
  Fixed-size chunking                  Block-aligned writes — fast, oblivious to record boundaries
  Recursive splitting                  Variable-length records with delimiter hierarchy
  Semantic chunking                    Content-defined chunking (rsync, restic rolling-hash)
  Parent-document retrieval            Covering index — small key for lookup, large payload returned
  Structural chunking (markdown/AST)   Schema-aware partitioning
  Overlap                              Sliding-window reads to preserve cross-boundary context

The parent-document parallel is the most useful one to internalize. A covering index in a relational database is small enough to fit in memory and points at a larger row payload that’s expensive to scan. Parent-document retrieval applies the same idea to RAG: embed a small, dense child chunk for the lookup, but return the larger parent passage to the generator. The retrieval atom and the embedding payload get to be different sizes — the tension above dissolves.

Mechanics: the chunker zoo

There are five chunking families worth knowing, in roughly increasing sophistication.

1. Fixed-size, character or token

Slice the document every N characters or every N tokens. Optionally overlap each chunk with the previous by some fraction (10–20% is conventional). This is the baseline — fast to compute, deterministic, oblivious to structure. It will cut sentences in half, split a code function across two chunks, and orphan headers from their bodies. Use it only when you genuinely don’t know anything about the input format.

2. Recursive character splitting

The default workhorse of production RAG. Given an ordered list of separators (["\n\n", "\n", ". ", " ", ""]), the splitter tries to keep chunks under a target size by splitting on the strongest separator first, then falling back to weaker ones only when a section is still too large. The effect is that paragraph boundaries are preserved when possible, sentence boundaries when not, and word boundaries as a last resort. LangChain’s RecursiveCharacterTextSplitter is the canonical implementation; LlamaIndex’s SentenceSplitter does something similar with a sentence-tokenizer twist.

3. Structural chunking

Use the document’s own structure. For markdown: split on headers and treat each section as a chunk, carrying the header path forward in the metadata so retrieval can return “Section 4.2.1 > Tuning ef_search.” For HTML: split on block-level elements. For source code: use an AST parser (Python’s ast, tree-sitter for any language) to split by function or class. For JSON/XML: split by object boundary. This is the highest-fidelity option when the input has structure that maps to retrieval intent.
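
For markdown specifically, LangChain ships a header-aware splitter. A minimal sketch (the header-depth mapping and the file name are illustrative):

python
# pip install langchain-text-splitters
from langchain_text_splitters import MarkdownHeaderTextSplitter

# the header path lands in metadata, so a hit can cite "h1 > h2 > h3"
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")],
)

with open("article.md") as f:  # illustrative file
    sections = splitter.split_text(f.read())

for s in sections[:3]:
    print(s.metadata, repr(s.page_content[:60]))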

4. Semantic chunking

Embed every sentence, walk through the document, and start a new chunk whenever cosine similarity between consecutive sentences drops below a threshold. The intuition is that semantic boundaries are real, even when no markup signals them. The cost is that you pay one embedding call per sentence just to chunk the corpus, before you embed anything for retrieval. Worth it for unstructured prose where structural signals are absent; usually overkill when good markup exists. LlamaIndex’s SemanticSplitterNodeParser is the reference implementation.

5. Parent-document and sentence-window retrieval

Decouple the embedding payload from the returned payload. Parent-document retrieval embeds small child chunks (say, 256 tokens each), indexes them, but stores a parent_id pointing back to a larger parent chunk (1–2k tokens). At query time the top-k child chunks are retrieved, deduplicated to their parents, and the parents are what get injected into the prompt. Sentence-window retrieval is a variant: embed each sentence, and at retrieval time return a window of ±N sentences around the match.

Both patterns are the right answer more often than people realize, because they collapse the retrieval-atom-versus-embedding-payload tension into something the user explicitly controls.
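
The bookkeeping is small enough to sketch without a framework. Sizes below are characters rather than the token counts above, and the ID scheme is illustrative; LangChain's ParentDocumentRetriever packages the same idea:

python
# pip install langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)

def build_parent_child(doc: str) -> tuple[list[dict], dict[int, str]]:
    """Children get embedded and indexed; parents get returned to the generator."""
    parent_store: dict[int, str] = {}
    child_records: list[dict] = []
    for pid, parent in enumerate(parent_splitter.split_text(doc)):
        parent_store[pid] = parent
        for child in child_splitter.split_text(parent):
            child_records.append({"text": child, "parent_id": pid})
    return child_records, parent_store

# query time: retrieve top-k children, dedupe on parent_id,
# inject parent_store[parent_id] into the prompt instead of the child text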

Bonus: contextual retrieval

A 2024 Anthropic post on contextual retrieval demonstrated that prefixing each chunk with a short LLM-generated description of where the chunk sits in the source document (50–100 tokens) reduces retrieval failures by 35–49% before any reranking. It’s expensive at index time — one LLM call per chunk — and cheap at query time, and it composes with any of the chunking strategies above. Worth knowing when the corpus is small, static, and high-value enough to justify the build cost.
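
The index-time step is one prompt per chunk. A sketch assuming the OpenAI SDK; the model choice and prompt wording are illustrative, not Anthropic's originals:

python
# pip install openai
from openai import OpenAI

client = OpenAI()

def contextualize(chunk: str, full_doc: str) -> str:
    """Prefix a chunk with a short LLM-written note on where it sits in the doc."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": (
                f"<document>\n{full_doc}\n</document>\n"
                f"Here is a chunk from that document:\n<chunk>\n{chunk}\n</chunk>\n"
                "In 50-100 tokens, situate this chunk within the document "
                "to improve search retrieval. Reply with the context only."
            ),
        }],
    )
    context = resp.choices[0].message.content.strip()
    return f"{context}\n\n{chunk}"  # embed and index this prefixed text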

Code: from raw to recursive to semantic

Start with a hand-rolled fixed-size chunker as a sanity check. It is rarely what you ship, but it makes the contract exactly visible:

python
# stdlib only — no install required
def fixed_size_chunks(
    text: str,
    chunk_size: int = 1000,
    overlap: int = 150,
) -> list[str]:
    assert 0 <= overlap < chunk_size
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]
typescript
function fixedSizeChunks(
  text: string,
  chunkSize = 1000,
  overlap = 150
): string[] {
  if (overlap < 0 || overlap >= chunkSize) throw new Error("invalid overlap");
  const step = chunkSize - overlap;
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += step) {
    chunks.push(text.slice(i, i + chunkSize));
  }
  return chunks;
}

Now the recursive splitter, which is what 80% of production systems should actually use. The LangChain Python package and the LangChain JS package keep the API in step.

python
# pip install langchain-text-splitters tiktoken
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=512,          # target tokens per chunk
    chunk_overlap=64,        # ~12% overlap
    separators=["\n\n", "\n", ". ", " ", ""],
)

with open("article.md") as f:
    docs = splitter.create_documents([f.read()])

for d in docs[:3]:
    print(len(d.page_content), repr(d.page_content[:80]))
typescript
// npm install @langchain/textsplitters
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { readFileSync } from "node:fs";

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 512,       // characters here, not tokens — pick consciously
  chunkOverlap: 64,
  separators: ["\n\n", "\n", ". ", " ", ""],
});

const text = readFileSync("article.md", "utf-8");
const docs = await splitter.createDocuments([text]);

for (const d of docs.slice(0, 3)) {
  console.log(d.pageContent.length, JSON.stringify(d.pageContent.slice(0, 80)));
}

A few things worth noticing. The Python version measures chunk size in tokens (because the underlying embedding model and context budget are token-counted), via the from_tiktoken_encoder constructor; the JS version defaults to characters and you should wrap it with a token counter if precise token budgets matter. The separators list defines the cut hierarchy — for markdown corpora, prepend ["\n# ", "\n## ", "\n### "] to keep section headers intact.

For semantic chunking, the basic algorithm is small enough to write directly. Embed each sentence, take a rolling similarity, cut when the similarity to the previous sentence drops below a percentile threshold.

python
# pip install openai numpy
from openai import OpenAI
import numpy as np
import re

client = OpenAI()

def split_sentences(text: str) -> list[str]:
    # crude — swap for spaCy or pysbd in production
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def embed_batch(sentences: list[str]) -> np.ndarray:
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=sentences,
    )
    return np.array([d.embedding for d in resp.data])

def semantic_chunks(text: str, breakpoint_percentile: int = 90) -> list[str]:
    sents = split_sentences(text)
    if len(sents) < 2:
        return sents
    vecs = embed_batch(sents)
    # similarity between consecutive sentence pairs
    sims = [
        float(np.dot(vecs[i], vecs[i + 1]) /
              (np.linalg.norm(vecs[i]) * np.linalg.norm(vecs[i + 1])))
        for i in range(len(sents) - 1)
    ]
    # cut wherever similarity dips below the chosen percentile (i.e. a big drop)
    threshold = np.percentile(sims, 100 - breakpoint_percentile)
    chunks, current = [], [sents[0]]
    for i, sim in enumerate(sims):
        if sim < threshold:
            chunks.append(" ".join(current))
            current = [sents[i + 1]]
        else:
            current.append(sents[i + 1])
    chunks.append(" ".join(current))
    return chunks

The TypeScript shape is identical with the OpenAI Node SDK; the only meaningful difference is that you’ll want a real sentence tokenizer (compromise or wink-nlp) instead of a regex. For production semantic chunking on real corpora, lean on LlamaIndex’s SemanticSplitterNodeParser rather than rolling your own — there are edge cases (heading-only sentences, very short paragraphs, language detection) that the libraries already handle.

Trade-offs, failure modes, gotchas

Chunk size and the generator are coupled. A 2k-token chunk feels generous until you remember that top-k retrieval means k of them go into the prompt. With k=5 and a 2k chunk size you’ve spent 10k tokens before the system prompt or query. Pick chunk size knowing how many chunks the generator will see and how much headroom you need for instructions and the answer.
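
A back-of-envelope check makes the coupling concrete (window and overhead numbers are illustrative):

python
def answer_headroom(context_window: int, chunk_tokens: int, k: int,
                    system_tokens: int, query_tokens: int) -> int:
    """Tokens left for the model's answer once retrieval fills the prompt."""
    return context_window - (k * chunk_tokens + system_tokens + query_tokens)

print(answer_headroom(16_384, 2_000, 5, 500, 100))  # 5784: cramped
print(answer_headroom(16_384,   512, 5, 500, 100))  # 13224: comfortable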

Silent embedding truncation. Every embedding model has a context limit (text-embedding-3-small is 8191 tokens). Chunks that exceed it are silently truncated by the provider — you get an embedding for a different document than the one you stored. Always count tokens before sending, and assert the count against the model’s limit.
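
The guard is one tiktoken call. A sketch assuming cl100k_base, the tokenizer used by the text-embedding-3 family:

python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
EMBED_LIMIT = 8191  # text-embedding-3-small / -large input limit

def assert_embeddable(chunk: str) -> str:
    n = len(enc.encode(chunk))
    assert n <= EMBED_LIMIT, f"chunk is {n} tokens; provider truncates at {EMBED_LIMIT}"
    return chunk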

Boundary-straddling facts. “The deployment failed at 03:14 UTC because the readiness probe returned 503.” If the timestamp and the cause end up in different chunks, both chunks individually answer “what happened” badly. Overlap reduces this, but only papers over the symptom. The honest fixes are recursive or structural splitting (keep semantic units intact) and parent-document retrieval (embed small, return the whole surrounding section).

Overlap inflates everything. 20% overlap means 20% more chunks, 20% more embedding cost, 20% more index storage, and a near-guarantee of duplicate hits in top-k results that you’ll need to dedupe in the generator step. It’s a real trade, not a free win.
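
The dedupe itself is small. A sketch that keeps the best-scoring hit per overlapping character span, assuming each hit carries doc_id/start/end/score metadata (illustrative keys, not a library contract):

python
def _overlap_frac(a: dict, b: dict) -> float:
    inter = max(0, min(a["end"], b["end"]) - max(a["start"], b["start"]))
    return inter / max(1, min(a["end"] - a["start"], b["end"] - b["start"]))

def dedupe_hits(hits: list[dict], max_overlap: float = 0.5) -> list[dict]:
    """Keep the best-scoring hit per overlapping span within the same document."""
    kept: list[dict] = []
    for h in sorted(hits, key=lambda x: -x["score"]):
        if not any(k["doc_id"] == h["doc_id"] and _overlap_frac(k, h) > max_overlap
                   for k in kept):
            kept.append(h)
    return kept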

Small chunks make multi-hop reasoning worse, not better. Multi-hop questions require the model to combine information that lives in two or more places. Smaller chunks scatter that information across more retrieval candidates, increasing the chance that at least one needed hop is missed in top-k. Larger chunks (or parent-document retrieval) keep co-occurring facts co-located.

Structural chunking is fragile to upstream changes. A markdown splitter that relies on ## headers breaks the day someone writes a doc with **Bold Title** instead. AST-based code splitters break on parse errors. Both modes degrade gracefully if you chain back to a recursive splitter as the fallback path; both fail silently if you don’t.
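
The fallback chain is a few lines. A sketch for markdown, with an illustrative per-section size ceiling:

python
# pip install langchain-text-splitters
from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)
fallback = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
MAX_CHARS = 4000  # illustrative ceiling for a single section

def chunk_markdown(text: str) -> list[str]:
    sections = md_splitter.split_text(text)
    if len(sections) <= 1:  # no headers found: the structural signal is absent
        return fallback.split_text(text)
    out: list[str] = []
    for s in sections:
        if len(s.page_content) > MAX_CHARS:  # oversized section: re-split it
            out.extend(fallback.split_text(s.page_content))
        else:
            out.append(s.page_content)
    return out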

Semantic chunking is O(N) embeddings before you’ve embedded anything for retrieval. On a corpus of millions of sentences this is the dominant cost. Reserve semantic chunking for high-value, low-volume corpora, or use it only on the long-tail documents that recursive splitting handles badly.

Chunking decisions are not portable across embedding models. A chunk size tuned for an 8191-token-limit asymmetric retrieval model is not optimal for a 512-token symmetric sentence model. Re-tune when you change models — and you will change models — using a labeled retrieval eval set, not vibes.

Further reading