Chunking Strategies for Retrieval
Why chunk size is the most undertuned variable in RAG, how recursive, semantic, and structural chunking differ, and when parent-document retrieval beats them all.
Two teams build a RAG pipeline over the same corpus, with the same embedding model, the same vector database, the same reranker, and the same generator. One ships a system that consistently surfaces the right passage in the top three. The other ships one that hallucinates. More often than anything else, what differs between them is the chunking policy. Chunking is the most undertuned variable in retrieval.
In the previous article we covered the index structures — HNSW, IVF, ScaNN — that make nearest-neighbor search practical, and treated each indexed row as a black box. Today we pry that box open. The row stores a chunk, and the policy that produces chunks decides what your ANN index can possibly retrieve. A well-tuned HNSW over badly chosen chunks underperforms a flat scan over good ones.
What chunking actually is
A chunk is the smallest unit you commit to a vector index. It does double duty: it is the retrieval atom (what gets returned to a query) and the embedding payload (what the model sees when producing the vector). The two roles pull in opposite directions, and that tension is the whole problem.
- The retrieval atom wants to be small. Small chunks are precise — the returned text contains the answer with little surrounding noise, which keeps the generator’s context window lean.
- The embedding payload wants to be coherent. An embedding is a single point summarizing the whole chunk; if the chunk spans five unrelated topics, the average lands nowhere in particular, and the chunk fails to be retrieved for any of those topics.
Chunking is the policy that produces chunks from source documents. Get it right and the rest of the pipeline forgives a lot of sins. Get it wrong and no amount of reranking, query rewriting, or fine-tuning will recover.
The distributed systems parallel
Chunk size is the page-size question from storage engines. Pick a Postgres BLCKSZ or an InnoDB page size and you face the same trade: small pages give fine-grained random access and tight cache utilization at the cost of more index overhead per byte stored; large pages amortize index overhead but incur read amplification on point queries. The numbers are different, but the shape is identical.
Several other parallels are worth naming explicitly, because they predict where common chunking tactics break:
| Chunking concept | Systems analogue |
|---|---|
| Fixed-size chunking | Block-aligned writes — fast, oblivious to record boundaries |
| Recursive splitting | Variable-length records with delimiter hierarchy |
| Semantic chunking | Content-defined chunking (rsync, restic rolling-hash) |
| Parent-document retrieval | Covering index — small key for lookup, large payload returned |
| Structural chunking (markdown/AST) | Schema-aware partitioning |
| Overlap | Sliding-window reads to preserve cross-boundary context |
The parent-document parallel is the most useful one to internalize. A covering index in a relational database is small enough to fit in memory and points at a larger row payload that’s expensive to scan. Parent-document retrieval applies the same idea to RAG: embed a small, dense child chunk for the lookup, but return the larger parent passage to the generator. The retrieval atom and the embedding payload get to be different sizes — the tension above dissolves.
Mechanics: the chunker zoo
There are five chunking families worth knowing, in roughly increasing sophistication.
1. Fixed-size, character or token
Slice the document every N characters or every N tokens. Optionally overlap each chunk with the previous by some fraction (10–20% is conventional). This is the baseline — fast to compute, deterministic, oblivious to structure. It will cut sentences in half, split a code function across two chunks, and orphan headers from their bodies. Use it only when you genuinely don’t know anything about the input format.
2. Recursive character splitting
The default workhorse of production RAG. Given an ordered list of separators (["\n\n", "\n", ". ", " ", ""]), the splitter tries to keep chunks under a target size by splitting on the strongest separator first, then falling back to weaker ones only when a section is still too large. The effect is that paragraph boundaries are preserved when possible, sentence boundaries when not, and word boundaries as a last resort. LangChain’s RecursiveCharacterTextSplitter is the canonical implementation; LlamaIndex’s SentenceSplitter does something similar with a sentence-tokenizer twist.
3. Structural chunking
Use the document’s own structure. For markdown: split on headers and treat each section as a chunk, carrying the header path forward in the metadata so retrieval can return “Section 4.2.1 > Tuning ef_search.” For HTML: split on block-level elements. For source code: use an AST parser (Python’s ast, tree-sitter for any language) to split by function or class. For JSON/XML: split by object boundary. This is the highest-fidelity option when the input has structure that maps to retrieval intent.
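The markdown case fits in a few lines with LangChain's MarkdownHeaderTextSplitter; the sample document and the header-key names here are illustrative:

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

md = """# Tuning
## Index parameters
### Tuning ef_search
Raise ef_search to trade latency for recall.
"""

# Each tuple maps a header prefix to the metadata key that stores it.
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")],
)
for section in splitter.split_text(md):
    # metadata carries the header path: {'h1': 'Tuning', 'h2': 'Index parameters', ...}
    print(section.metadata, "->", section.page_content)
```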
4. Semantic chunking
Embed every sentence, walk through the document, and start a new chunk whenever cosine similarity between consecutive sentences drops below a threshold. The intuition is that semantic boundaries are real, even when no markup signals them. The cost is that you pay one embedding call per sentence just to chunk the corpus, before you embed anything for retrieval. Worth it for unstructured prose where structural signals are absent; usually overkill when good markup exists. LlamaIndex’s SemanticSplitterNodeParser is the reference implementation.
5. Parent-document and sentence-window retrieval
Decouple the embedding payload from the returned payload. Parent-document retrieval embeds small child chunks (say, 256 tokens each), indexes them, but stores a parent_id pointing back to a larger parent chunk (1–2k tokens). At query time the top-k child chunks are retrieved, deduplicated to their parents, and the parents are what get injected into the prompt. Sentence-window retrieval is a variant: embed each sentence, and at retrieval time return a window of ±N sentences around the match.
Both patterns are the right answer more often than people realize, because they collapse the retrieval-atom-versus-embedding-payload tension into something the user explicitly controls.
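To make the decoupling concrete, here is a minimal in-memory sketch, assuming the OpenAI SDK for embeddings; the child size, k, and every function name are illustrative rather than any library's API:

```python
# Embed small children, return large parents. Children carry a parent_id
# so top-k hits can be deduplicated back to the passage they came from.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    vecs = np.array([d.embedding for d in resp.data])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize

def index_children(parents: list[str], child_chars: int = 1000):
    """Split each parent into small children, remembering who owns each child."""
    children, parent_ids = [], []
    for pid, parent in enumerate(parents):
        for i in range(0, len(parent), child_chars):
            children.append(parent[i : i + child_chars])
            parent_ids.append(pid)
    return embed(children), parent_ids

def retrieve_parents(query, child_vecs, parent_ids, parents, k: int = 3):
    q = embed([query])[0]
    ranked = np.argsort(child_vecs @ q)[::-1]    # cosine; vectors are unit-length
    seen, out = set(), []
    for i in ranked:
        pid = parent_ids[i]                      # dedupe children to their parents
        if pid not in seen:
            seen.add(pid)
            out.append(parents[pid])             # inject the parent, not the child
        if len(out) == k:
            break
    return out
```

LangChain's ParentDocumentRetriever and LlamaIndex's SentenceWindowNodeParser package the same idea for production use.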
Bonus: contextual retrieval
A 2024 Anthropic post on contextual retrieval demonstrated that prefixing each chunk with a short LLM-generated description of where the chunk sits in the source document (50–100 tokens) reduces retrieval failures by 35–49% before any reranking. It’s expensive at index time — one LLM call per chunk — and cheap at query time, and it composes with any of the chunking strategies above. Worth knowing when the corpus is small, static, and high-value enough to justify the build cost.
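The mechanics are one LLM call per chunk at index time. A minimal sketch with the Anthropic SDK; the prompt wording and model alias are assumptions, not the post's exact setup:

```python
# Prefix each chunk with a short LLM-generated description of where it
# sits in the source document, then embed the combined text.
import anthropic

client = anthropic.Anthropic()

def contextualize(document: str, chunk: str) -> str:
    resp = client.messages.create(
        model="claude-3-5-haiku-latest",  # illustrative model choice
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": (
                f"<document>\n{document}\n</document>\n\n"
                f"<chunk>\n{chunk}\n</chunk>\n\n"
                "In 50-100 tokens, situate this chunk within the document "
                "to improve search retrieval. Answer with the context only."
            ),
        }],
    )
    # The prefixed text is what gets embedded and indexed.
    return resp.content[0].text + "\n\n" + chunk
```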
Code: from raw to recursive to semantic
Start with a hand-rolled fixed-size chunker as a sanity check. It is rarely what you ship, but every part of the contract is visible. The sketch below is token-based via tiktoken; the chunk_fixed name and its defaults are illustrative, not a library API:
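```python
import tiktoken

def chunk_fixed(text: str, chunk_tokens: int = 256, overlap: int = 32) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = chunk_tokens - overlap               # slide by size minus overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start : start + chunk_tokens]))
        if start + chunk_tokens >= len(tokens):
            break                               # last window reached the end
    return chunks
```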
Now the recursive splitter, which is what 80% of production systems should actually use. The LangChain Python package and the LangChain JS package keep the API in step.
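A sketch with the Python package; the chunk budget and overlap here are illustrative:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Token-measured splitting: chunk_size counts tiktoken tokens, not characters.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " ", ""],  # strongest boundary first
)

document_text = "..."  # your source document
chunks = splitter.split_text(document_text)
```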
A few things worth noticing. The Python version measures chunk size in tokens (because the underlying embedding model and context budget are token-counted), via the from_tiktoken_encoder constructor; the JS version defaults to characters and you should wrap it with a token counter if precise token budgets matter. The separators list defines the cut hierarchy — for markdown corpora, prepend ["\n# ", "\n## ", "\n### "] to keep section headers intact.
For semantic chunking, the basic algorithm is small enough to write directly. Embed each sentence, take a rolling similarity, cut when the similarity to the previous sentence drops below a percentile threshold.
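A sketch with the OpenAI Python SDK; the naive regex sentence splitter and the percentile default are illustrative:

```python
import re
import numpy as np
from openai import OpenAI

client = OpenAI()

def semantic_chunks(text: str, percentile: float = 20.0) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    if len(sentences) < 2:
        return [text]
    # One embedding call per sentence -- this is the cost noted above.
    resp = client.embeddings.create(model="text-embedding-3-small", input=sentences)
    vecs = np.array([d.embedding for d in resp.data])
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = (vecs[1:] * vecs[:-1]).sum(axis=1)   # cosine with previous sentence
    cutoff = np.percentile(sims, percentile)     # lowest-similarity gaps = boundaries
    chunks, current = [], [sentences[0]]
    for sentence, sim in zip(sentences[1:], sims):
        if sim < cutoff:                         # similarity dropped: start a new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```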
The TypeScript shape is identical with the OpenAI Node SDK; the only meaningful difference is that you’ll want a real sentence tokenizer (compromise or wink-nlp) instead of a regex. For production semantic chunking on real corpora, lean on LlamaIndex’s SemanticSplitterNodeParser rather than rolling your own — there are edge cases (heading-only sentences, very short paragraphs, language detection) that the libraries already handle.
Trade-offs, failure modes, gotchas
Chunk size and the generator are coupled. A 2k-token chunk feels generous until you remember that top-k retrieval means k of them go into the prompt. With k=5 and a 2k chunk size you’ve spent 10k tokens before the system prompt or query. Pick chunk size knowing how many chunks the generator will see and how much headroom you need for instructions and the answer.
Silent embedding truncation. Every embedding model has a context limit (text-embedding-3-small is 8191 tokens). Chunks that exceed it are silently truncated by the provider — you get an embedding for a different document than the one you stored. Always count tokens before sending, and assert the count against the model’s limit.
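A cheap guard, assuming tiktoken and the cl100k_base encoding that text-embedding-3-small uses:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
EMBED_LIMIT = 8191  # text-embedding-3-small's documented input limit

for chunk in chunks:  # `chunks` is whatever your splitter produced
    n_tokens = len(enc.encode(chunk))
    assert n_tokens <= EMBED_LIMIT, (
        f"chunk is {n_tokens} tokens; the provider would silently truncate it"
    )
```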
Boundary-straddling facts. “The deployment failed at 03:14 UTC because the readiness probe returned 503.” If the timestamp and the cause end up in different chunks, both chunks individually answer “what happened” badly. Overlap reduces this, but only papers over the symptom. The honest fixes are recursive or structural splitting (keep semantic units intact) and parent-document retrieval (embed small, return the whole surrounding section).
Overlap inflates everything. 20% overlap means 20% more chunks, 20% more embedding cost, 20% more index storage, and a near-guarantee of duplicate hits in top-k results that you’ll need to dedupe in the generator step. It’s a real trade, not a free win.
Small chunks make multi-hop reasoning worse, not better. Multi-hop questions require the model to combine information that lives in two or more places. Smaller chunks scatter that information across more retrieval candidates, increasing the chance that at least one needed hop is missed in top-k. Larger chunks (or parent-document retrieval) keep co-occurring facts co-located.
Structural chunking is fragile to upstream changes. A markdown splitter that relies on ## headers breaks the day someone writes a doc with **Bold Title** instead. AST-based code splitters break on parse errors. Both modes degrade gracefully if you chain back to a recursive splitter as the fallback path; both fail silently if you don’t.
Semantic chunking is O(N) embeddings before you’ve embedded anything for retrieval. On a corpus of millions of sentences this is the dominant cost. Reserve semantic chunking for high-value, low-volume corpora, or use it only on the long-tail documents that recursive splitting handles badly.
Chunking decisions are not portable across embedding models. A chunk size tuned for an 8191-token-limit asymmetric retrieval model is not optimal for a 512-token symmetric sentence model. Re-tune when you change models — and you will change models — using a labeled retrieval eval set, not vibes.
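The harness can be tiny. A sketch where examples pairs each query with the id of the chunk that answers it, and retrieve is whatever ranked-retrieval function is under test; both names are assumptions:

```python
def recall_at_k(examples: list[tuple[str, str]], retrieve, k: int = 5) -> float:
    """Fraction of queries whose gold chunk id appears in the top-k results."""
    hits = sum(gold_id in retrieve(query, k) for query, gold_id in examples)
    return hits / len(examples)
```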
Further reading
- Anthropic — Introducing Contextual Retrieval — the 2024 post showing that LLM-generated chunk-level context prefixes cut retrieval failures by 35–49%. The methodology and the eval setup are both worth studying; the technique composes cleanly with any chunker.
- Pinecone — Chunking Strategies for LLM Applications — a thorough walk-through of fixed-size, recursive, structural, and document-specific chunking with practical heuristics and code.
- Greg Kamradt — 5 Levels of Text Splitting — the most-cited practitioner notebook on the subject, from fixed-size up to agentic chunking, with live LangChain and LlamaIndex demos.
- LangChain — Text Splitters Conceptual Guide — the framework-side reference for choosing among character, token, semantic, structural, and code-aware splitters.
What to read next
- Vector Databases & ANN Indexes — where the chunks you produce here get stored: HNSW, IVF, and the operational trade-offs of pgvector, Qdrant, and Pinecone.
- Text Embeddings: Turning Meaning into Geometry — the geometry the chunker is feeding into: how embeddings are produced, why cosine similarity measures semantic distance, and why batched embedding matters at corpus scale.
- LLM Inference: Tokens, Context, and Sampling — the working-set model that explains why chunk size and top-k jointly determine whether retrieved context fits the generator’s prompt.