LLM Inference: Tokens, Context, and Sampling
How LLMs actually process text: tokenization with BPE, the context window as working memory, KV caching, and sampling parameters that control output variance.
Every LLM API call is a black box until it isn’t. Understanding what actually happens inside — tokenization, autoregressive decoding, the KV cache, sampling — changes how you architect AI systems. It tells you why context-heavy applications cost more, why determinism requires explicit effort, and why filling a context window is never free.
This is the first article in the AI Engineering curriculum. It lays the mechanical foundation for everything that follows: RAG, memory systems, agents, evals. Before you can reason about managing context or retrieving relevant documents, you need to know what the context window is, what it costs, and how text becomes tokens becomes probabilities becomes output.
What an LLM actually is
An LLM is a function that maps a sequence of tokens to a probability distribution over the next token:
f(t_1, …, t_n) → P(t_{n+1} | t_1, …, t_n)
That’s the whole thing. You call it once, sample a token from the distribution, append it to the sequence, and call it again. Repeat until you hit a stop condition or max_tokens. This is autoregressive generation.
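The loop can be sketched in a few lines. The "model" here is a hard-coded toy lookup table standing in for a real LLM's next-token distribution; everything else is the actual shape of autoregressive decoding:

```python
def model(tokens):
    # Toy stand-in: returns a probability distribution over a 4-token
    # vocabulary, conditioned only on the last token. A real LLM
    # conditions on the entire sequence.
    table = {
        0: [0.1, 0.7, 0.1, 0.1],
        1: [0.1, 0.1, 0.7, 0.1],
        2: [0.1, 0.1, 0.1, 0.7],  # token 3 acts as the stop token
    }
    return table.get(tokens[-1], [0.25, 0.25, 0.25, 0.25])

def generate(prompt, max_tokens=10, stop_token=3):
    tokens = list(prompt)
    for _ in range(max_tokens):
        probs = model(tokens)                 # one forward pass
        next_token = probs.index(max(probs))  # greedy sampling (T -> 0)
        tokens.append(next_token)             # append, then call again
        if next_token == stop_token:
            break
    return tokens

print(generate([0]))  # -> [0, 1, 2, 3]
```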
The model has no memory between calls. No state, no hidden accumulator. Everything it “knows” during a given inference call is in the input sequence — the context window.
Tokenization: text → integers
Before text touches the model, a tokenizer converts it to a sequence of integer IDs from a fixed vocabulary (typically 50k–200k tokens).
Modern LLMs use Byte Pair Encoding (BPE) — a compression algorithm applied to text. BPE starts from individual bytes and iteratively merges the most frequent adjacent pair into a new token, repeating until the target vocabulary size is reached. The resulting vocabulary assigns single tokens to common substrings (` the`, `ing`, `function`) while rare sequences fall back to byte-level representations.
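One merge step of that loop can be sketched directly (training a real tokenizer just repeats this thousands of times):

```python
from collections import Counter

def most_frequent_pair(ids):
    # Count adjacent pairs in the token sequence.
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    # Replace every occurrence of `pair` with the new token id.
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Start from raw bytes; each merge grows the vocabulary by one token.
ids = list(b"low lower lowest")
pair = most_frequent_pair(ids)  # ("l", "o") appears three times
ids = merge(ids, pair, 256)     # 256 = first id beyond the byte range
```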
Practical consequences that bite engineers:
- Tokens ≠ words. `"unfortunately"` is one token; `"ундовищен"` might be six.
- Whitespace is part of the token. `" cat"` (with a leading space) and `"cat"` are different token IDs.
- Numbers tokenize poorly. `"12345"` might split as `["12", "345"]` — this is why base LLMs are bad at arithmetic.
- Code is efficient. Language keywords and common identifiers get single tokens; your prompts about code are cheaper than you think.
tiktoken is OpenAI’s tokenizer library, widely used as a token-counting standard across providers.
The context window as working memory
The context window is the model’s total working memory for a single inference call. Every token — system prompt, conversation history, retrieved documents, tool outputs, user message — competes for this finite space.
Current limits (mid-2024): Claude 3.5 Sonnet at 200k tokens, GPT-4o at 128k, Gemini 1.5 Pro at 1M+. Capacity is not performance. Long-context attention degrades in practice: models “lose” information in the middle of very long contexts, and precision on complex reasoning tasks drops significantly past ~30k tokens even on models that technically support 200k.
Cost model: Transformer attention is O(n²) in sequence length during the prefill phase — every input token attends to every prior input token. A 100k-token prompt is ~10,000× more compute-intensive in the attention layers than a 1k-token prompt.
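The quadratic scaling is easy to verify back-of-envelope:

```python
def attention_cost_ratio(long_len: int, short_len: int) -> float:
    # Prefill attention work scales with the square of sequence length,
    # so the relative cost is the ratio of the squares.
    return (long_len ** 2) / (short_len ** 2)

print(attention_cost_ratio(100_000, 1_000))  # -> 10000.0
```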
The distributed systems parallel
The context window maps cleanly onto the working set model from virtual memory:
| LLM concept | Systems analogue |
|---|---|
| Token | Cache line — smallest unit of information |
| Context window | Working set — everything currently resident in L1/L2 |
| KV cache | TLB — memoizes expensive computations for recently seen pages |
| RAG retrieval | Page fault handler — fetches from slower storage on demand |
| Temperature | Network jitter — controlled randomness to escape degenerate fixed points |
The working set insight is the most operationally important one: if the information required for a task is not in the context, the model cannot access it. It will confabulate. RAG, tool calls, and extended context models are three different strategies for expanding the effective working set, each with different latency and cost profiles.
KV cache: amortizing attention cost
Without optimization, generating N output tokens requires N forward passes, each attending to all prior tokens. The KV cache stores the intermediate attention keys and values for every processed token. On each new decode step, only the single new token’s K/V vectors are computed; the rest are read from cache.
This produces two distinct operational phases with very different characteristics:
- Prefill — processing the full input prompt. Compute-bound, expensive, and unavoidable on a cold cache. This is where O(n²) attention cost is paid.
- Decode — generating each output token. Memory-bandwidth-bound, fast, and cheap relative to prefill. You’re reading a large cache on each step.
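A toy illustration of why the cache pays off, counting K/V "projections" instead of doing real attention (the projection here is a placeholder, not real math):

```python
class KVCache:
    def __init__(self):
        self.keys, self.values = [], []
        self.computed = 0  # how many K/V projections we actually ran

    def project(self, token):
        # Stand-in for the real key/value projection of one token.
        self.computed += 1
        return token, token

    def extend(self, tokens):
        for t in tokens:
            k, v = self.project(t)
            self.keys.append(k)
            self.values.append(v)

cache = KVCache()
cache.extend(range(1000))    # prefill: 1000 projections, paid once
for step in range(100):      # decode: one new projection per output token
    cache.extend([1000 + step])

# 1100 total. Without the cache, every decode step would re-project the
# entire prefix: 1000 + 1001 + ... + 1100, roughly 100x more work.
print(cache.computed)  # -> 1100
```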
Prompt caching (offered by Anthropic, OpenAI, and others) persists the KV cache for a fixed prompt prefix across API calls. If your system prompt is 10k tokens, prompt caching turns 10k tokens of prefill cost into a cache hit on every subsequent call. This is one of the highest-ROI optimizations for production AI systems.
The KV cache also explains why long-context inference is expensive even if you only generate a short response: serving a 200k-token context requires tens of gigabytes of KV cache per concurrent request, consuming GPU memory that could otherwise serve additional users.
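A back-of-envelope sizing makes "tens of gigabytes" concrete. The configuration below is hypothetical (an 80-layer model with grouped-query attention, 8 KV heads of dimension 128, fp16 weights), chosen only to illustrate the arithmetic:

```python
def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=8,
                   head_dim=128, dtype_bytes=2):
    # 2x for keys AND values, per layer, per KV head, per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    return seq_len * per_token

gb = kv_cache_bytes(200_000) / 1e9
print(f"{gb:.1f} GB per 200k-token request")  # ~65.5 GB under these assumptions
```

Note that models using full multi-head attention (more KV heads) or longer contexts scale this linearly upward, which is why grouped-query attention exists in the first place.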
Sampling: from probabilities to text
After the model computes logits (raw scores over the full vocabulary), the decoding pipeline applies:
Temperature scaling — divide all logits by T before softmax:
- T → 0: greedy decoding, always picks the highest-probability token
- T = 1: sample directly from the model’s output distribution
- T > 1: flatter distribution — more variance, more likely to be incoherent at extremes
Top-k — zero out all tokens except the k highest-probability ones before sampling. Hard cutoff, fixed k.
Top-p (nucleus sampling) — zero out all tokens outside the smallest set whose cumulative probability ≥ p. Adaptive cutoff — the effective k varies with how peaked or flat the distribution is.
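The three stages compose into one pipeline. A minimal sketch in pure Python (real inference stacks do this over the full vocabulary on the GPU):

```python
import math, random

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    if temperature == 0:
        return logits.index(max(logits))  # greedy decoding
    # Temperature scaling: divide logits by T before softmax.
    probs = softmax([x / temperature for x in logits])

    # Rank tokens by probability, then apply top-k / top-p cutoffs.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    if top_p is not None:
        kept, cum = [], 0.0
        for i in order:
            kept.append(i)
            cum += probs[i]
            if cum >= top_p:  # smallest set with cumulative prob >= p
                break
        order = kept

    # Renormalize over the surviving tokens and sample.
    total = sum(probs[i] for i in order)
    r = rng.random() * total
    for i in order:
        r -= probs[i]
        if r <= 0:
            return i
    return order[-1]
```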
In practice: use temperature=0 for factual extraction, classification, and structured output. Use temperature=0.7, top_p=0.9 for open-ended generation. Above temperature=1.2 you’re usually just generating noise.
The Anthropic Messages API covers the full parameter reference including top_k (available on Claude but not exposed by all clients). The Vercel AI SDK provides a unified interface if you need to switch providers.
Trade-offs, failure modes, gotchas
Context window exhaustion is silent by default. If your input exceeds the model’s context limit, providers typically truncate from the left — dropping the oldest tokens, which is usually the system prompt and early conversation history. Always count tokens before sending; don’t let the provider choose what to discard.
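A defensive trimming pass is simple to write. This sketch pins the system prompt and drops the oldest turns first; the word-split token counter is a crude stand-in, so swap in the provider's tokenizer in production:

```python
def trim_to_budget(messages, budget,
                   count_tokens=lambda m: len(m.split())):
    # messages[0] is the system prompt and is never dropped.
    # count_tokens is a crude stand-in -- use the provider's
    # tokenizer (e.g. tiktoken) for real token counts.
    system, turns = messages[0], list(messages[1:])
    while turns and count_tokens(system) + sum(map(count_tokens, turns)) > budget:
        turns.pop(0)  # drop the oldest turn, never the system prompt
    return [system] + turns
```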
Temperature 0 is not truly deterministic. Floating-point non-associativity across hardware and batch configurations introduces variance even at T=0. For reproducibility in evals, run the same prompt multiple times and majority-vote, or use a fixed seed if the provider supports it.
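The majority-vote pattern is a one-liner over repeated runs:

```python
from collections import Counter

def majority_vote(outputs):
    # Pick the most common answer across N runs of the same prompt.
    return Counter(outputs).most_common(1)[0][0]

majority_vote(["42", "42", "41"])  # -> "42"
```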
BPE tokenization breaks arithmetic and dates. Numbers split into sub-word tokens lose positional semantics. "2048" might be one token; "2049" might be two. This is a structural property, not a capability gap — you cannot prompt your way out of it.
Repetition and mode collapse at low temperature. With T near 0, once the model enters a high-probability loop, there is no escape — each token raises the probability of the next identical token. Add frequency/presence penalties, set T ≥ 0.3, or use stop sequences to prevent degenerate outputs.
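The penalty mechanism can be sketched as a logit adjustment; the penalty values below are illustrative, and providers implement their own variants:

```python
from collections import Counter

def apply_penalties(logits, generated,
                    freq_penalty=0.5, presence_penalty=0.3):
    # Subtract from the logit of every already-generated token: a flat
    # presence penalty, plus a frequency penalty scaled by occurrence count.
    counts = Counter(generated)
    return [
        logit - (presence_penalty + freq_penalty * counts[tok])
        if tok in counts else logit
        for tok, logit in enumerate(logits)
    ]
```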
KV cache memory dominates at scale. When serving many concurrent long-context requests, KV cache VRAM consumption exceeds model weight memory. This is why long-context inference carries a premium price — it’s not the model complexity, it’s the per-request cache footprint.
Further reading
- Lilian Weng — The Transformer Family — her series on attention mechanisms and transformer variants is the most thorough technical treatment written for practitioners.
- Chip Huyen — LLM serving and inference — covers the production side of inference: batching, hardware, and cost optimization in depth.
- Simon Willison — LLMs and tokenization — grounded, empirical writing on how these systems behave (and misbehave) in real applications.
What to read next
- Text Embeddings: Turning Meaning into Geometry — how embedding models encode semantic meaning as dense vectors, the geometry of cosine similarity, and how to build semantic search on top of the context window model established here.
- Chunking Strategies for Retrieval — once you know what fits in the working set, this article covers how to slice external documents into the units that will be retrieved into it.