
LLM Inference: Tokens, Context, and Sampling

How LLMs actually process text: tokenization with BPE, the context window as working memory, KV caching, and sampling parameters that control output variance.


Every LLM API call is a black box until it isn’t. Understanding what actually happens inside — tokenization, autoregressive decoding, the KV cache, sampling — changes how you architect AI systems. It tells you why context-heavy applications cost more, why determinism requires explicit effort, and why filling a context window is never free.

This is the first article in the AI Engineering curriculum. It lays the mechanical foundation for everything that follows: RAG, memory systems, agents, evals. Before you can reason about managing context or retrieving relevant documents, you need to know what the context window is, what it costs, and how text becomes tokens becomes probabilities becomes output.

What an LLM actually is

An LLM is a function that maps a sequence of tokens to a probability distribution over the next token:

text
P(next_token | token_1, token_2, ..., token_n)

That’s the whole thing. You call it once, sample a token from the distribution, append it to the sequence, and call it again. Repeat until you hit a stop condition or max_tokens. This is autoregressive generation.

The model has no memory between calls. No state, no hidden accumulator. Everything it “knows” during a given inference call is in the input sequence — the context window.
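
The loop itself is small; all the machinery lives inside the single next-token call. A minimal sketch of the decode loop, with the model forward pass replaced by a toy stand-in (the uniform distribution and the stop-token id below are hypothetical, just to make the loop runnable):

python
import random

VOCAB_SIZE = 8
STOP = 0  # hypothetical stop-token id

def next_token_distribution(tokens: list[int]) -> list[float]:
    # Stand-in for the model forward pass: a real LLM returns a distribution
    # over its full vocabulary, conditioned on the entire token sequence.
    return [1.0 / VOCAB_SIZE] * VOCAB_SIZE

def decode_loop(prompt_tokens: list[int], max_tokens: int) -> list[int]:
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        probs = next_token_distribution(tokens)                         # P(next | token_1..token_n)
        next_tok = random.choices(range(VOCAB_SIZE), weights=probs)[0]  # sampling step
        tokens.append(next_tok)
        if next_tok == STOP:                                            # stop condition
            break
    return tokens

print(decode_loop([3, 5, 7], max_tokens=10))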

Tokenization: text → integers

Before text touches the model, a tokenizer converts it to a sequence of integer IDs from a fixed vocabulary (typically 50k–200k tokens).

Modern LLMs use Byte Pair Encoding (BPE), a compression algorithm applied to text. BPE begins with individual bytes and iteratively merges the most frequent adjacent pair into a new token, repeating until the target vocabulary size is reached. The resulting vocabulary assigns single tokens to common substrings (" the", "ing", "function") while rare sequences fall back to byte-level representations.
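
To make the merge loop concrete, here is a toy character-level version of BPE training; real tokenizers start from bytes and train over huge corpora, but the loop has the same shape:

python
from collections import Counter

def bpe_train(text: str, vocab_size: int) -> list[tuple[str, str]]:
    """Toy BPE trainer: start from characters, repeatedly merge the most frequent adjacent pair."""
    tokens = list(text)
    merges: list[tuple[str, str]] = []
    vocab = set(tokens)
    while len(vocab) < vocab_size:
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats; nothing left worth merging
        merges.append((a, b))
        vocab.add(a + b)
        # Re-tokenize: replace every occurrence of the winning pair with the merged token.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges

print(bpe_train("low lower lowest low low", vocab_size=16))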

Practical consequences that bite engineers:

  • Tokens ≠ words. "unfortunately" is one token; "чудовищен" might be six.
  • Whitespace is part of the token. " cat" (with a leading space) and "cat" are different token IDs.
  • Numbers tokenize poorly. "12345" might split as ["12", "345"] — this is why base LLMs are bad at arithmetic.
  • Code is efficient. Language keywords and common identifiers get single tokens; your prompts about code are cheaper than you think.
python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 tokenizer; commonly used as an approximation for other providers

text = "Attention is all you need."
tokens = enc.encode(text)
boundaries = [enc.decode([t]) for t in tokens]

print(f"Token count: {len(tokens)}")    # 6
print(f"Token IDs:   {tokens}")
print(f"Boundaries:  {boundaries}")    # ['Attention', ' is', ' all', ' you', ' need', '.']
typescript
// npm install tiktoken
import { get_encoding } from "tiktoken";

const enc = get_encoding("cl100k_base");
const text = "Attention is all you need.";
const tokens = enc.encode(text);

const boundaries = Array.from(tokens).map((t) =>
  new TextDecoder().decode(enc.decode(new Uint32Array([t])))
);

console.log(`Token count: ${tokens.length}`);   // 6
console.log(`Boundaries:  ${JSON.stringify(boundaries)}`);
enc.free(); // release WASM memory — easy to leak

tiktoken is OpenAI’s tokenizer library. Its counts are exact for OpenAI models and a close approximation for other providers, which is why it’s widely used as a cross-provider token-counting standard.

The context window as working memory

The context window is the model’s total working memory for a single inference call. Every token — system prompt, conversation history, retrieved documents, tool outputs, user message — competes for this finite space.

Representative limits at the time of writing: Claude 3.5 Sonnet at 200k tokens, GPT-4o at 128k, Gemini 1.5 Pro at 1M+. Capacity is not performance. Long-context attention degrades in practice: models “lose” information in the middle of very long contexts, and precision on complex reasoning tasks drops significantly past ~30k tokens even on models that technically support 200k.

Cost model: Transformer attention is O(n²) in sequence length during the prefill phase — every input token attends to every prior input token. A 100k-token prompt is ~10,000× more compute-intensive in the attention layers than a 1k-token prompt.

python
# pip install tiktoken
import tiktoken

def token_count(messages: list[dict[str, str]]) -> int:
    enc = tiktoken.get_encoding("cl100k_base")
    return sum(len(enc.encode(m["content"])) for m in messages)

def fits_in_window(
    messages: list[dict[str, str]],
    max_output: int = 1024,
    limit: int = 200_000,
) -> bool:
    return token_count(messages) + max_output <= limit
typescript
import { get_encoding } from "tiktoken";

function tokenCount(messages: Array<{ role: string; content: string }>): number {
  const enc = get_encoding("cl100k_base");
  const total = messages.reduce((sum, m) => sum + enc.encode(m.content).length, 0);
  enc.free();
  return total;
}

function fitsInWindow(
  messages: Array<{ role: string; content: string }>,
  maxOutput = 1024,
  limit = 200_000
): boolean {
  return tokenCount(messages) + maxOutput <= limit;
}

The distributed systems parallel

The context window maps cleanly onto the working set model from virtual memory:

LLM concept     | Systems analogue
----------------|-----------------
Token           | Cache line: the smallest unit of information
Context window  | Working set: everything currently resident in L1/L2
KV cache        | TLB: memoizes expensive computations for recently seen pages
RAG retrieval   | Page fault handler: fetches from slower storage on demand
Temperature     | Network jitter: controlled randomness to escape degenerate fixed points

The working set insight is the most operationally important one: if the information required for a task is not in the context, the model cannot access it. It will confabulate. RAG, tool calls, and extended context models are three different strategies for expanding the effective working set, each with different latency and cost profiles.

KV cache: amortizing attention cost

Without optimization, generating N output tokens requires N forward passes, each attending to all prior tokens. The KV cache stores the intermediate attention keys and values for every processed token. On each new decode step, only the single new token’s K/V vectors are computed; the rest are read from cache.

This produces two distinct operational phases with very different characteristics (a rough latency sketch follows the list):

  • Prefill — processing the full input prompt. Compute-bound, expensive, and unavoidable on a cold cache. This is where O(n²) attention cost is paid.
  • Decode — generating each output token. Memory-bandwidth-bound, fast, and cheap relative to prefill. You’re reading a large cache on each step.
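
The two phases show up directly in user-visible latency. A hand-wavy sketch (the throughput numbers are invented for illustration; real values depend on hardware, batching, and model size, and prefill cost actually grows superlinearly with prompt length):

python
def latency_seconds(prompt_tokens: int, output_tokens: int,
                    prefill_tok_per_s: float = 5_000,   # assumed prefill throughput
                    decode_tok_per_s: float = 50) -> tuple[float, float]:
    """Returns (time_to_first_token, total_time) under a simple two-phase model."""
    ttft = prompt_tokens / prefill_tok_per_s            # prefill dominates time-to-first-token
    total = ttft + output_tokens / decode_tok_per_s     # decode adds a roughly fixed cost per output token
    return ttft, total

# Same 500-token answer, very different latency depending on prompt size.
print(latency_seconds(prompt_tokens=2_000, output_tokens=500))    # (0.4, 10.4)
print(latency_seconds(prompt_tokens=100_000, output_tokens=500))  # (20.0, 30.0)

The takeaway: prompt length sets time-to-first-token, output length sets the tail.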

Prompt caching (offered by Anthropic, OpenAI, and others) persists the KV cache for a fixed prompt prefix across API calls. If your system prompt is 10k tokens, prompt caching turns 10k tokens of prefill cost into a cache hit on every subsequent call. This is one of the highest-ROI optimizations for production AI systems.
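
As a sketch of what this looks like with the Anthropic Python SDK (the cache_control field follows Anthropic’s prompt-caching docs; verify the exact shape against the current API reference before relying on it):

python
# pip install anthropic
import anthropic

client = anthropic.Anthropic()

LONG_SYSTEM_PROMPT = "..."  # imagine ~10k tokens of stable instructions and reference material

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=256,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Marks the prefix up to and including this block as cacheable across calls.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "First question against the cached prefix"}],
)

On later calls with an identical prefix, the usage block reports cached tokens separately (cache_read_input_tokens in current SDKs), which is how you confirm the cache is actually being hit.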

The KV cache also explains why long-context inference is expensive even if you only generate a short response: serving a 200k-token context requires tens of gigabytes of KV cache per concurrent request, consuming GPU memory that could otherwise serve additional users.
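
The arithmetic is easy to sanity-check. A back-of-the-envelope sketch assuming a hypothetical 70B-class model with grouped-query attention (80 layers, 8 KV heads of dimension 128, fp16 cache); the per-model numbers vary, but the formula does not:

python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    # Two tensors (K and V) per layer, one entry per token in the sequence.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len

# Hypothetical 70B-class config: 80 layers, 8 KV heads, head_dim 128, fp16 values.
per_request = kv_cache_bytes(seq_len=200_000, n_layers=80, n_kv_heads=8, head_dim=128)
print(f"{per_request / 1e9:.1f} GB of KV cache for one 200k-token request")  # ~65.5 GB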

Sampling: from probabilities to text

After the model computes logits (raw scores over the full vocabulary), the decoding pipeline applies the following steps (a small code sketch follows the list):

  1. Temperature scaling — divide all logits by T before softmax:

    • T → 0: greedy decoding, always picks the highest-probability token
    • T = 1: sample directly from the model’s output distribution
    • T > 1: flatter distribution — more variance, more likely to be incoherent at extremes
  2. Top-k — zero out all tokens except the k highest-probability ones before sampling. Hard cutoff, fixed k.

  3. Top-p (nucleus sampling) — zero out all tokens outside the smallest set whose cumulative probability ≥ p. Adaptive cutoff — the effective k varies with how peaked or flat the distribution is.
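
A minimal numpy sketch of steps 1 and 3 together (illustrative only; production inference stacks fuse this into the decode kernel, and top-k works the same way with a fixed cutoff):

python
# pip install numpy
import numpy as np

def sample_next(logits: np.ndarray, temperature: float = 1.0, top_p: float = 1.0) -> int:
    """Temperature scaling, softmax, then nucleus (top-p) filtering."""
    if temperature == 0:
        return int(np.argmax(logits))                     # greedy decoding
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())                 # numerically stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                       # most to least likely
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]                                 # smallest set with cumulative prob >= p
    kept = probs[keep] / probs[keep].sum()                # renormalize within the nucleus
    return int(np.random.choice(keep, p=kept))

logits = np.array([2.0, 1.0, 0.5, -1.0, -3.0])            # toy logits over a 5-token vocabulary
print(sample_next(logits, temperature=0.7, top_p=0.9))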

In practice: use temperature=0 for factual extraction, classification, and structured output. Use temperature=0.7, top_p=0.9 for open-ended generation. Ranges differ by provider (OpenAI accepts 0–2, Anthropic caps temperature at 1.0); on APIs that allow values above 1, anything past ~1.2 is usually just noise.

python
# pip install anthropic
import anthropic

client = anthropic.Anthropic()

def generate(prompt: str, temperature: float, top_p: float = 1.0) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=80,
        temperature=temperature,
        top_p=top_p,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text.strip()

prompt = "Finish the sentence: A reliable distributed system requires"

for t in [0.0, 0.7, 1.0]:  # Anthropic caps temperature at 1.0
    print(f"T={t}: {generate(prompt, temperature=t)}")
typescript
// npm install @anthropic-ai/sdk
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

async function generate(
  prompt: string,
  temperature: number,
  topP = 1.0
): Promise<string> {
  const response = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 80,
    temperature,
    top_p: topP,
    messages: [{ role: "user", content: prompt }],
  });
  return (response.content[0] as Anthropic.TextBlock).text.trim();
}

const prompt = "Finish the sentence: A reliable distributed system requires";
for (const t of [0.0, 0.7, 1.0]) {  // Anthropic caps temperature at 1.0
  console.log(`T=${t}: ${await generate(prompt, t)}`);
}

The Anthropic Messages API documentation covers the full parameter reference, including top_k (available on Claude but not exposed by all clients). The Vercel AI SDK provides a unified interface if you need to switch providers.

Trade-offs, failure modes, gotchas

Context window exhaustion is handled inconsistently. Depending on the provider and client library, an over-long input is either rejected with an error or silently truncated from the left, dropping the oldest tokens, which is usually the system prompt and early conversation history. Always count tokens before sending; don’t let the stack choose what to discard.

Temperature 0 is not truly deterministic. Floating-point non-associativity across hardware and batch configurations introduces variance even at T=0. For reproducibility in evals, run the same prompt multiple times and majority-vote, or use a fixed seed if the provider supports it.
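
A small sketch of the majority-vote approach, reusing the generate() helper from the sampling section above:

python
from collections import Counter

def majority_vote(prompt: str, runs: int = 5) -> str:
    # Repeated T=0 calls can still disagree; keep the most common answer.
    answers = [generate(prompt, temperature=0.0) for _ in range(runs)]
    return Counter(answers).most_common(1)[0][0]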

BPE tokenization breaks arithmetic and dates. Numbers split into sub-word tokens lose positional semantics. "2048" might be one token; "2049" might be two. This is a structural property, not a capability gap — you cannot prompt your way out of it.

Repetition and mode collapse at low temperature. With T near 0, once the model enters a high-probability loop, there is no escape — each token raises the probability of the next identical token. Add frequency/presence penalties, set T ≥ 0.3, or use stop sequences to prevent degenerate outputs.
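
A sketch of what OpenAI-style frequency and presence penalties do to the logits before sampling (the exact formula and parameter ranges vary by provider):

python
# pip install numpy
import numpy as np

def apply_penalties(logits: np.ndarray, generated: list[int],
                    frequency_penalty: float = 0.5,
                    presence_penalty: float = 0.3) -> np.ndarray:
    counts = np.bincount(generated, minlength=logits.shape[0])
    # Each prior occurrence lowers a token's logit (frequency penalty); any occurrence
    # at all lowers it once more (presence penalty), so exact repetition gets steadily less likely.
    return logits - frequency_penalty * counts - presence_penalty * (counts > 0)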

KV cache memory dominates at scale. When serving many concurrent long-context requests, KV cache VRAM consumption exceeds model weight memory. This is why long-context inference carries a premium price — it’s not the model complexity, it’s the per-request cache footprint.

Further reading

  • Text Embeddings: Turning Meaning into Geometry — how embedding models encode semantic meaning as dense vectors, the geometry of cosine similarity, and how to build semantic search on top of the context window model established here.
  • Chunking Strategies for Retrieval — once you know what fits in the working set, this article covers how to slice external documents into the units that will be retrieved into it.