$ cat ai-engineering/prompt-caching.md

Prompt Caching: Reusing the KV Cache Across Calls

How prompt caching reuses the KV cache across API calls: Anthropic breakpoints, OpenAI's automatic prefix cache, Gemini context cache, and cost math.

Jatin Bansal@blog:~/ai-engineering$ open prompt-caching

A coding agent re-reads the same 40k-token system prompt — tool catalog, repo conventions, safety rules — on every turn. With prompt caching disabled, each turn pays the full prefill cost: 40k tokens of attention computation, at the input-token rate, for content that is byte-identical to the previous call. Turn it on and that 40k drops to 4k worth of compute (10% read price on most providers), with the prefill latency falling proportionally. The same lever applies to long retrieved-context RAG, few-shot prompts with fat exemplar pools, and any chat product with a stable system message. Prompt caching is the single highest-ROI optimization in production LLM serving, and there’s no quality trade-off — the model sees the same tokens; the provider just skips the work it already did.

Opening bridge

Yesterday’s piece on streaming covered the decode phase made visible to the user — token-by-token push over SSE, partial JSON for tool calls, cancellation propagated end-to-end. Today’s article sits on the other side of the same inference call: the prefill phase, which the LLM inference fundamentals piece framed as the O(n²) compute-bound step where every input token attends to every prior input token. Prefill is where long-context apps burn money. The KV cache that makes decode cheap also makes prefill skippable across requests — if you structure your prompts so the provider can identify the reusable prefix. This piece is the operational version of that idea.

What prompt caching actually is

The KV cache stores the attention keys and values computed for every token during prefill. Within a single decode loop, the KV cache amortizes attention so each new output token only computes against fresh state — that’s the intra-call optimization the inference fundamentals article covered. Prompt caching extends that idea across calls: persist the K/V tensors for a fixed prompt prefix on the inference machine, and the next request that arrives with the same prefix skips prefill on the cached region entirely.

What gets cached is the actual GPU-resident tensor state, not the input tokens. The provider hashes your prefix (token IDs, in order) to a content key; subsequent requests whose first N tokens hash to the same key get a cache hit. Cache hits are billed at roughly 10% of base input price, sometimes less; cache writes are billed at slightly above base input price (1.25× on Anthropic for 5-minute TTL, 2.0× for 1-hour TTL). The latency win is just as material as the cost win: prefill on a 100k-token prompt typically takes 2–5 seconds on frontier models; a cache hit shaves that down to a few hundred milliseconds.

The hard constraint: prefix-match only. The cache is keyed on a strict prefix of token IDs. Change one token at position 5 of a 50k prompt and the cache hit ends at position 4 — you write a new cache from token 5 onward. This is the rule that drives every structuring decision below.

The distributed-systems parallel

Prompt caching is memoization with an LRU, but the cache key is the prefix itself rather than an explicit argument list. The closer parallel from systems is prefix-shared B-tree pages: when many keys share a common prefix, the upper-tree pages are shared physically; only the leaves diverge. The LLM equivalent — many requests sharing the same 40k token system prompt and tool catalog — places the shared prefix in the upper tier of inference state and only diverges at the user-specific suffix.

The CDN parallel is even tighter. A cache hit on a CDN edge node depends on (a) the request mapping to a cache-eligible URL, (b) the URL hashing to the same edge node as a previous request, and (c) the cached object not having been evicted yet. Prompt caching has the same three preconditions:

CDN	Prompt cache
Cache-key normalization (query string, headers)	Prefix-byte identity, including whitespace and token boundaries
Edge-node routing (hash on URL/Host)	Inference-machine routing (hash on prefix, optionally `prompt_cache_key`)
Object eviction (LRU + TTL)	Prefix eviction (LRU per machine, TTL 5 min / 1 hr / 24 hr by provider)

The implication that catches teams: the cache is per machine, not global. If your traffic spreads across N inference replicas and each request lands on a random one, you write N copies of the cache before getting consistent hits. Anthropic and OpenAI both publish guidance that high-frequency cache use (>15 requests per minute per prefix) is when the routing actually converges. Below that volume, you’re paying cache-write prices on most requests. The fix is the same as a CDN: pin the route. OpenAI exposes prompt_cache_key for exactly this purpose — it’s a shard hint, not a key namespace. Setting prompt_cache_key = "tenant_42" increases the chance that all of tenant 42’s requests land on the same physical inference replica, which is where their warm cache lives.

The three provider models

The major providers expose the same underlying mechanism through three different control surfaces. Pick the abstraction you can live with.

Anthropic: explicit breakpoints

Anthropic’s prompt caching, introduced August 2024, is the most explicit of the three. You mark up to 4 cache breakpoints per request with cache_control: {"type": "ephemeral", "ttl": "5m" | "1h"}. Each breakpoint says “cache everything up to here, including the block this lives on.” The provider walks the request top-to-bottom in the order tools → system → messages and caches the longest prefix ending at the highest breakpoint it can find a hit for. TTL defaults to 5 minutes; the 1-hour option costs 2× standard input per write but is the right call for stable per-user contexts that flow through human-paced conversation.

Minimum cacheable sizes vary by model: 1,024 tokens for Sonnet 4.6 and 4.5, 4,096 tokens for Opus 4.7/4.6/4.5 and Haiku 4.5. Below the threshold the request runs uncached with no error — you’ll see cache_creation_input_tokens: 0 in the usage block. Cache writes cost 1.25× base for 5-minute TTL; cache reads cost 0.10× base — a 92% discount on every hit.

OpenAI: automatic prefix caching

OpenAI’s prompt caching is automatic on all supported models for prompts ≥1,024 tokens, with no markup required. The provider hashes the longest prefix that’s been previously computed (in 128-token increments) and routes the request based on that hash. Cached tokens cost roughly half of base input on the 4-family models and 10% on the 5-family models; the announced ceiling is “up to 90% input-token cost reduction and 80% latency reduction.” The response surfaces hits as usage.prompt_tokens_details.cached_tokens.

The single knob is the optional prompt_cache_key — a routing hint, not a namespace. Setting it improves hit rate for high-volume workloads with shared prefixes by giving the load balancer a stable shard key. Don’t make it too narrow (one key per user, low traffic per key, defeats sharing) or too broad (one key for everything, traffic overflows multiple machines, defeats stickiness). Per-tenant or per-application-version is the right granularity. The cache TTL is 5–10 minutes of inactivity (up to one hour maximum on most models), extending to 24 hours on the gpt-5.5 family.

Google Gemini: implicit + explicit

Gemini exposes two flavors. Implicit caching is on by default for Gemini 2.5 and later — same hands-off model as OpenAI, with a 90% discount on cache hits for the 2.5+ models (75% on 2.0). Explicit caching via the cachedContents resource gives you a named handle, your own TTL, and a guaranteed discount on every reference to that cache, in exchange for a per-hour storage cost ($4.50 / 1M tokens / hour on Pro models, $1.00 on Flash). Use implicit for chat-style workloads with natural prefix sharing; reach for explicit when you’re broadcasting one large context (a 200k-token book, a code repo dump) to many short user queries and want the discount nailed down in writing.

Structuring a prompt for cache hits

The structural rule across providers is the same: stable content first, dynamic content last. The prefix-match constraint means anything that changes between requests must live after the cache breakpoint. The standard layout, for a chat agent:

Tool definitions (most stable — change on deploy, not per-request)
System prompt and policy (stable — change on deploy)
Static context: knowledge base, retrieved documents valid for the session, persona profile (stable — change per session)
Conversation history (grows — new turns appended)
Current user message (changes every turn)

The cache breakpoint sits on the boundary between (3) and (4) for read-heavy contexts, or between (4) and (5) for sliding-conversation patterns. Each turn’s previous user/assistant pair becomes part of the next turn’s cacheable prefix — the breakpoint walks forward as the conversation grows. Anthropic explicitly supports this via the lookback rule: each breakpoint searches the prior 20 blocks for a previous cache write, so adding one new user turn doesn’t invalidate the cache built up across the last 19 turns.

Three structuring rules that catch teams in production:

No timestamps in the prefix. A system block that includes "As of 2026-05-18 12:19:01 UTC..." invalidates the cache every second the wall clock advances. Move the timestamp into the user message or strip it entirely; if the model genuinely needs the current time, inject it via a tool result, not the system prompt.

No per-request identifiers in the prefix. Embedding the user ID, request ID, or session ID in the system prompt — anywhere before the cache breakpoint — gives every user their own cache prefix and you write a fresh cache on every first call per user. Move identifiers into the user message or, for tool-using agents, into tool inputs/results.

Tool definitions are part of the prefix. Adding a tool, renaming a tool, or even changing the description field on an existing tool changes the tokenized prefix and invalidates every downstream cache — system, messages, the lot. Anthropic and OpenAI both warn that tool-schema churn is the most common cause of “I turned caching on and didn’t see the savings.” Treat your tool list like a deployable artifact: change it deliberately, not on every code commit. The same discipline applies to the system prompt itself — every newline shuffle, every “you are a helpful assistant” edit, is a cache flush.

Code: Python with the Anthropic SDK

Install: pip install anthropic. The Anthropic SDK is the canonical way to mark breakpoints, and the explicit breakpoint pattern is worth memorizing because the OpenAI/Gemini APIs are degenerate cases of it.

python

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
# pip install anthropic
from anthropic import Anthropic

client = Anthropic()

LONG_SYSTEM_CONTEXT = open("knowledge_base.md").read()  # 30k+ tokens

TOOLS = [
    {"name": "search", "description": "...", "input_schema": {...}},
    {
        "name": "get_document",
        "description": "...",
        "input_schema": {...},
        # Cache breakpoint at the end of the tools array. Everything above
        # (tools) gets cached when this fires.
        "cache_control": {"type": "ephemeral"},
    },
]

def chat(messages: list[dict]) -> dict:
    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        tools=TOOLS,
        system=[
            {
                "type": "text",
                "text": "You are a research assistant. Cite sources.",
            },
            {
                "type": "text",
                "text": LONG_SYSTEM_CONTEXT,
                # Second breakpoint at the end of system. tools+system are
                # both cached as one prefix.
                "cache_control": {"type": "ephemeral", "ttl": "1h"},
            },
        ],
        messages=messages,
    )
    u = response.usage
    print(
        f"input={u.input_tokens}  "
        f"cache_read={u.cache_read_input_tokens}  "
        f"cache_write={u.cache_creation_input_tokens}  "
        f"output={u.output_tokens}"
    )
    return response.model_dump()

What to watch on the first vs second call. First call: cache_read_input_tokens=0, cache_creation_input_tokens equals the cached prefix size — you paid 1.25× base for the write (or 2× for the 1-hour TTL on the system block). Second call within the TTL window: cache_read_input_tokens equals the cached prefix size, cache_creation_input_tokens=0, and you paid 0.10× base on those tokens. The input_tokens field counts only the tokens after the last cache breakpoint — the dynamic user message and any new conversation turns. A successful cache strategy shows cache_read dominating input_tokens by 10× or more.

The TTL refresh rule is non-obvious: each use of a cached prefix resets its TTL. A 5-minute cache that gets hit every 4 minutes lives indefinitely. A 1-hour cache that goes 70 minutes between hits expires and gets re-written on the next request. Pre-warming — sending a max_tokens=0 request at session start to install the cache before the user types — is a real pattern; Anthropic documents it and most production deployments use it for the first-paint case.

Code: TypeScript with the OpenAI Responses API

OpenAI’s caching is automatic, so the code change is small. The interesting part is using prompt_cache_key to pin routing and reading the cached_tokens field. Install: npm install openai.

typescript

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
// npm install openai
import OpenAI from "openai";

const client = new OpenAI();

const SYSTEM_PROMPT = await Bun.file("system_prompt.md").text(); // ~5k tokens
const STATIC_CONTEXT = await Bun.file("knowledge_base.md").text(); // ~30k tokens

async function chat(
  tenantId: string,
  conversation: Array<{ role: "user" | "assistant"; content: string }>,
) {
  const response = await client.responses.create({
    model: "gpt-5.5",
    // Stable prefix first — the SDK concatenates these in order.
    input: [
      { role: "system", content: SYSTEM_PROMPT },
      { role: "system", content: STATIC_CONTEXT },
      ...conversation, // dynamic
    ],
    // Routing hint — pin tenant's requests to the same inference replica.
    // Not a key namespace; doesn't guarantee stickiness.
    prompt_cache_key: `tenant:${tenantId}`,
    max_output_tokens: 1024,
  });

  const usage = response.usage;
  console.log({
    input_tokens: usage.input_tokens,
    cached_tokens: usage.input_tokens_details?.cached_tokens ?? 0,
    output_tokens: usage.output_tokens,
  });

  return response.output_text;
}

The first call for tenant:42 writes the cache on whichever replica gets the request. Subsequent calls within 5–10 minutes (the default OpenAI TTL window) that hash to the same replica hit the cache. If tenant 42 makes 2 requests per minute, both will hit; if they make 0.5 requests per minute, the cache may evict between calls. The prompt_cache_key value increases the chance of stickiness, but it’s best-effort — at >15 RPM per prefix per machine, the request may overflow to a second machine and write a second cache.

A practical signal to monitor in production: cached_tokens / input_tokens. A healthy long-context app sits at 0.85–0.95 (most of the prefix is being read from cache). Below 0.5 means your cache is mostly missing — usually because the prefix isn’t stable, or traffic per prefix is too low for routing to converge.

Cost math worth doing once

Take a coding agent with a 40k-token system prompt and tool catalog, average 8 turns per session, 2,000 average user-message tokens, 1,500 average assistant tokens. Frontier-model pricing of $15/M input, $75/M output is a reasonable mid-2026 stand-in.

Without caching, each turn pays 40k + history input. The 8-turn session totals roughly 320k + 64k history = 384k input tokens, plus 12k output. Cost: 384k × $15/M + 12k × $75/M = $5.76 + $0.90 = $6.66/session.

With caching at 10% read price, the system prompt is paid full price once (1.25× for the write = $0.75) and 0.1× on every subsequent read across 7 more turns (7 × 40k × $1.5/M = $0.42). Conversation history rolls into the cache too: the last few turns reread at 0.1×. Total session cost falls to roughly $1.50–$2.00. A ~70% session cost reduction, no quality change, no UX change — just structural prompt discipline.

The latency story is the same kind of free win. Prefill latency for a 40k-token Anthropic call is typically 1.5–3 seconds; the cache-read path runs in 100–300ms. That’s three seconds of TTFT recovered for every cached turn, on top of whatever streaming buys you. For interactive agents this often matters more than the cost.

Trade-offs, failure modes, gotchas

The cache is per inference replica, not global. Low-traffic workloads with no prompt_cache_key discipline behave as if they had no cache — the routing layer fans requests out, each machine writes its own copy, none survive long enough to be hit. Below ~15 RPM per shared prefix, prompt caching often doesn’t pay for itself. The mitigation is volume aggregation: route all of tenant X’s traffic through the same shard key, or batch requests through a worker that holds a warm session.

A single byte change at the wrong position invalidates everything downstream. This is the single hardest failure mode to debug because the API doesn’t tell you which byte broke the prefix match — you just see cache_creation_input_tokens ticking up where you expected cache_read. Diff your serialized request top-to-bottom against the last known cache-hit version. The common culprits: a re-ordered key in a JSON tool input schema, a trailing newline that disappeared when someone re-saved a markdown file, an interpolated user name in the system message (“Hi, Alice, welcome back…”), a date stamp injected by a logging middleware.

TTL is wall-clock from last hit, not from creation. A 5-minute Anthropic cache that gets hit every 4 minutes lives forever; one that goes 6 minutes without a hit is gone and the next call writes a new cache at full write price. Plan TTLs around your traffic shape: 5 minutes works for active chat sessions, 1 hour fits human-paced flows with multi-minute thinking gaps, 24 hours (OpenAI 5.5 family) is for long-running batch workloads.

Cache writes are more expensive than base input. A single-shot prompt that you don’t expect to repeat is cheaper without caching — you pay 1.25× write cost for zero reads. The break-even is roughly the second use within the TTL window. Don’t reflexively wrap every system prompt in cache_control; cache only what you reuse.

Tool-schema churn is invisible but expensive. If your tool definitions are generated from code (e.g., Pydantic models, Zod schemas) and the generation order isn’t stable, every deploy can shuffle the serialized JSON and invalidate every cache. Snapshot the canonical tool JSON in your build pipeline and diff it across deploys; treat tool-schema changes as a deliberate breaking change. The tool-use article covered the call/result loop; cache hygiene is the production complement.

Caching plays badly with high-cardinality per-user state if you put it in the prefix. Each unique user-prefix combination creates a separate cache entry. If your “static context” is actually per-tenant (their data, their policies), you’re writing N caches across N tenants, all competing for limited cache slots on each replica. Either accept the write cost and pin routing per tenant (prompt_cache_key), or move the per-tenant content into the user message and keep the prefix tenant-agnostic.

Implicit caching is auditable in usage, but the cache itself is not. OpenAI and Gemini implicit caching give you no API to query “is my prefix cached on this machine right now?” You infer the state from cached_tokens. For systems where cache state matters operationally (warm starts, deployment rollouts, region failover), explicit caching (Anthropic breakpoints, Gemini cachedContents) is the only path to determinism.

Context engineering and caching tension at the seams. JIT context loading fetches only what the next step needs — minimizing token cost per call, but defeating prefix caching because each call’s prefix is different. AOT assembly pre-packs the prompt — large prefixes, perfect for caching, but more tokens per call. The right answer is usually a tiered split: cache the stable AOT layer (tools, policies, persona, foundational documents), JIT-fetch the volatile suffix. The structuring tax for caching is real; pay it where the volume justifies it.

Cache invalidation on model upgrades. Switching from claude-sonnet-4-6 to claude-opus-4-7 invalidates every cache on the old model — they’re keyed per model. The same applies across point releases of the same model family. Coordinate cache-warming with model rollouts, or accept a one-window cold-start cost on cutover.