$ cat ai-engineering/context-engineering-jit-vs-aot.md

Context Engineering: JIT vs AOT Context Loading

Context as the scarcest resource in an LLM call: how AOT prepacking and JIT retrieval compose, and the OS prefetch-vs-demand-paging parallel.

Jatin Bansal@blog:~/ai-engineering$ open context-engineering-jit-vs-aot

A team migrates their support assistant from Claude Sonnet 4.5 to Sonnet 4.6 and the 1M-token context window. The new plan is simple: stop maintaining the retrieval pipeline, dump the entire 800k-token knowledge base into every system prompt, let the model find what it needs. The first week, eval scores fall. Faithfulness drops three points; multi-hop accuracy collapses. The model is technically getting more information per call and is somehow worse at using it. A bigger window did not make the context-management problem go away; it relocated it. The constraint that was “the index doesn’t have it” became “the model can’t find it in the haystack you sent.” That migration is the entire field of context engineering compressed into one painful sprint.

Opening bridge

The RAG evaluation piece earlier today ended on a number: recall@retrieve_k vs recall@k, the two checkpoints in a retrieval cascade. That framing implicitly assumes a particular shape — retrieve some documents, hand them to the generator, done. Today’s piece sits one layer above. Once you have a cascade that can produce the right candidate set, you still have to decide when the model sees those candidates: all of them, prepacked into the prompt before the call, or fetched on demand as the model works through the problem. That decision is context engineering, and it is the bridge from RAG (a single retrieve-then-generate hop) into the Agents subtree we’ll open after this article.

What context engineering actually is

Context engineering is the discipline of deciding, for every call to the model, which tokens go in the window and when. The term was popularized in Anthropic’s “Effective context engineering for AI agents” post (September 2025), which framed it as “the set of strategies for curating and maintaining the optimal set of tokens during LLM inference.” It is the superset of prompt engineering — the prompt is one tile of the context — and it generalizes RAG, which is one specific strategy for how to fill the rest.

The shift in vocabulary matters because the constraint moved. When the working assumption was a 4k or 8k window, the question was “which retrieval results fit?” When the window stretched to 200k or 1M, the question became “which tokens should be there, given that you can put almost anything in but the model will use only some of it well?” Context becomes the scarcest resource in the system after the model itself is fixed, and the cost of misallocating it is paid in two ways: dollars per token and, more importantly, attention diluted across irrelevant material.

The first-principles intuition

Pretend the model is a junior engineer with perfect short-term memory and zero persistent memory. For every task, you hand them a packet of materials and ask them to produce an answer. Two ways to do it: hand them a binder with everything that might be relevant (“here’s the entire wiki, find what you need”) or hand them a search slip and a phone (“call the librarian when you need something”). The first is faster per question because there are no round-trips; it is also wasteful per question because most of the binder is unread, and it gets worse as the binder gets thicker — past some thickness the engineer starts skimming and misses things. The second is slower per question but uses materials proportional to what the question actually requires.

That is the AOT/JIT split. AOT (ahead-of-time) context assembly packs the prompt with retrieved material before the call: classical RAG, system-prompt stuffing, conversation-history concatenation. JIT (just-in-time) context retrieval equips the model with tools and lets it pull material into the window as it discovers what it needs: agent loops, MCP servers, search-as-a-tool. Real production systems compose both — a thin AOT base layer of universally needed material, a JIT layer for everything the specific task requires.

The distributed-systems parallel

This is prefetching vs demand paging in an operating system, lifted intact. Prefetching loads pages into RAM before they’re referenced, betting on locality of access; demand paging waits for a page fault and pays the latency at the moment of need. Prefetching wins when the access pattern is predictable and the working set is small enough to fit; demand paging wins when access is irregular and the full address space is too large to keep resident. The trade is identical for LLMs: the prompt is the RAM, the corpus is the address space on disk, and the choice is whether to pay the latency of a tool round-trip per access (JIT) or the cost of carrying unused tokens in every call (AOT).

The deeper parallel is the working-set model: at any moment in a task, only some fraction of the corpus is actually useful, and the model’s performance depends on that working set fitting comfortably in the context — not the full window, but the attended portion of it. Context rot (Chroma’s research) shows what happens when you violate the working-set assumption: input length increases, recall decreases, and the model’s effective working set is empirically much smaller than the advertised window. That paper tested 18 frontier models including Claude Opus 4, GPT-4.1, and Gemini 2.5; every one degraded as the window filled. NoLiMa (Modarressi et al., ICML 2025) puts a number on it: 11 of 13 evaluated models drop below 50% of their short-context score at 32k tokens. The lost-in-the-middle paper (Liu et al., 2023) — already a reference point in the reranking and RAG evaluation articles — established the positional version of the effect: tokens in the middle of the prompt get attended to far less than tokens at the ends.

The operational consequence is the same one operating systems learned in the 1960s: a process that touches more pages than will fit in its working set thrashes. An LLM call that loads more context than fits in its attended working set hallucinates, ignores instructions, and starts retrieving from the middle of its training distribution instead of from the prompt. A bigger window did not retire the working-set problem; it just raised the ceiling.

Mechanics of AOT context assembly

AOT is the default and the simpler of the two. The pipeline is the cascade we built across the retrieval, hybrid search, reranking, and query transformation articles, terminated by an assembly step:

1. Run the retrieval cascade against the user’s query.
2. Take the top-K reranked chunks (typically 5–25; measured against the eval set, not guessed).
3. Concatenate into a prompt template: instructions, retrieved context, user question.
4. Call the model. Single round-trip.
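The top-K cutoff in the second step can also be expressed as a token budget rather than a fixed count. The following is a minimal sketch, not a definitive implementation: `estimate_tokens` and `pack_under_budget` are hypothetical helper names, and the four-characters-per-token estimate is an assumption standing in for a real tokenizer.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token.
    # Swap in a real tokenizer for anything that affects billing or limits.
    return max(1, len(text) // 4)


def pack_under_budget(chunks: list[dict], budget_tokens: int) -> list[dict]:
    """Keep reranker-sorted chunks, best first, until the budget is spent."""
    packed, used = [], 0
    for chunk in chunks:  # assumed order: [best, ..., worst]
        cost = estimate_tokens(chunk["text"])
        if used + cost > budget_tokens:
            break  # stop at the first chunk that does not fit
        packed.append(chunk)
        used += cost
    return packed


chunks = [
    {"id": "a", "text": "x" * 400},  # ~100 tokens under the heuristic
    {"id": "b", "text": "y" * 400},
    {"id": "c", "text": "z" * 400},
]
kept = pack_under_budget(chunks, budget_tokens=250)
print([c["id"] for c in kept])  # ['a', 'b']
```

Stopping at the first chunk that does not fit keeps the packed set a strict prefix of the reranker's order, so the budget never promotes a weaker chunk over a stronger one.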

Two assembly decisions matter more than people credit:

Order. The lost-in-the-middle effect is positional, so put the highest-ranked chunks at the very top and very bottom of the context block, with weaker ones in the middle. This is a free win that costs nothing to implement and is consistently reproducible across models.

Format. Prose paragraphs are the worst format for facts the model needs to use deterministically. A structured payload — JSON, YAML, or even a markdown table — outperforms prose for things like “the user’s plan is enterprise,” “the dashboard refresh interval defaults to 30 seconds,” “tickets in state pending_customer are not counted in SLA.” Reserve prose for the corpus passages where the source format is prose; switch to structure for system facts and tool results. The reason is mechanical: structured fields land in repeatable positions and the model’s attention learns to fetch them by key rather than scanning for them. The mirror image on the output side is structured output — schema-constrained decoding for the model’s reply, so downstream code receives typed objects instead of prose.

Mechanics of JIT context retrieval

JIT inverts the control flow. Instead of pre-packing the answer’s raw material, you equip the model with retrieval tools and let an agent loop drive the lookups. The mechanics:

1. The model receives a thin context: instructions, available tool schemas, the user’s question, no corpus material.
2. The model emits a tool call: search_docs(query="dashboard refresh interval").
3. The harness executes the tool, returns the result into the conversation.
4. The model decides whether it has enough to answer, or emits another tool call.
5. Loop until the model produces a final answer or a step limit is hit.
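The loop above can be sketched from the harness side without any SDK. This is an illustration under stated assumptions: the "model" is a plain Python callable that returns either a tool request or an answer, which is a stand-in for a real LLM call and not any provider's actual response shape. The names `jit_loop` and `scripted_model` are hypothetical.

```python
from typing import Callable

# Stand-in for an LLM: maps the message history to either
# {"tool": name, "args": {...}} or {"answer": text}.
Model = Callable[[list[dict]], dict]


def jit_loop(model: Model, tools: dict[str, Callable[..., str]],
             question: str, max_steps: int = 8) -> str:
    messages = [{"role": "user", "content": question}]  # thin context
    for _ in range(max_steps):
        decision = model(messages)
        if "answer" in decision:
            return decision["answer"]  # the model decided it has enough
        # Execute the requested tool and append the result to the history.
        result = tools[decision["tool"]](**decision["args"])
        messages.append({"role": "tool", "name": decision["tool"],
                         "content": result})
    return "step limit reached"


def scripted_model(messages: list[dict]) -> dict:
    # Toy policy: search once, then answer from the tool result.
    tool_results = [m for m in messages if m["role"] == "tool"]
    if not tool_results:
        return {"tool": "search_docs",
                "args": {"query": "dashboard refresh interval"}}
    return {"answer": f"Per the docs: {tool_results[-1]['content']}"}


tools = {"search_docs": lambda query: "refresh interval defaults to 30 seconds"}
print(jit_loop(scripted_model, tools, "How often does the dashboard refresh?"))
# Per the docs: refresh interval defaults to 30 seconds
```

The step cap lives in the harness, not in the model's instructions, so a model that never stops asking for tools still terminates.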

The cost shape is different. AOT pays K chunks of context per call, every call; JIT pays only the chunks the model actually fetched, but pays N model calls instead of 1. On a question the model answers in two tool hops, JIT loads roughly 2/K of the AOT context (assuming one chunk per hop) but costs ~3× the model invocations: two tool-calling turns plus the final answer. The break-even depends on average hops per task, your per-call latency budget, and the cost of the tokens you’d otherwise pre-pack. As tasks get longer and more variable (the agent regime), JIT wins by a widening margin; on a tight single-turn QA endpoint, AOT usually still wins.
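A back-of-envelope model makes the break-even concrete. This is a sketch under explicit assumptions that are not from the article: each hop fetches exactly one chunk, the growing conversation is re-sent in full on every call, and no prompt caching is applied. The function names and the token sizes are illustrative.

```python
def aot_input_tokens(base: int, chunk: int, k: int) -> int:
    """One call: instructions plus K prepacked chunks."""
    return base + k * chunk


def jit_input_tokens(base: int, chunk: int, hops: int) -> int:
    """hops + 1 calls; call i re-sends the base plus the i chunks fetched so far."""
    return sum(base + i * chunk for i in range(hops + 1))


# Illustrative sizes (assumptions): 500-token base, 400-token chunks, K = 10.
print(aot_input_tokens(base=500, chunk=400, k=10))   # 4500
print(jit_input_tokens(base=500, chunk=400, hops=2))  # 2700
```

Under these assumptions the two-hop JIT run bills fewer input tokens than the ten-chunk AOT prompt while making three calls instead of one. Rerunning `jit_input_tokens` with a larger `hops` value shows how quickly re-sent history moves the break-even.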

The other thing JIT buys is demand-driven exploration: the model can change its mind about what it needs based on what it found. A pre-packed prompt that includes the wrong 10 chunks is stuck with them; a tool-driven loop can re-query, decompose, or abandon a line of inquiry the way the decomposition pattern in query transformations does — except now the decomposition is implicit in the loop rather than a separate up-front step.

Code: AOT in Python with the Anthropic SDK

The AOT version is the shape every RAG tutorial shows. Install: pip install anthropic. The interesting decisions are the assembly: ordering, structured framing for deterministic facts, prose only where the source is prose.

python

import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM = """You are a support assistant. Answer ONLY from the
<context> block. If the answer is not in the context, say so."""


def edge_order(chunks: list[dict]) -> list[dict]:
    """Place the strongest chunks at the edges, the weakest in the middle.

    `chunks` is already reranker-sorted: [best, ..., worst]. Even ranks fill
    the top going down; odd ranks fill the bottom going up.
    """
    seen, deduped = set(), []
    for c in chunks:  # dedupe first so reordering never drops a unique chunk
        if c["id"] not in seen:
            seen.add(c["id"])
            deduped.append(c)
    # e.g. ranks [0, 1, 2, 3, 4] -> [0, 2, 4, 3, 1]
    return deduped[0::2] + deduped[1::2][::-1]


def aot_answer(question: str, chunks: list[dict], facts: dict) -> str:
    context_block = "\n\n".join(
        f"<doc id={c['id']} source={c['source']!r}>{c['text']}</doc>"
        for c in edge_order(chunks)
    )

    # Structured facts go as JSON, not prose: deterministic key-based recall.
    # json.dumps emits real JSON; formatting the dict directly would emit
    # Python repr (single quotes, True/None), which is not JSON.
    user_block = (
        f"<facts>{json.dumps(facts, sort_keys=True)}</facts>\n"
        f"<context>\n{context_block}\n</context>\n"
        f"<question>{question}</question>"
    )

    msg = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[
            {"type": "text", "text": SYSTEM,
             "cache_control": {"type": "ephemeral"}},
        ],
        messages=[{"role": "user", "content": user_block}],
    )
    return msg.content[0].text

Two things worth flagging. The system prompt is wrapped with cache_control so the prefix cache can hit on every call with the same instructions; see Anthropic’s prompt-caching docs for the cache mechanics, including the minimum prefix length below which nothing is cached. We’ll do a full article on it in the Generation Control subtree. The <facts> block is JSON-serialized intentionally: the model attends to it by key the way it would attend to a tool result, rather than parsing it out of a paragraph.

Code: JIT in TypeScript with the Vercel AI SDK

The JIT version reads as an agent loop. The model gets two tools — search and fetch — and the runtime drives the loop until the model emits a final answer or the step cap fires. The Vercel AI SDK’s generateText with tools and stopWhen handles the loop natively. Install: npm install ai @ai-sdk/anthropic zod.

typescript

import { anthropic } from "@ai-sdk/anthropic";
import { generateText, stepCountIs, tool } from "ai";
import { z } from "zod";

type Hit = { id: string; source: string; snippet: string };
declare function searchIndex(q: string, k: number): Promise<Hit[]>;
declare function fetchChunk(id: string): Promise<string>;

const tools = {
  search_docs: tool({
    description: "Hybrid search over the support corpus. Returns id + snippet.",
    inputSchema: z.object({
      query: z.string(),
      top_k: z.number().int().min(1).max(10).default(5),
    }),
    execute: async ({ query, top_k }) => searchIndex(query, top_k),
  }),
  fetch_chunk: tool({
    description: "Return the full text of a chunk by id.",
    inputSchema: z.object({ id: z.string() }),
    execute: async ({ id }) => fetchChunk(id),
  }),
};

export async function jitAnswer(question: string): Promise<string> {
  const { text, steps } = await generateText({
    model: anthropic("claude-sonnet-4-6"),
    system:
      "You are a support assistant. Use search_docs to find candidates, " +
      "then fetch_chunk only on candidates you will actually cite. " +
      "Answer only from fetched chunks. Stop as soon as you can answer.",
    tools,
    stopWhen: stepCountIs(8),
    prompt: question,
  });
  console.log(`steps=${steps.length}`);
  return text;
}

Three points to notice. First, the tool descriptions are themselves context: the model reads them on every call, so they need to be terse and unambiguous. Second, stopWhen: stepCountIs(8) is the budget: an unbounded JIT loop is a fork bomb waiting to happen, and a step cap is the cheapest first line of defense (we’ll go deep on agent budgets later in the curriculum). Third, the system prompt explicitly steers the search-then-fetch pattern: search returns cheap snippets, fetch pulls the full chunk only for the candidates the model is going to cite. That is JIT done well: the model doesn’t pull the whole corpus into context, it pulls exactly the snippets and the few full chunks it needs.

The hybrid that wins in production

Pure AOT and pure JIT are both straw versions. The dominant production shape is AOT for the predictable spine, JIT for the long tail. The spine is the small set of high-signal tokens that essentially every task in your domain needs: a system prompt describing the assistant’s role, a user profile or workspace metadata, the schema of the current task, maybe 3–5 universally relevant chunks from a small “always-on” cache. That spine is small, stable, and cacheable — the model attends to it reliably because it lands in the same positions every call and the prefix cache makes it nearly free.

Around the spine, JIT tools handle the long tail: the agent loop fetches what the current task actually needs, while the spine carries what every task needs. This composition gets you the latency and reliability of AOT for the common path with the recall and adaptability of JIT for the tail, and it keeps the attended working set small enough that context rot doesn’t bite.

Trade-offs, failure modes, gotchas

Bigger windows don’t make context engineering optional; they raise the stakes. The 1M-token Sonnet 4.6 context window is impressive and you should not fill it. Every long-context benchmark cited above (NoLiMa, Chroma’s context-rot study, lost-in-the-middle) shows performance falling off long before the hard limit. The effective working set is in the low tens of thousands of tokens for most tasks, not the advertised window.

AOT’s silent failure mode is attention dilution. A prompt with 50 marginally relevant chunks scores worse than the same prompt with 8 carefully chosen ones, even though the second prompt is missing 42 candidates the first one had. Recall doesn’t translate to answer quality past a point; precision in what you load is the underrated lever.

JIT’s silent failure mode is tool-loop drift. Models can over-search, re-query in circles, or fetch chunks they don’t end up using. All three behaviors cost tokens and inflate latency. Mitigations: tight step caps, an explicit “no-progress” detector (two identical tool calls in a row = abort), and judging the model’s tool-use efficiency as a first-class eval metric alongside answer quality.
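The “two identical tool calls in a row” rule is small enough to sketch directly. This is one possible shape, assuming each tool call is recorded as a JSON-serializable dict with `name` and `args` keys; that record format and the name `is_stalled` are assumptions for illustration.

```python
import json


def _call_key(call: dict) -> str:
    # Sorted keys make the comparison independent of argument order.
    return json.dumps(call, sort_keys=True)


def is_stalled(tool_calls: list[dict]) -> bool:
    """True when the two most recent tool calls are identical."""
    if len(tool_calls) < 2:
        return False
    return _call_key(tool_calls[-1]) == _call_key(tool_calls[-2])


history = [
    {"name": "search_docs", "args": {"query": "refresh interval", "top_k": 5}},
    {"name": "search_docs", "args": {"top_k": 5, "query": "refresh interval"}},
]
print(is_stalled(history))  # True
```

A harness would call `is_stalled` after recording each tool call and abort the loop when it returns True. Exact-match detection misses near-duplicate queries, so it complements a step cap and does not replace one.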

The KV/prompt cache rewards AOT structure. If the first half of your context is identical across 95% of your traffic, prompt caching on Claude/OpenAI/Gemini will drop your bill on that prefix to a tiny fraction of the uncached rate. JIT tool results arrive at unpredictable positions and don’t cache. Structure your prompt with the stable parts first (system, role, schemas) and the volatile parts (retrieved chunks, user message) at the end; this single rearrangement can move cache hit rate by 20+ points.
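The stable-first ordering can be checked mechanically. The sketch below assembles a prompt with the stable segments ahead of the volatile ones and measures how much of the prefix two different requests share. `build_prompt` and `shared_prefix_len` are hypothetical helpers, and the character-level comparison is only a proxy: real prompt caches match on provider-specific token prefixes.

```python
import os


def build_prompt(system: str, tool_schemas: str,
                 chunks: list[str], user_message: str) -> str:
    stable = [system, tool_schemas]    # identical across requests
    volatile = chunks + [user_message]  # changes on every request
    return "\n\n".join(stable + volatile)


def shared_prefix_len(a: str, b: str) -> int:
    # os.path.commonprefix compares character by character on any strings.
    return len(os.path.commonprefix([a, b]))


SYSTEM_TEXT = "You are a support assistant."
SCHEMAS = '{"tools": ["search_docs", "fetch_chunk"]}'

p1 = build_prompt(SYSTEM_TEXT, SCHEMAS, ["billing policy text"], "Why was I charged twice?")
p2 = build_prompt(SYSTEM_TEXT, SCHEMAS, ["dashboard docs text"], "How often does it refresh?")

stable_len = len(SYSTEM_TEXT + "\n\n" + SCHEMAS)
print(shared_prefix_len(p1, p2) >= stable_len)  # True
```

If the volatile segments were placed first, the shared prefix would collapse to whatever the two requests happen to have in common at the start.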

Structured payloads beat prose for facts the model needs to use deterministically. “The plan is enterprise” inside a JSON <facts> block is more reliably retrieved than the same sentence buried in a paragraph. The reason is positional: structured fields are at fixed offsets and the model’s attention learns to fetch by key, while prose facts get the same lost-in-the-middle treatment as everything else.

Conversation history is the most-ignored context. Multi-turn agents accumulate tool calls, observations, and assistant messages that quickly dominate the window. The naive “include every prior turn” policy is what produces the conversation-grows-until-the-API-rejects-it failure mode. Truncation, summarization, and compaction are the JIT analogues for message history; we’ll spend several articles on them in the Memory subtree.
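The simplest of those policies, truncation to a token budget, can be sketched as follows. The message shape (dicts with `role` and `content`), the name `trim_history`, and the four-characters-per-token estimate are all assumptions for illustration, not any SDK's API.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token.
    return max(1, len(text) // 4)


def trim_history(messages: list[dict], budget_tokens: int) -> list[dict]:
    """Keep system messages plus the most recent turns that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(estimate_tokens(m["content"]) for m in system)
    kept = []
    for m in reversed(rest):  # walk newest to oldest
        cost = estimate_tokens(m["content"])
        if used + cost > budget_tokens:
            break
        kept.append(m)
        used += cost
    return system + kept[::-1]  # restore chronological order


history = [
    {"role": "system", "content": "s" * 40},     # ~10 tokens
    {"role": "user", "content": "a" * 400},      # ~100 tokens
    {"role": "assistant", "content": "b" * 400},
    {"role": "user", "content": "c" * 400},
]
trimmed = trim_history(history, budget_tokens=250)
print([m["role"] for m in trimmed])  # ['system', 'assistant', 'user']
```

A production version would also need to keep each tool call together with its result, since dropping one half of the pair can make the remaining history invalid for the API.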

Eval before, eval after. Every change to the assembly logic — adding a new structured field, reordering chunks, increasing the JIT step cap — should run against the RAG evaluation harness. Context engineering changes are easy to ship and hard to feel in dev; the eval set is the only honest signal.
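A before/after comparison can be reduced to a small gate that lists which metrics regressed. This is a sketch under assumptions: the metric names, the scores, and the 0.01 tolerance are made-up illustrative values, and `regressions` is a hypothetical helper, not part of any evaluation harness.

```python
def regressions(before: dict[str, float], after: dict[str, float],
                tolerance: float = 0.01) -> dict[str, float]:
    """Return metrics that dropped by more than `tolerance`, with their delta."""
    return {
        name: round(after[name] - before[name], 4)
        for name in before
        if name in after and before[name] - after[name] > tolerance
    }


# Illustrative scores only (not measured results).
before = {"faithfulness": 0.91, "multi_hop_accuracy": 0.74}
after = {"faithfulness": 0.88, "multi_hop_accuracy": 0.75}
print(regressions(before, after))  # {'faithfulness': -0.03}
```

An empty result means the change is safe to ship on the metrics tracked; a non-empty one names the metric to investigate before merging.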