Short-Term Memory: Managing the Conversation Buffer
Truncation policies for the LLM conversation buffer: sliding windows, token-level vs message-level eviction, system-prompt protection, headroom budgeting.
A coding agent’s session crosses turn 60 and the next model call returns context_length_exceeded. The harness catches the error, drops the first 30 messages, retries, and the call succeeds — except the dropped messages contained the system prompt and the user’s original task description. The model now happily continues whatever it remembers from the last 30 turns, which means it answers the question on screen but has forgotten why the user started this session in the first place. The agent’s behavior visibly drifts. No exception is raised. No alert fires. This is the failure mode short-term memory exists to prevent: the conversation buffer is the only thing standing between “the model has working memory” and “the model has amnesia at every API boundary,” and getting its eviction policy wrong is the single most common production bug in agent harnesses.
Opening bridge
The last two pieces drew the memory stack map and then went deep on the four cognitive memory types. Both flagged short-term memory — the conversation buffer, the in-context tier — as the layer that the rest of the stack hangs off. The reason is mechanical: every byte that reaches the model travels through this buffer. The episodic store can have a million entries; if your buffer eviction drops the message that triggered the retrieval, the retrieval was for nothing. The semantic store can pin the user’s name; if your buffer truncation cuts the system prompt that injects the pinned facts, the model never sees them. Short-term memory is the gatekeeper tier — the one that decides what crosses the API boundary on the next call — and that’s why it gets its own article before the more glamorous long-term layers.
Definition
Short-term memory is the deliberately bounded, in-context message list that the harness assembles and sends to the model on every turn. Three properties make it short-term-specific. First, it lives across turns within a single logical session but evaporates across sessions by default. Second, its size is bounded by something — token count, message count, or a budget the harness chose — and the harness has to make explicit eviction decisions when the next turn would breach the bound. Third, its contents are the prompt the model actually sees, which means every eviction decision is also an attention decision: the model can only reason about bytes that survived the eviction pass.
What short-term memory is not, in this taxonomy: it is not the entire conversation history stored in your database. The full history is the log; short-term memory is the working set the harness reconstructs from the log for the next call. Conflating the two is the first design mistake. The log can grow unbounded; the working set must not.
Intuition
The mental model that pays off is a bounded ring buffer with an attention-weighted hot section in the middle. The system prompt is the head, pinned and immutable. The most recent few turns are the tail, freshest and most relevant. Everything in between is contested space — the model attends to it less reliably (the lost-in-the-middle effect we’ll get to), and it’s also the easiest material to evict without obvious quality loss. The harness’s job each turn is to slide that ring forward by N tokens, decide which slots to overwrite, and keep at least max_tokens_for_reply of headroom so the model has room to actually respond.
The cleanest analogy from systems is the TCP receive buffer with explicit headroom reservation. The receiver advertises a window size that includes a deliberate gap between current buffer occupancy and the actual buffer limit; without that gap, the next packet wedges the buffer at zero free space and the connection stalls. A conversation buffer that fills to the model’s stated context limit will fail the next call, because the model needs free space to emit tokens. Always reserve headroom; treat the context window as total - headroom, not as total.
The distributed-systems parallel
Three honest parallels worth naming explicitly.
LRU cache replacement is the canonical eviction policy and it’s also wrong as a default for conversations. LRU treats the buffer as a hot-cold gradient where the oldest entry is the coldest. Conversations don’t have that shape — the first user message (which sets the task) is older than every subsequent turn but often more important than any of them. A plain LRU eviction silently deletes the task description first; that’s the bug in the hook. The conversations equivalent of “pin the page” is the system-prompt-and-task-anchor protection rule we’ll formalize below: certain entries are never evictable regardless of age. Once that rule is in place, LRU works fine for the rest.
Headroom-budgeting is the same problem as the JVM’s -Xmx minus Survivor + Eden reservation, or the database connection-pool “leave N for emergency queries” pattern. The pattern: never let the working set use 100% of available capacity, because the act of producing the next unit of work itself requires capacity, and a buffer at 100% utilization deadlocks. The Anthropic SDK’s server-side context editing calls this the trigger threshold; the OpenAI Agents SDK’s OpenAIResponsesCompactionSession exposes a similar should_trigger_compaction hook. Naming notwithstanding, it’s the same trick: maintain a watermark below the hard ceiling, evict on watermark crossing, leave the gap.
The conversation buffer is a log compaction problem, not a queue problem. Queues are FIFO; you drop from the head. The conversation buffer drops from the middle, preserving both the head (the task) and the tail (the recent turns). The closest match in databases is Kafka log compaction: the log keeps the latest value per key, drops older versions of the same key, and never drops the schema header. The conversation parallel is exact — you keep the task anchor, the latest tool result per tool, the recent N user/assistant turns, and you compact out the older redundant intermediates. The detailed compaction subtree article (coming later in this Memory subtree) will return to this; for short-term memory the relevant takeaway is that “evict the oldest” is the wrong primitive — “evict the least load-bearing middle entries” is the right one.
The mechanics: what’s in the buffer
Every turn, the harness reconstructs a message array of the shape every modern Chat-Completions-style API accepts:
- System prompt(s). One or more, but always at index 0. Sets behavior, injects pinned semantic facts (from the semantic memory tier), declares tools, declares output schemas.
- Task anchor. The first user message of the session — the one that defines what the agent is trying to accomplish. Treated as protected in any non-trivial harness.
- Retrieved context. RAG hits, memory-store retrievals, tool definitions injected just-in-time. Lives near the system prompt or just before the latest user turn, depending on the JIT-vs-AOT context engineering policy.
- Conversation tail. The most recent N user/assistant turns and any tool-call/tool-result pairs they generated.
- The current user turn. Always at the end. Always protected.
Eviction operates on the middle: the conversation between the task anchor and the recent tail. Within that middle, the harness chooses what to drop. Below are the policies that actually ship.
Policy 1: Hard sliding window
Keep only the last K messages. Drop the oldest. Linear, simple, and the default in nearly every framework’s getting-started tutorial.
| |
When this is right. Single-purpose assistants with tight token budgets where the next turn is overwhelmingly the most important context. Customer-support bots that handle one issue per session. Single-prompt chat UIs.
When this is wrong. Any session where the original user message matters past turn K. Coding agents (the task description matters at turn 200). Multi-step research agents (the original question matters even when the current step is debugging the JSON parser). The fix is the next policy.
Policy 2: Head-tail (task-anchored sliding window)
Keep the system prompt, keep the first user message, keep the last K messages. Drop the middle.
| |
When this is right. Any agent whose system prompt is the single most important context (almost all of them) and whose task description is the second most important context (most of them). This is the policy LangGraph’s trim_messages with include_system=True and strategy="last" defaults to.
The subtle bug. Hard cuts at the boundary between “first message” and “last K messages” can break message-pair integrity — you can end up with a tool result for a tool call that’s no longer in the buffer. Every production buffer manager must respect the tool-call/tool-result pairing invariant: if you drop a tool call, drop its result; if you drop a result, drop its call. Otherwise the model sees orphan messages and behavior gets weird.
Policy 3: Token-budget eviction
Same head-tail shape, but measured in tokens instead of messages. Count tokens for each message; keep adding from the tail backwards until you hit your budget, then add the head.
| |
Why this beats message-count. Messages have wildly variable sizes. A tool result returning 50KB of JSON is one message but eats more tokens than 200 short chat turns. Message-count eviction silently lets one fat tool result push the system prompt out of token budget; token-count eviction sees the size and evicts the fat result. This is the policy every production-grade harness ends up at.
How to count. Use the official tokenizer where available. Anthropic exposes a free count_tokens endpoint that accepts the exact message payload and returns the exact token count Anthropic will charge for. OpenAI ships tiktoken. Approximating with len(text) // 4 is fine for budget estimates but unsafe for hard cutoffs — overshoot the limit by 100 tokens because your estimate was off and the call fails. The cost of one tokenizer round-trip is dwarfed by the cost of one failed request.
Policy 4: Salience-based eviction
Instead of evicting the middle uniformly, drop messages tagged as low-importance first. Importance can come from a write-time classifier (“this is a routine clarification”) or from a learned salience score (lifted from the episodic-memory literature; the Generative Agents importance score is the canonical formulation). The classifier returns a number per message; the eviction pass drops the lowest-scored messages first until budget fits.
When this pays off. Long-running agents (research, coding, multi-day workflows) where some intermediates are load-bearing and others are scratch. Pays the cost of a per-message importance call to get back token budget at the eviction step.
When this hurts. Short sessions where the classifier cost exceeds the value. A salience pass that calls the model is 100ms+ per message; for a 5-turn customer-support session you’ve spent more than the budget you saved.
Policy 5: Summarization-and-replace (preemptive compaction)
When token usage crosses a watermark (say 60% of the model’s context window), pause, summarize the oldest M messages into a single “summary” message, and replace them. The buffer keeps the system prompt + summary + recent tail. Repeat on every watermark crossing.
| |
This is a summarization policy; the deep version belongs in the later context-compression article in this Memory subtree. For short-term memory the relevant point: summarization is the more aggressive sibling of plain truncation. It preserves more semantic content per token but burns model calls to do so, and the summary itself is a lossy compression that you cannot undo. Run it when you have an actual reason to compress (long sessions, tight model budget); skip it when you don’t (short sessions where simple truncation works).
The frontier frameworks have all converged on this pattern as the production default. The OpenAI Agents SDK ships OpenAIResponsesCompactionSession as a decorator over any session backend, with a should_trigger_compaction hook that fires on a budget threshold. Anthropic exposes both server-side context editing with a trigger parameter (the model handles eviction transparently with the clear_tool_uses strategy) and an SDK-side compaction feature that summarizes the buffer when token usage hits the trigger. LangGraph composes the same idea with trim_messages plus a summarize_node you add to your graph. Different APIs, same algorithm.
Headroom budgeting: the rule that’s never optional
The single most violated rule in real harnesses: never fill the context window to its stated limit, because the model has to output tokens too. A Claude Opus 4.7 1M-token window does not mean you can pack 1M input tokens; it means input + output must total ≤ 1M. If you pack 999K of input, the model has 1K to respond with, and any response longer than that gets truncated mid-token, often in the middle of a tool-call JSON, which breaks downstream parsing.
The defensible formula:
| |
For a 200K-token Claude Sonnet 4.6 window with max_tokens=4096, a sane setup is:
max_reply_tokens= 4096 (matchesmax_tokens)safety_headroom= 2048 (~1% buffer for estimation drift)tool_result_headroom= 8192 (if this turn might call tools)- →
input_budget= 200,000 - 4,096 - 2,048 - 8,192 = 185,664 tokens
When the buffer crosses input_budget, evict or compact. When it crosses the hard limit, the call fails. The watermark is the place to act; the limit is the place to panic.
The lost-in-the-middle effect: why “drop the middle” is dangerous beyond a point
A 2023 paper from Nelson Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang — Lost in the Middle: How Language Models Use Long Contexts — quantified what most engineers had a gut feel for: model accuracy on retrieval-from-context tasks follows a U-shaped curve over the position of the relevant content. Performance is highest when the relevant content is at the beginning (primacy bias) or end (recency bias) of the prompt, and degrades by 20–30%+ when the relevant content is in the middle. The follow-up work Found in the Middle attributes the curve to positional-encoding decay in RoPE-based architectures, which is most modern LLMs.
Two design implications for short-term memory.
First, putting the recently-evicted summary in the middle is consistent with the model’s natural attention pattern — the head and tail get the most attention anyway, so the compressed middle isn’t a privileged slot. That’s the good news; it means head-tail eviction with a middle summary is well-aligned with how the model reads.
Second, the more you stuff into the middle, the less reliably the model reads it. Some teams “compress” by leaving lots of low-importance middle material in and trusting the model to filter. The model does not reliably filter. If you wouldn’t bet the agent’s behavior on a particular middle message being read, evict it. The middle is not a free zone; it’s an attention-degraded zone.
Code: a token-budget head-tail buffer in Python
The smallest production-shaped buffer manager. Uses the Anthropic SDK for both the model call and (importantly) the official count_tokens endpoint so the budget math matches what the model actually charges. Install: pip install anthropic.
| |
Four things worth pointing out. First, the system prompt is the API’s system parameter, not message index 0 — both Anthropic and OpenAI separate it out, so eviction logic never has to worry about accidentally dropping it. Second, count_tokens is the authoritative source of truth — using a local estimator (len(text) / 4) is fine for development but a recipe for production context_length_exceeded errors. Third, the task anchor is the first user message and is added last, not first — if it doesn’t fit, it falls back to a short summary line rather than evicting recent turns; “preserve the anchor” must not silently become “preserve nothing else.” Fourth, tool_use/tool_result pairs travel atomically — never break a pair, the model crashes hard on orphan tool messages. This is the same invariant Anthropic’s server-side clear_tool_uses strategy maintains.
The code is deliberately framework-light. A real LangGraph deployment would lean on trim_messages; an Agents SDK deployment would use OpenAIResponsesCompactionSession and let the runner handle the watermark. The principles are the same: a budget, a watermark, head-tail preservation, atomic pairs, authoritative token counting.
Code: the same buffer in TypeScript with LangGraph
The TypeScript version delegates the buffer math to LangGraph and trimMessages from @langchain/core. Install: npm install @langchain/langgraph @langchain/anthropic @langchain/core.
| |
The framework version is shorter because trimMessages already encodes the policies — strategy: "last" is the tail-preserving sliding window, includeSystem: true is the system-prompt protection rule, startOn: "human" is the tool-pair invariant, allowPartial: false is the “never split a message” guarantee. getNumTokensFromMessages delegates to the bound model’s tokenizer (Anthropic’s, in this case) for authoritative counts. A real production setup would add a summarize_node that fires when messages.length or token-count crosses a higher watermark; we’ll work that pattern in the context-compression article later in the Memory subtree.
Trade-offs, failure modes, and gotchas
The orphan-tool-message bug. The most common production failure I see in unaudited harnesses. The buffer evicts a tool_use message but keeps its tool_result, or vice versa. The next API call returns a 400 with a message that looks like an SDK bug but is a buffer bug. Every eviction policy must treat tool-call/tool-result as atomic pairs; LangGraph’s trimMessages does this with startOn: "human"; OpenAI Agents SDK’s compaction wrappers do it implicitly; a hand-rolled harness has to do it explicitly. Audit your harness specifically for this when adding tool use.
The system-prompt-truncation bug. A close second. Some frameworks let you put the system prompt inside the message list (index 0, role "system"); a naive head-tail trim drops it as “an old message.” Always put the system content in the API’s dedicated system field (Anthropic) or use a framework setting that pins it (LangChain’s includeSystem=True). The hook-anecdote bug is this one.
The token-estimator drift bug. The harness uses len(text) // 4 to count tokens and the actual token count is 4.7×. Budget says 180K tokens fit; actual count is 192K; call fails. The fix is the official tokenizer for the model you’re using — Anthropic’s count_tokens API endpoint, OpenAI’s tiktoken, Google’s count_tokens on the Gemini SDK. The local estimator is fine for development and for triggering re-evaluation; the authoritative count is what you make the final eviction decision on.
The retry-doesn’t-retry bug. When a call fails with context_length_exceeded, the harness catches the error and retries with the same buffer — the eviction logic ran before the call, the call failed, and the retry hits the same eviction. The fix is to retry with a tighter budget — drop another 10% of token budget on retry, evict more aggressively, then try again. Single-shot eviction with a hard ceiling is brittle; budget-step-down on retry is robust.
The “I’ll just use the 1M-token window” anti-pattern. Anthropic’s 1M-token context window for Claude Opus 4.7 and Sonnet 4.6 is real and works, but it does not eliminate the need for short-term memory management. Three reasons. First, cost scales with input tokens — running every call at 800K input tokens is 200× more expensive than running it at 4K, for whatever marginal benefit. Second, latency scales with input tokens — prefill time for a 1M-token prompt is measured in seconds. Third, lost-in-the-middle still applies at 1M tokens — accuracy on retrieval-from-context tasks degrades for middle positions whether the window is 8K or 1M. A larger window is a more forgiving working set, not a reason to skip eviction.
Watermarks above 100% are a logic error. Setting trigger=200_000 on a 200K-token model means the compaction never fires until the call itself fails. Always set the watermark below the input budget (60–80% is typical). The point of the watermark is to evict before failure, not at the moment of failure.
Session-end is the natural compaction boundary, not turn-end. Compacting after every assistant turn burns model calls for no behavioral benefit; the buffer is already in a sane state. Compaction earns its keep when the buffer grows past the watermark, not on a fixed cadence. The OpenAI Agents SDK’s should_trigger_compaction hook lets you encode this; LangGraph’s pattern is “run trim_messages every invoke (cheap), run summarization-node only when state size crosses threshold (expensive).” Pay the cheap pass always; pay the expensive pass only when it matters.
Multi-session continuity is not short-term memory. If you want the assistant to remember user-preferences from yesterday’s session, that’s a job for the long-term memory tier — semantic or episodic, depending on the use case — and you read from it at turn-start and inject into the system prompt. Trying to do that with short-term memory by “keeping the buffer alive across sessions” is the most expensive way to solve a problem the memory stack has a cheaper layer for. Short-term is short.
Further reading
- Lost in the Middle: How Language Models Use Long Contexts (Liu et al., 2023) — the paper that quantified the U-shaped attention curve. Read §3 (multi-document QA results) and §4 (key-value retrieval results) for the numbers that should inform every “drop the middle” eviction decision. The follow-up Found in the Middle (2024) attributes the curve to RoPE positional encoding decay.
- Anthropic — Context editing — the official docs for server-side eviction, including the
clear_tool_uses_20250919strategy that handles tool-pair atomicity automatically and thetrigger/keep/clear_at_leastknobs. The cleanest mental model of how a server-side compaction actually works in production. - OpenAI Cookbook — Short-Term Memory Management with Sessions — a walkthrough of the Agents SDK’s session model, including
OpenAIResponsesCompactionSessionand theshould_trigger_compactionhook. The Cookbook is the most current entry point because it tracks SDK changes faster than the long-form docs. - LangGraph — Add memory — the LangGraph view of short-term memory: checkpointers as the persistence layer,
trim_messagesas the eviction policy, summarize-node as the compaction layer. The pattern-language is opinionated but the patterns are good.
What to read next
- Working Memory: Scratchpads, Blackboards, and Agent Notebooks — the immediate sequel and the second half of the in-context tier. Where this article treats the conversation buffer as the implicit message-shaped IO log, the next one covers the explicit, structured scratchpad the harness maintains separately — typed state objects, external notebook tools, shared blackboards.
- The Cognitive Taxonomy: Semantic, Episodic, Procedural — the upstream piece. Short-term memory is the L1 cache of the four-tier hierarchy; this article specialized that role into a concrete buffer-management policy.
- The Memory Stack: A Map of AI Memory — the parent article. Short-term memory is the in-context tier; the storage and write-path tiers come next in the Memory subtree.
- Prompt Caching: Reusing the KV Cache Across Calls — the natural cost optimization once your buffer is stable. Cache-aware eviction (keep the cacheable prefix intact while evicting from the tail) is the next refinement; if your buffer manager destroys the prefix on every turn, prompt caching never fires.