jatin.blog ~ $
$ cat ai-engineering/short-term-memory.md

Short-Term Memory: Managing the Conversation Buffer

Truncation policies for the LLM conversation buffer: sliding windows, token-level vs message-level eviction, system-prompt protection, headroom budgeting.

Jatin Bansal@blog:~/ai-engineering$ open short-term-memory

A coding agent’s session crosses turn 60 and the next model call returns context_length_exceeded. The harness catches the error, drops the first 30 messages, retries, and the call succeeds. But the dropped messages contained the system prompt and the user’s original task description. The model now happily continues from whatever it sees in the last 30 turns, which means it answers the question on screen but has forgotten why the user started this session in the first place. The agent’s behavior visibly drifts. No exception is raised. No alert fires. This is the failure mode short-term memory exists to prevent: the conversation buffer is the only thing standing between “the model has working memory” and “the model has amnesia at every API boundary,” and getting its eviction policy wrong is one of the most common production bugs in agent harnesses.

Opening bridge

The last two pieces drew the memory stack map and then went deep on the four cognitive memory types. Both flagged short-term memory — the conversation buffer, the in-context tier — as the layer that the rest of the stack hangs off. The reason is mechanical: every byte that reaches the model travels through this buffer. The episodic store can have a million entries; if your buffer eviction drops the message that triggered the retrieval, the retrieval was for nothing. The semantic store can pin the user’s name; if your buffer truncation cuts the system prompt that injects the pinned facts, the model never sees them. Short-term memory is the gatekeeper tier — the one that decides what crosses the API boundary on the next call — and that’s why it gets its own article before the more glamorous long-term layers.

Definition

Short-term memory is the deliberately bounded, in-context message list that the harness assembles and sends to the model on every turn. Three properties make it short-term-specific. First, it lives across turns within a single logical session but evaporates across sessions by default. Second, its size is bounded by something — token count, message count, or a budget the harness chose — and the harness has to make explicit eviction decisions when the next turn would breach the bound. Third, its contents are the prompt the model actually sees, which means every eviction decision is also an attention decision: the model can only reason about bytes that survived the eviction pass.

What short-term memory is not, in this taxonomy: it is not the entire conversation history stored in your database. The full history is the log; short-term memory is the working set the harness reconstructs from the log for the next call. Conflating the two is the first design mistake. The log can grow unbounded; the working set must not.

Intuition

The mental model that pays off is a bounded ring buffer with attention-weighted hot sections at both ends and a colder stretch in the middle. The system prompt is the head, pinned and immutable. The most recent few turns are the tail, freshest and most relevant. Everything in between is contested space: the model attends to it less reliably (the lost-in-the-middle effect we’ll get to), and it’s also the easiest material to evict without obvious quality loss. The harness’s job each turn is to slide that ring forward by N tokens, decide which slots to overwrite, and keep at least max_tokens_for_reply of headroom so the model has room to actually respond.

The cleanest analogy from systems is the TCP receive buffer with explicit headroom reservation. The receiver advertises a window size that includes a deliberate gap between current buffer occupancy and the actual buffer limit; without that gap, the next packet wedges the buffer at zero free space and the connection stalls. A conversation buffer that fills to the model’s stated context limit will fail the next call, because the model needs free space to emit tokens. Always reserve headroom; treat the context window as total - headroom, not as total.

The distributed-systems parallel

Three honest parallels worth naming explicitly.

LRU cache replacement is the canonical eviction policy and it’s also wrong as a default for conversations. LRU treats the buffer as a hot-cold gradient where the oldest entry is the coldest. Conversations don’t have that shape — the first user message (which sets the task) is older than every subsequent turn but often more important than any of them. A plain LRU eviction silently deletes the task description first; that’s the bug in the hook. The conversations equivalent of “pin the page” is the system-prompt-and-task-anchor protection rule we’ll formalize below: certain entries are never evictable regardless of age. Once that rule is in place, LRU works fine for the rest.

Headroom budgeting is the same problem as keeping free heap below the JVM’s -Xmx ceiling so the next allocation has somewhere to land, or the database connection-pool “leave N for emergency queries” pattern. The pattern: never let the working set use 100% of available capacity, because the act of producing the next unit of work itself requires capacity, and a buffer at 100% utilization deadlocks. Anthropic’s server-side context editing calls this the trigger threshold; the OpenAI Agents SDK’s OpenAIResponsesCompactionSession exposes a similar should_trigger_compaction hook. Naming notwithstanding, it’s the same trick: maintain a watermark below the hard ceiling, evict on watermark crossing, leave the gap.

The conversation buffer is a log compaction problem, not a queue problem. Queues are FIFO; you drop from the head. The conversation buffer drops from the middle, preserving both the head (the task) and the tail (the recent turns). The closest match in databases is Kafka log compaction: the log keeps the latest value per key and drops older versions of the same key, so the most recent state of every key survives however old it is. The conversation parallel is close: you keep the task anchor, the latest tool result per tool, the recent N user/assistant turns, and you compact out the older redundant intermediates. The detailed compaction article (coming later in this Memory subtree) will return to this; for short-term memory the relevant takeaway is that “evict the oldest” is the wrong primitive and “evict the least load-bearing middle entries” is the right one.

The mechanics: what’s in the buffer

Every turn, the harness reconstructs a message array of the shape every modern Chat-Completions-style API accepts:

  1. System prompt(s). One or more, but always at index 0. Sets behavior, injects pinned semantic facts (from the semantic memory tier), declares tools, declares output schemas.
  2. Task anchor. The first user message of the session — the one that defines what the agent is trying to accomplish. Treated as protected in any non-trivial harness.
  3. Retrieved context. RAG hits, memory-store retrievals, tool definitions injected just-in-time. Lives near the system prompt or just before the latest user turn, depending on the JIT-vs-AOT context engineering policy.
  4. Conversation tail. The most recent N user/assistant turns and any tool-call/tool-result pairs they generated.
  5. The current user turn. Always at the end. Always protected.

Eviction operates on the middle: the conversation between the task anchor and the recent tail. Within that middle, the harness chooses what to drop. Below are the policies that actually ship.
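As a rough illustration of that layout, the sketch below assembles the regions in order and reports which index range is open to eviction. The function name `assemble_buffer`, its parameters, and the plain-dict message shape are assumptions made for this example, not part of any SDK.

```python
def assemble_buffer(system_prompts, task_anchor, retrieved, middle, tail, current_turn):
    """Lay out the buffer in the order described above.

    Illustrative only: every argument is a list of {"role", "content"} dicts
    except task_anchor and current_turn, which are single dicts.
    Returns the full message list plus the index range that may be evicted.
    """
    head = [*system_prompts, task_anchor, *retrieved]
    buffer = [*head, *middle, *tail, current_turn]
    evictable = range(len(head), len(head) + len(middle))
    return buffer, evictable


buffer, evictable = assemble_buffer(
    system_prompts=[{"role": "system", "content": "You are a coding agent."}],
    task_anchor={"role": "user", "content": "Build an RSS digest CLI."},
    retrieved=[],
    middle=[
        {"role": "assistant", "content": "Scaffolded the project."},
        {"role": "user", "content": "Use argparse, not click."},
    ],
    tail=[{"role": "assistant", "content": "Switched to argparse."}],
    current_turn={"role": "user", "content": "Now add the digest writer."},
)
print(list(evictable))  # prints [2, 3]: only the middle is contested
```

Keeping the evictable range explicit means every policy below can be written as a function over that slice, with the head, tail, and current turn out of reach by construction.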

Policy 1: Hard sliding window

Keep only the last K messages. Drop the oldest. Linear, simple, and the default in nearly every framework’s getting-started tutorial.

text
buffer = system_prompts + history[-K:]

When this is right. Single-purpose assistants with tight token budgets where the next turn is overwhelmingly the most important context. Customer-support bots that handle one issue per session. Single-prompt chat UIs.

When this is wrong. Any session where the original user message matters past turn K. Coding agents (the task description matters at turn 200). Multi-step research agents (the original question matters even when the current step is debugging the JSON parser). The fix is the next policy.

Policy 2: Head-tail (task-anchored sliding window)

Keep the system prompt, keep the first user message, keep the last K messages. Drop the middle.

text
buffer = system_prompts + history[:1] + history[-K:]

When this is right. Any agent whose system prompt is the single most important context (almost all of them) and whose task description is the second most important context (most of them). LangChain’s trim_messages with include_system=True and strategy="last" gives you the system-prompt half of this policy; it does not pin the first user message, so the task anchor has to be re-attached by the harness.

The subtle bug. Hard cuts at the boundary between “first message” and “last K messages” can break message-pair integrity: you can end up with a tool result for a tool call that’s no longer in the buffer. Every production buffer manager must respect the tool-call/tool-result pairing invariant: if you drop a tool call, drop its result; if you drop a result, drop its call. Otherwise the request carries orphan tool messages, which provider APIs typically reject with a 400.
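A minimal sketch of head-tail trimming with the pairing invariant enforced afterwards. It assumes Anthropic-style content blocks, where a tool_use block carries an `id` and the matching tool_result block carries a `tool_use_id`; the helper names (`head_tail`, `drop_orphans`, `_block_ids`) are hypothetical.

```python
def _block_ids(msg, block_type, key):
    """Collect IDs from typed content blocks; plain-string content has none."""
    content = msg.get("content")
    if not isinstance(content, list):
        return set()
    return {b[key] for b in content if b.get("type") == block_type}


def drop_orphans(messages):
    """Remove any message holding a tool_use or tool_result whose partner is gone.

    Simplification: the whole message is dropped, including any text blocks
    that sit next to the orphaned tool block.
    """
    calls = set().union(*(_block_ids(m, "tool_use", "id") for m in messages))
    results = set().union(*(_block_ids(m, "tool_result", "tool_use_id") for m in messages))
    paired = calls & results
    kept = []
    for m in messages:
        ids = _block_ids(m, "tool_use", "id") | _block_ids(m, "tool_result", "tool_use_id")
        if ids <= paired:  # no tool blocks, or every one still has its partner
            kept.append(m)
    return kept


def head_tail(history, k):
    """Keep the first message (task anchor) and the last k, then repair pairs."""
    if len(history) <= k + 1:
        return list(history)
    return drop_orphans([history[0]] + history[len(history) - k:])


history = [
    {"role": "user", "content": "Build an RSS digest CLI."},
    {"role": "assistant", "content": "Starting with the fetcher."},
    {"role": "assistant", "content": [{"type": "tool_use", "id": "t1", "name": "ls", "input": {}}]},
    {"role": "user", "content": [{"type": "tool_result", "tool_use_id": "t1", "content": "main.py"}]},
    {"role": "assistant", "content": "Found main.py."},
    {"role": "user", "content": "Add a --since flag."},
]
# With k=3 the cut lands between the tool call and its result,
# so the orphaned result is dropped along with the call.
trimmed = head_tail(history, 3)
```

Repairing after the cut is the simpler of two options; the alternative is to group pairs before cutting so the boundary can never fall inside one.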

Policy 3: Token-budget eviction

Same head-tail shape, but measured in tokens instead of messages. Count tokens for each message; reserve the head first, then keep adding from the tail backwards until the remaining budget runs out.

text
budget = model_context_window - max_reply_tokens - headroom
head = system_prompts + history[:1]      # reserve the protected head first
remaining = budget - tokens(head)
tail = []
for msg in reversed(history[1:]):        # walk newest to oldest
    if tokens(msg) > remaining:
        break
    tail.append(msg); remaining -= tokens(msg)
buffer = head + tail[::-1]               # restore chronological order

Why this beats message-count. Messages have wildly variable sizes. A tool result returning 50KB of JSON is one message but eats more tokens than 200 short chat turns. Message-count eviction silently lets one fat tool result push the system prompt out of token budget; token-count eviction sees the size and evicts the fat result. This is the policy every production-grade harness ends up at.

How to count. Use the official tokenizer where available. Anthropic exposes a free (rate-limited) count_tokens endpoint that accepts the same message payload you would send and returns its input token count; the docs describe the result as an estimate that can differ slightly from the billed figure, which is one more reason to keep safety headroom. OpenAI ships tiktoken. Approximating with len(text) // 4 is fine for budget estimates but unsafe for hard cutoffs: overshoot the limit by 100 tokens because your estimate was off and the call fails. The cost of one tokenizer round-trip is dwarfed by the cost of one failed request.
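One way to combine the two counters is a two-stage check: trust the cheap estimate only when it is far from the budget, and pay for the real count near the boundary. This is a sketch under assumptions: `fits_budget`, `estimate_tokens`, and the 15% `slack` are illustrative choices, and `authoritative_count` stands in for whatever wrapper you write around your provider's token counter.

```python
import math


def estimate_tokens(messages, chars_per_token=4.0):
    """Cheap local estimate; only used to decide whether to ask the real counter."""
    chars = sum(len(str(m["content"])) for m in messages)
    return math.ceil(chars / chars_per_token)


def fits_budget(messages, budget, authoritative_count, slack=0.15):
    """Two-stage check: trust the estimate only when it is far from the budget."""
    estimate = estimate_tokens(messages)
    if estimate <= budget * (1 - slack):
        return True   # comfortably under: skip the round-trip
    if estimate >= budget * (1 + slack):
        return False  # comfortably over: evict before counting precisely
    # Near the boundary the estimate cannot be trusted; ask the real counter.
    return authoritative_count(messages) <= budget
```

The slack should be at least as wide as the worst estimator error you have observed on your own payloads; code and JSON tend to drift further from the four-characters rule than prose does.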

Policy 4: Salience-based eviction

Instead of evicting the middle uniformly, drop messages tagged as low-importance first. Importance can come from a write-time classifier (“this is a routine clarification”) or from a learned salience score (lifted from the episodic-memory literature; the Generative Agents importance score is the canonical formulation). The classifier returns a number per message; the eviction pass drops the lowest-scored messages first until budget fits.

When this pays off. Long-running agents (research, coding, multi-day workflows) where some intermediates are load-bearing and others are scratch. Pays the cost of a per-message importance call to get back token budget at the eviction step.

When this hurts. Short sessions where the classifier cost exceeds the value. A salience pass that calls the model is 100ms+ per message; for a 5-turn customer-support session you’ve spent more than the budget you saved.
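The eviction pass itself is small once scores exist. The sketch below assumes the caller already has one importance score and one token count per middle message; `evict_by_salience` is a hypothetical name, and tool-pair atomicity is deliberately left out to keep the example short.

```python
def evict_by_salience(messages, scores, token_counts, budget):
    """Drop the lowest-scored messages until the total fits the budget.

    Surviving messages keep their original order. Ties are broken by age,
    so of two equally unimportant messages the older one goes first.
    """
    total = sum(token_counts)
    keep = [True] * len(messages)
    order = sorted(range(len(messages)), key=lambda i: (scores[i], i))
    for i in order:
        if total <= budget:
            break
        keep[i] = False
        total -= token_counts[i]
    return [m for m, kept in zip(messages, keep) if kept]


middle = ["design decision", "routine ack", "test failure log", "another ack"]
survivors = evict_by_salience(
    middle,
    scores=[0.9, 0.1, 0.5, 0.1],
    token_counts=[10, 20, 30, 40],
    budget=45,
)
print(survivors)  # prints ['design decision', 'test failure log']
```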

Policy 5: Summarization-and-replace (preemptive compaction)

When token usage crosses a watermark (say 60% of the model’s context window), pause, summarize the oldest M messages into a single “summary” message, and replace them. The buffer keeps the system prompt + summary + recent tail. Repeat on every watermark crossing.

text
if tokens(buffer) > watermark:
    summary = summarize(buffer[1:-keep_tail])
    buffer = buffer[:1] + [summary_message(summary)] + buffer[-keep_tail:]

This is a summarization policy; the deep version belongs in the later context-compression article in this Memory subtree. For short-term memory the relevant point: summarization is the more aggressive sibling of plain truncation. It preserves more semantic content per token but burns model calls to do so, and the summary itself is a lossy compression that you cannot undo. Run it when you have an actual reason to compress (long sessions, tight model budget); skip it when you don’t (short sessions where simple truncation works).
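A runnable variant of the watermark check, with the tokenizer and the summarizer passed in as callables so the control flow can be exercised without a model. Names and the summary-message wording are assumptions for illustration; `buffer[0]` is assumed to be the system prompt.

```python
def compact(buffer, count_tokens, summarize, watermark, keep_tail):
    """Replace everything between the system prompt and the tail with one summary.

    count_tokens(buffer) -> int and summarize(messages) -> str are supplied by
    the caller (hypothetical stand-ins for a tokenizer and a model call).
    """
    if count_tokens(buffer) <= watermark:
        return buffer  # below the watermark: nothing to do
    # An explicit split index avoids the buffer[-0:] pitfall when keep_tail is 0.
    split = max(1, len(buffer) - keep_tail)
    old, tail = buffer[1:split], buffer[split:]
    if not old:
        return buffer  # nothing between head and tail to compress
    summary_msg = {
        "role": "user",
        "content": "[summary of earlier turns] " + summarize(old),
    }
    return buffer[:1] + [summary_msg] + tail
```

Returning the original list object when no compaction happens lets the caller detect a no-op with an identity check and skip persisting a new buffer.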

The major frameworks have converged on this pattern. The OpenAI Agents SDK ships OpenAIResponsesCompactionSession as a decorator over any session backend, with a should_trigger_compaction hook that fires on a budget threshold. Anthropic exposes both server-side context editing with a trigger parameter (the API clears old tool results transparently via the clear_tool_uses strategy) and an SDK-side compaction feature that summarizes the buffer when token usage hits the trigger. LangGraph composes the same idea with trim_messages plus a summarize_node you add to your graph. Different APIs, same algorithm.

Headroom budgeting: the rule that’s never optional

The single most violated rule in real harnesses: never fill the context window to its stated limit, because the model has to output tokens too. A Claude Opus 4.7 1M-token window does not mean you can pack 1M input tokens; it means input + output must total ≤ 1M. If you pack 999K of input, the model has 1K to respond with, and any response longer than that is cut off at the limit, often in the middle of a tool-call JSON, which breaks downstream parsing.

The defensible formula:

text
input_budget = context_window
               - max_reply_tokens         # what you set in max_tokens
               - safety_headroom          # for tokenizer estimation error
               - tool_result_headroom     # tools may return mid-call material

For a 200K-token Claude Sonnet 4.6 window with max_tokens=4096, a sane setup is:

  • max_reply_tokens = 4096 (matches max_tokens)
  • safety_headroom = 2048 (~1% buffer for estimation drift)
  • tool_result_headroom = 8192 (if this turn might call tools)
  • input_budget = 200,000 - 4,096 - 2,048 - 8,192 = 185,664 tokens

When the buffer crosses input_budget, evict or compact. When it crosses the hard limit, the call fails. The watermark is the place to act; the limit is the place to panic.
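The formula and the two thresholds fit in a few lines. This is a sketch with hypothetical function names; the numbers in the usage line are the ones from the worked example above.

```python
def input_budget(context_window, max_reply_tokens, safety_headroom, tool_result_headroom=0):
    """Tokens available for input after every reservation is subtracted."""
    budget = context_window - max_reply_tokens - safety_headroom - tool_result_headroom
    if budget <= 0:
        raise ValueError("headroom reservations exceed the context window")
    return budget


def next_action(buffer_tokens, budget, context_window):
    """Classify the buffer against the watermark (budget) and the hard limit."""
    if buffer_tokens > context_window:
        return "fail"   # past the hard limit: the call cannot succeed
    if buffer_tokens > budget:
        return "evict"  # past the watermark: evict or compact before sending
    return "send"


budget = input_budget(200_000, 4_096, 2_048, 8_192)
print(budget)  # prints 185664
```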

The lost-in-the-middle effect: why “drop the middle” is dangerous beyond a point

A 2023 paper from Nelson Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang — Lost in the Middle: How Language Models Use Long Contexts — quantified what most engineers had a gut feel for: model accuracy on retrieval-from-context tasks follows a U-shaped curve over the position of the relevant content. Performance is highest when the relevant content is at the beginning (primacy bias) or end (recency bias) of the prompt, and degrades by 20–30%+ when the relevant content is in the middle. The follow-up work Found in the Middle attributes the curve to positional-encoding decay in RoPE-based architectures, which is most modern LLMs.

Two design implications for short-term memory.

First, putting the recently-evicted summary in the middle is consistent with the model’s natural attention pattern — the head and tail get the most attention anyway, so the compressed middle isn’t a privileged slot. That’s the good news; it means head-tail eviction with a middle summary is well-aligned with how the model reads.

Second, the more you stuff into the middle, the less reliably the model reads it. Some teams “compress” by leaving lots of low-importance middle material in and trusting the model to filter. The model does not reliably filter. If you wouldn’t bet the agent’s behavior on a particular middle message being read, evict it. The middle is not a free zone; it’s an attention-degraded zone.

Code: a token-budget head-tail buffer in Python

The smallest production-shaped buffer manager. Uses the Anthropic SDK for both the model call and (importantly) the official count_tokens endpoint so the budget math matches what the model actually charges. Install: pip install anthropic.

python
from anthropic import Anthropic

client = Anthropic()
MODEL = "claude-opus-4-7"

# Window math
CONTEXT_WINDOW = 200_000        # safe cap; raise if using the 1M tier
MAX_REPLY_TOKENS = 4_096
SAFETY_HEADROOM = 2_048
TOOL_HEADROOM = 8_192
INPUT_BUDGET = CONTEXT_WINDOW - MAX_REPLY_TOKENS - SAFETY_HEADROOM - TOOL_HEADROOM

def count_tokens(system: str, messages: list[dict]) -> int:
    """Authoritative token count via Anthropic's count_tokens endpoint."""
    resp = client.messages.count_tokens(
        model=MODEL, system=system, messages=messages,
    )
    return resp.input_tokens

def build_buffer(
    system: str,
    history: list[dict],
    user_msg: str,
    budget: int = INPUT_BUDGET,
) -> list[dict]:
    """
    Head-tail token-budget eviction.
    - System prompt is sent as the API's `system` param, not in messages.
    - The first user message (task anchor) is preserved if it fits; otherwise
      it degrades to a short summary line before the tail is grown.
    - We grow the tail backwards atomic-pair by atomic-pair until budget breach.
    - tool_use / tool_result pairs are kept atomic (never orphaned).
    """
    current_turn = [{"role": "user", "content": user_msg}]
    if not history:
        return current_turn

    # Settle the anchor first, so an oversized anchor cannot starve the tail.
    anchor = history[0]
    if count_tokens(system, [anchor] + current_turn) > budget:
        anchor = {
            "role": "user",
            "content": "[original task: " + truncate_for_summary(anchor["content"]) + "]",
        }

    pairs = group_into_atomic_pairs(history[1:])  # everything after the anchor
    tail: list[dict] = []

    # Grow tail backwards, pair-by-pair, while budget allows.
    # Note: this makes one count_tokens call per pair; cache or batch the
    # counts if the history is long.
    for pair in reversed(pairs):
        trial = [anchor] + flatten([pair]) + tail + current_turn
        if count_tokens(system, trial) > budget:
            break
        tail = flatten([pair]) + tail

    return [anchor] + tail + current_turn

# Atomic pair helpers — tool_use/tool_result must travel together
def group_into_atomic_pairs(history: list[dict]) -> list[list[dict]]:
    """Group adjacent tool_use → tool_result messages into single atomic units."""
    pairs, i = [], 0
    while i < len(history):
        msg = history[i]
        # Heuristic: an assistant turn that triggers a tool gets paired
        # with the next user message that holds the tool_result.
        if (msg["role"] == "assistant" and is_tool_use(msg)
                and i + 1 < len(history) and is_tool_result(history[i + 1])):
            pairs.append([msg, history[i + 1]])
            i += 2
        else:
            pairs.append([msg])
            i += 1
    return pairs

def flatten(pairs: list[list[dict]]) -> list[dict]:
    return [m for pair in pairs for m in pair]

def is_tool_use(msg: dict) -> bool:
    c = msg.get("content")
    if isinstance(c, list):
        return any(b.get("type") == "tool_use" for b in c)
    return False

def is_tool_result(msg: dict) -> bool:
    c = msg.get("content")
    if isinstance(c, list):
        return any(b.get("type") == "tool_result" for b in c)
    return False

def truncate_for_summary(content) -> str:
    text = content if isinstance(content, str) else str(content)
    return text[:200] + ("…" if len(text) > 200 else "")

# Demo: short-term memory turn
def turn(system: str, history: list[dict], user_msg: str) -> str:
    messages = build_buffer(system, history, user_msg)
    resp = client.messages.create(
        model=MODEL,
        max_tokens=MAX_REPLY_TOKENS,
        system=system,
        messages=messages,
    )
    text = "".join(b.text for b in resp.content if b.type == "text")
    history.append({"role": "user", "content": user_msg})
    history.append({"role": "assistant", "content": text})
    return text

Four things worth pointing out. First, the system prompt is the API’s system parameter, not message index 0. Anthropic’s Messages API separates it out, and OpenAI’s Responses API offers a comparable instructions field (Chat Completions still carries it as a system-role message), so with a dedicated field the eviction logic never has to worry about accidentally dropping it. Second, count_tokens is the reference count to budget against; a local estimator (len(text) / 4) is fine for development but a recipe for production context_length_exceeded errors. Third, the task anchor is the first user message and is reserved before the tail is grown; if the anchor alone is too large for the budget, it degrades to a short summary line instead of disappearing. “Preserve the anchor” must not silently become “preserve nothing else.” Fourth, tool_use/tool_result pairs travel atomically: never break a pair, because the API rejects requests that contain orphan tool messages. This is the same invariant Anthropic’s server-side clear_tool_uses strategy maintains.

The code is deliberately framework-light. A real LangGraph deployment would lean on trim_messages; an Agents SDK deployment would use OpenAIResponsesCompactionSession and let the runner handle the watermark. The principles are the same: a budget, a watermark, head-tail preservation, atomic pairs, authoritative token counting.

Code: the same buffer in TypeScript with LangGraph

The TypeScript version delegates the buffer math to LangGraph and trimMessages from @langchain/core. Install: npm install @langchain/langgraph @langchain/anthropic @langchain/core.

typescript
import { ChatAnthropic } from "@langchain/anthropic";
import {
  HumanMessage,
  AIMessage,
  SystemMessage,
  trimMessages,
  BaseMessage,
} from "@langchain/core/messages";
import {
  StateGraph,
  MemorySaver,
  Annotation,
  END,
} from "@langchain/langgraph";

const MODEL = "claude-opus-4-7";
const model = new ChatAnthropic({ model: MODEL });

// Window math (same as Python version)
const CONTEXT_WINDOW = 200_000;
const MAX_REPLY_TOKENS = 4_096;
const SAFETY_HEADROOM = 2_048;
const TOOL_HEADROOM = 8_192;
const INPUT_BUDGET =
  CONTEXT_WINDOW - MAX_REPLY_TOKENS - SAFETY_HEADROOM - TOOL_HEADROOM;

// Token counter handed to trimMessages. This assumes the ChatAnthropic wrapper
// exposes getNumTokensFromMessages in your installed version; if it does not,
// call Anthropic's count_tokens endpoint directly from this function instead.
const tokenCounter = async (msgs: BaseMessage[]): Promise<number> =>
  (await model.getNumTokensFromMessages(msgs)).totalCount;

// Tail trim with system-prompt protection: keep the SystemMessage and the most
// recent messages that fit the budget. trimMessages does not pin the first
// HumanMessage, so anchor the task in the system prompt or re-attach it yourself.
const trim = async (history: BaseMessage[]): Promise<BaseMessage[]> =>
  trimMessages(history, {
    maxTokens: INPUT_BUDGET,
    tokenCounter,
    strategy: "last",         // keep the newest messages, evict the oldest
    includeSystem: true,      // never drop the leading SystemMessage
    startOn: "human",         // kept window opens on a human turn, not an orphan tool result
    allowPartial: false,      // never split a single message
  });

const StateAnnotation = Annotation.Root({
  messages: Annotation<BaseMessage[]>({
    reducer: (a, b) => [...a, ...b],
    default: () => [],
  }),
});

const callModel = async (state: typeof StateAnnotation.State) => {
  const trimmed = await trim(state.messages);
  const resp = await model.invoke(trimmed);
  return { messages: [resp] };
};

const graph = new StateGraph(StateAnnotation)
  .addNode("model", callModel)
  .addEdge("__start__", "model")
  .addEdge("model", END)
  .compile({ checkpointer: new MemorySaver() });

// Demo
const system = new SystemMessage(
  "You are a coding assistant. Help the user incrementally. " +
    "When asked to recall what we were building, refer to the original task.",
);

await graph.invoke(
  {
    messages: [
      system,
      new HumanMessage("Build me a CLI that fetches RSS feeds and writes a digest."),
    ],
  },
  { configurable: { thread_id: "session-7" } },
);

// 200 turns later: the trim runs on every invoke. `strategy: "last"` with
// `includeSystem: true` protects only the SystemMessage, so this recall works
// only while the first HumanMessage still fits in the trimmed window.
await graph.invoke(
  { messages: [new HumanMessage("Remind me what we were trying to build.")] },
  { configurable: { thread_id: "session-7" } },
);

The framework version is shorter because trimMessages already encodes most of the policies: strategy: "last" is the tail-preserving sliding window, includeSystem: true is the system-prompt protection rule, startOn: "human" makes the kept window open on a human turn so it never begins with an orphaned tool result, and allowPartial: false is the “never split a message” guarantee. What it does not encode is the task anchor: the first HumanMessage is evicted like any other old message unless you pin it yourself. The token counter is expected to delegate to the bound model’s tokenizer (Anthropic’s, in this case). A real production setup would add a summarize_node that fires when messages.length or token-count crosses a higher watermark; we’ll work that pattern in the context-compression article later in the Memory subtree.

Trade-offs, failure modes, and gotchas

The orphan-tool-message bug. The most common production failure I see in unaudited harnesses. The buffer evicts a tool_use message but keeps its tool_result, or vice versa. The next API call returns a 400 with a message that looks like an SDK bug but is a buffer bug. Every eviction policy must treat tool-call/tool-result as atomic pairs; LangChain’s trimMessages gets most of the way there with startOn: "human"; OpenAI Agents SDK’s compaction wrappers do it implicitly; a hand-rolled harness has to do it explicitly. Audit your harness specifically for this when adding tool use.

The system-prompt-truncation bug. A close second. Some frameworks let you put the system prompt inside the message list (index 0, role "system"); a naive head-tail trim drops it as “an old message.” Always put the system content in the API’s dedicated system field (Anthropic) or use a framework setting that pins it (LangChain’s include_system=True in Python, includeSystem: true in TypeScript). The hook-anecdote bug is this one.

The token-estimator drift bug. The harness uses len(text) // 4 to count tokens, but the payload actually tokenizes at closer to 3.75 characters per token. Budget says 180K tokens fit; actual count is 192K; call fails. The fix is the official tokenizer for the model you’re using: Anthropic’s count_tokens API endpoint, OpenAI’s tiktoken, Google’s count_tokens on the Gemini SDK. The local estimator is fine for development and for triggering re-evaluation; the authoritative count is what you make the final eviction decision on.

The retry-doesn’t-retry bug. When a call fails with context_length_exceeded, the harness catches the error and retries with the same buffer — the eviction logic ran before the call, the call failed, and the retry hits the same eviction. The fix is to retry with a tighter budget — drop another 10% of token budget on retry, evict more aggressively, then try again. Single-shot eviction with a hard ceiling is brittle; budget-step-down on retry is robust.
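The step-down can be sketched as a small wrapper. Everything named here is hypothetical: `ContextLengthExceeded` stands in for whatever error your SDK raises on an oversized request, and `build` and `call` are caller-supplied functions that rebuild the buffer for a given budget and send it.

```python
class ContextLengthExceeded(Exception):
    """Stand-in for the error your SDK raises on an oversized request."""


def call_with_step_down(build, call, budget, step_pct=10, max_attempts=4):
    """Rebuild the buffer with a smaller budget after each context-length failure.

    build(budget) -> messages and call(messages) -> reply are caller-supplied.
    """
    for _ in range(max_attempts):
        try:
            return call(build(budget))
        except ContextLengthExceeded:
            # Integer math keeps the step-down deterministic: 10% off per retry.
            budget = budget * (100 - step_pct) // 100
    raise ContextLengthExceeded(f"still too large after {max_attempts} attempts")
```

The key property is that `build` runs again inside the loop; a retry that reuses the already-built messages is exactly the bug described above.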

The “I’ll just use the 1M-token window” anti-pattern. Anthropic’s 1M-token context window for Claude Opus 4.7 and Sonnet 4.6 is real and works, but it does not eliminate the need for short-term memory management. Three reasons. First, cost scales with input tokens — running every call at 800K input tokens is 200× more expensive than running it at 4K, for whatever marginal benefit. Second, latency scales with input tokens — prefill time for a 1M-token prompt is measured in seconds. Third, lost-in-the-middle still applies at 1M tokens — accuracy on retrieval-from-context tasks degrades for middle positions whether the window is 8K or 1M. A larger window is a more forgiving working set, not a reason to skip eviction.

Watermarks at or above 100% are a logic error. Setting trigger=200_000 on a 200K-token model means the compaction never fires until the call itself fails. Always set the watermark below the input budget (60–80% is typical). The point of the watermark is to evict before failure, not at the moment of failure.

The watermark crossing is the natural compaction boundary, not turn-end. Compacting after every assistant turn burns model calls for no behavioral benefit; the buffer is already in a sane state. Compaction earns its keep when the buffer grows past the watermark, not on a fixed cadence. The OpenAI Agents SDK’s should_trigger_compaction hook lets you encode this; LangGraph’s pattern is “run trim_messages every invoke (cheap), run summarization-node only when state size crosses threshold (expensive).” Pay the cheap pass always; pay the expensive pass only when it matters.

Multi-session continuity is not short-term memory. If you want the assistant to remember user preferences from yesterday’s session, that’s a job for the long-term memory tier (semantic or episodic, depending on the use case), and you read from it at turn-start and inject into the system prompt. Trying to do that with short-term memory by “keeping the buffer alive across sessions” is the most expensive way to solve a problem the memory stack has a cheaper layer for. Short-term is short.

Further reading

  • Lost in the Middle: How Language Models Use Long Contexts (Liu et al., 2023) — the paper that quantified the U-shaped attention curve. Read the multi-document QA results and the key-value retrieval results for the numbers that should inform every “drop the middle” eviction decision. The follow-up Found in the Middle (2024) attributes the curve to RoPE positional encoding decay.
  • Anthropic — Context editing — the official docs for server-side eviction, including the clear_tool_uses_20250919 strategy that handles tool-pair atomicity automatically and the trigger/keep/clear_at_least knobs. The cleanest mental model of how a server-side compaction actually works in production.
  • OpenAI Cookbook — Short-Term Memory Management with Sessions — a walkthrough of the Agents SDK’s session model, including OpenAIResponsesCompactionSession and the should_trigger_compaction hook. The Cookbook is the most current entry point because it tracks SDK changes faster than the long-form docs.
  • LangGraph — Add memory — the LangGraph view of short-term memory: checkpointers as the persistence layer, trim_messages as the eviction policy, summarize-node as the compaction layer. The pattern-language is opinionated but the patterns are good.
  • Working Memory: Scratchpads, Blackboards, and Agent Notebooks — the immediate sequel and the second half of the in-context tier. Where this article treats the conversation buffer as the implicit message-shaped IO log, the next one covers the explicit, structured scratchpad the harness maintains separately — typed state objects, external notebook tools, shared blackboards.
  • The Cognitive Taxonomy: Semantic, Episodic, Procedural — the upstream piece. Short-term memory is the L1 cache of the four-tier hierarchy; this article specialized that role into a concrete buffer-management policy.
  • The Memory Stack: A Map of AI Memory — the parent article. Short-term memory is the in-context tier; the storage and write-path tiers come next in the Memory subtree.
  • Prompt Caching: Reusing the KV Cache Across Calls — the natural cost optimization once your buffer is stable. Cache-aware eviction (keep the cacheable prefix intact while evicting from the tail) is the next refinement; if your buffer manager destroys the prefix on every turn, prompt caching never fires.