
Conversation Compaction: Keeping Long Sessions Alive

Conversation compaction in long agent sessions: reactive vs preemptive triggers, cache-aware deletion, circuit breakers, snapshot-rollback, journals.

Jatin Bansal@blog:~/ai-engineering$ open conversation-compaction

A coding agent has been running for six hours on a database migration. Token count: 740,000 of an 800,000-token budget. The harness counts tokens before every call, but the counter has been quietly off by 2% because the tokenizer shipped in the SDK lags the production tokenizer by one version. The agent emits a long tool-result; the actual input exceeds the model’s hard limit; the call returns context_length_exceeded. The harness’s compaction pass fires reactively, sends a 740k-token request to a small summarizer, and gets context_length_exceeded again because the summarizer’s window is smaller than the foreground model’s. The session is wedged: it cannot proceed without compacting, and it cannot compact because the buffer no longer fits the summarizer. On-call gets paged at 3am. The lesson is not “we should have compacted earlier”. It is that compaction is the only operation in a long agent session that, when it fails, leaves no path forward, and so its orchestration is the most safety-critical surface in the harness.

Opening bridge

Yesterday’s piece on the agent harness named seven duties and noted that “a summarizer that rewrites the prefix invalidates every downstream cache.” That sentence was a placeholder for an engineering surface; today we open it. The context-compression article covered the mechanics of summarization — recursive, structured, verbatim, opaque — and how to measure compression-induced quality loss. This piece is the complementary half: harness-level orchestration of when to fire compaction in a long session, how to fire it without breaking the prompt cache, what to do when it fails, and whether you should be compacting at all versus running an append-only memory journal that sidesteps the operation. The two together are the full conversation-compaction picture; this one closes the Agents subtree.

Definition

Conversation compaction is the harness operation that, when triggered, reduces the size of the live conversation buffer in place so that the next model call fits. Three properties distinguish conversation compaction in a long session from the broader compression operation. It is in-place mutation of the live buffer, not an offline pass — the next call must succeed against the new buffer. It is non-optional once triggered: there is no path forward if it fails. And it is cache-disruptive by default: any rewrite invalidates the prompt-cache prefix from the rewrite boundary onward, so a naive implementation pays full prefill on every subsequent call. The orchestration question is how to fire it as rarely and cheaply as possible, and how to keep the session alive when it fails.

The distributed-systems parallels

Three load-bearing parallels, each different from the parallels in the compression article. That article drew log-compaction-as-mechanism; this one draws log-compaction-as-orchestration, generational garbage collection, and circuit breakers around a single point of failure.

Log compaction as orchestration. Kafka’s log compaction runs on a background thread, operates on a separate copy of the segment, swaps the new segment in atomically once written, and rolls back on failure rather than leaving a half-compacted segment in place. The conversation analogue is exact: run the summarizer on a separate model call, write to a staging slot, validate, then atomically replace the buffer span. A naive implementation that streams the summarizer’s output into the live buffer and then prunes leaves the system in a broken state if the summarizer truncates, errors mid-stream, or returns malformed JSON.

Generational garbage collection is the closest match for when compaction fires. The JVM’s G1 collector runs young-generation GC frequently (fast) and full GC rarely (stops the world). The same shape applies here: micro-compaction (drop redundant tool results, collapse repeated reads of the same file) runs often and is cheap; full compaction (summarize the middle of the buffer) runs rarely and is expensive. Claude Code’s harness ships exactly this distinction: micro-compact runs at ~60-70% utilization and selectively clears tool outputs; full auto-compact runs at ~95%. A harness with only the full-GC equivalent runs the expensive pass too often; one with only the young-gen equivalent eventually runs out of context, because no pass ever reclaims the bulk of the buffer.
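A minimal sketch of the two-tier trigger, assuming utilization is measured as used tokens over the context limit. The thresholds and the names Pass and choose_pass are hypothetical; the figures loosely follow the ~60-70% and ~95% numbers quoted above and are not taken from any product’s source.

```python
from enum import Enum


class Pass(Enum):
    NONE = "none"
    MICRO = "micro"  # drop superseded tool results; cheap, runs often
    FULL = "full"    # summarize the middle of the buffer; expensive, runs rarely


# Assumed thresholds, chosen to mirror the ~60-70% / ~95% figures in the text.
MICRO_AT = 0.65
FULL_AT = 0.95


def choose_pass(used_tokens: int, context_limit: int) -> Pass:
    """Pick the cheapest pass that the current utilization justifies."""
    utilization = used_tokens / context_limit
    if utilization >= FULL_AT:
        return Pass.FULL
    if utilization >= MICRO_AT:
        return Pass.MICRO
    return Pass.NONE
```

Checking the full-pass threshold first keeps the two tiers mutually exclusive: a buffer at 97% gets the full pass rather than another micro pass that cannot free enough room.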

Circuit breakers around a single point of failure. Long-running databases wrap their checkpointing thread in a watchdog: if the checkpoint hangs, the watchdog kills it and triggers a fallback. Conversation compaction needs the same discipline because — as the opening anecdote showed — a compaction pass that fails repeatedly takes the whole session down with it. The circuit-breaker pattern from the tool-use article ports over: count failures, trip after N, fall back to a degraded strategy. A 2026.3.2 bug report against a major coding-agent product describes exactly this failure — compaction timeouts deadlocked the session because there was no breaker; the user couldn’t even run /new because queued commands sat behind the timed-out compaction. The fix wasn’t to make compaction faster; the fix was to add the breaker that should have been there from day one.

Reactive vs preemptive triggers

Two philosophies, and most production harnesses end up running both.

Reactive compaction fires after the buffer crosses a watermark. Anthropic’s compaction beta is the cleanest example: the server detects when input tokens exceed the configured trigger (default 150,000), runs compaction inline, emits a compaction block, and continues. OpenAI’s server-side compaction via the Responses API (shipped February 11, 2026) is conceptually identical: pass context_management.compact_threshold and the server fires compaction when the rendered token count crosses it. Claude Code’s auto-compact at ~95% is the client-side version. The advantage is operational simplicity: a single number, a single conditional. The cons: the user-facing turn that crosses the watermark pays 1-5s of compaction latency, and there’s no head-room. If the model emits a 50K-token tool result at 90K of a 100K budget, the compaction must succeed on the first try because there’s no room for retry tokens.

Preemptive compaction fires before the buffer is dangerous, by projecting whether the next turn will breach the budget after the model’s reply. The estimate is the current buffer size plus a conservative max_tokens bound on output plus the expected tool-result size from the most recent tool_use block. The pros: the user-facing turn never gets surprise latency, and there is always head-room for the model to reply. The cons: some compactions are wasted (the prediction was conservative), and the trigger requires a reasonably accurate token estimator plus policy about max output size, which often lives outside the core compaction module.

The defensible production pattern is preemptive as primary, reactive as fallback. Run preemptive at ~70% of effective context; keep reactive as a backstop at ~95% for cases where the preemptive estimator was wrong (which it will be — tool-result sizes are heavy-tailed). This is the generational-GC pattern ported to the trigger surface.
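The dual trigger can be sketched as a single decision function. This is an illustration under stated assumptions: pick_trigger and its parameter names are hypothetical, and the 0.70 and 0.95 defaults are the watermarks suggested above.

```python
from typing import Optional


def pick_trigger(
    current_tokens: int,
    max_output_tokens: int,
    expected_tool_result_tokens: int,
    context_limit: int,
    soft: float = 0.70,  # preemptive watermark
    hard: float = 0.95,  # reactive backstop
) -> Optional[str]:
    """Return which trigger fires, or None when the buffer is safe."""
    # Reactive check: the buffer as it stands is already near the limit.
    if current_tokens >= hard * context_limit:
        return "reactive"
    # Preemptive check: project the buffer after the reply and the tool result.
    projected = current_tokens + max_output_tokens + expected_tool_result_tokens
    if projected >= soft * context_limit:
        return "preemptive"
    return None
```

With the numbers from the reactive example (90K used of 100K, a 50K tool result expected), the preemptive branch fires before the call is made, which is the head-room the reactive-only design lacks.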

Cache-aware compaction: surgical deletion

The most important property of a production compactor is cache-awareness. Every buffer rewrite invalidates the prompt-cache prefix from the rewrite boundary onward, so a naive compactor that summarizes the entire history into a fresh block pays full prefill on every subsequent turn — roughly 10× the cost (cache reads bill at 10% of base on Anthropic and similar on OpenAI) and seconds of added latency. Cache-aware compaction is the load-bearing optimization that turns long sessions from expensive to feasible.
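A back-of-the-envelope version of that cost claim, as a sketch. The prices and token counts are hypothetical; the only figure taken from the text is the 10% cache-read rate. Cache-write premiums and output tokens are deliberately ignored.

```python
def prefill_cost(
    prefix_tokens: int,
    new_tokens: int,
    base_price_per_mtok: float,
    cache_hit: bool,
    cache_read_discount: float = 0.10,  # cache reads bill at 10% of base
) -> float:
    """Input cost in dollars for one call (no cache-write premium, no output)."""
    prefix_rate = base_price_per_mtok * (cache_read_discount if cache_hit else 1.0)
    return (prefix_tokens * prefix_rate + new_tokens * base_price_per_mtok) / 1_000_000


# Hypothetical session: 150K-token stable prefix, 2K new tokens per turn, $3/MTok.
hit = prefill_cost(150_000, 2_000, 3.0, cache_hit=True)
miss = prefill_cost(150_000, 2_000, 3.0, cache_hit=False)
print(f"hit=${hit:.3f} miss=${miss:.3f} ratio={miss / hit:.1f}x")
```

Under these assumptions a turn costs about $0.051 with a cache hit and $0.456 without, roughly a 9× gap; the ratio approaches 10× as the uncached tail shrinks relative to the prefix.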

Three patterns ship in production:

Surgical tool-result deletion, system prompt and assistant turns preserved. The cheapest operation: identify tool-call/tool-result pairs where the result has since been superseded (e.g., five Read calls on the same file — keep the latest, replace the earlier ones with [result superseded; see turn N]). The system prompt, assistant reasoning, and the recent tail are unchanged; the cacheable prefix grows monotonically. This is what Anthropic’s clear_tool_uses_20250919 does at the API level and what Claude Code’s micro-compact does client-side. Compression ratio is modest (10-30%), but the cache hit rate stays high, and the operation can run frequently.
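A sketch of the superseded-read pass, assuming a simplified and hypothetical message shape in which each tool result is a flat dict carrying the tool name and file path. Real provider formats nest these in content blocks, so treat supersede_reads as an illustration of the bookkeeping rather than a drop-in. The “turn” in the placeholder is the message index here.

```python
def supersede_reads(messages: list[dict]) -> list[dict]:
    """Replace all but the latest Read result per file with a short placeholder.

    Assumed message shape (hypothetical):
    {"role": "tool", "tool": "Read", "path": str, "content": str}
    """
    latest: dict[str, int] = {}
    for i, m in enumerate(messages):
        if m.get("role") == "tool" and m.get("tool") == "Read":
            latest[m["path"]] = i  # later indices overwrite earlier ones

    out = []
    for i, m in enumerate(messages):
        is_read = m.get("role") == "tool" and m.get("tool") == "Read"
        if is_read and latest[m["path"]] != i:
            # Keep the message slot so tool-call/tool-result pairing stays intact.
            placeholder = f"[result superseded; see turn {latest[m['path']]}]"
            out.append({**m, "content": placeholder})
        else:
            out.append(m)
    return out
```

The function returns a new list and leaves the input untouched, so the caller can validate the result before swapping it into the live buffer.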

Append-only summary block, untouched tail. When micro-compaction isn’t enough: write a new “session summary” block at a stable position (right after the system prompt), and drop the messages it summarizes. The summary, once written, is treated as immutable for the next K turns — it doesn’t get re-summarized on every turn; the cache treats it as a stable prefix. The conversation tail after the summary keeps growing and benefits from incremental caching. The Anthropic beta’s pause_after_compaction option exists for this pattern: pause after the summary is generated, let the client preserve any instruction-oriented messages, then continue. Compression ratio is high (80-95%); cache invalidation cost is paid once per K turns rather than every turn.

Anti-pattern: summary-as-system-prompt mutation. Some harnesses fold the running summary into the system prompt, mutating it every cycle. This is the worst thing you can do for the cache — the system prompt is the most-cached prefix, and rewriting it every turn recomputes the entire prefix on every turn. The cost graph reads “we turned on caching” but the bill stays flat. The harness anatomy article flagged this; it bears repeating because every team seems to invent it independently.

Rule of thumb: compress as far from the cache-hit prefix as possible; rewrite as little of the prefix as possible; treat the summary block as an immutable record between compactions, not a running state.
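One way to check the rule in a test suite is to measure how much of the old buffer survives as an identical prefix of the new one. The sketch below is a rough proxy under an explicit assumption: real prompt caches match on tokens, not on whole messages, and shared_prefix_len is a hypothetical helper.

```python
import json


def shared_prefix_len(old: list[dict], new: list[dict]) -> int:
    """Count leading messages that are identical between two buffers."""
    n = 0
    for a, b in zip(old, new):
        if json.dumps(a, sort_keys=True) != json.dumps(b, sort_keys=True):
            break
        n += 1
    return n


system = {"role": "system", "content": "You are a migration agent."}
summary = {"role": "user", "content": "SESSION SUMMARY v1"}
t1 = {"role": "user", "content": "turn 1"}
t2 = {"role": "assistant", "content": "turn 2"}
t3 = {"role": "user", "content": "turn 3"}

before = [system, summary, t1, t2]
append_only = before + [t3]  # stable summary block, growing tail
mutated = [{"role": "system", "content": "You are a migration agent. SUMMARY v2"},
           t1, t2, t3]      # summary folded into the system prompt

print(shared_prefix_len(before, append_only))  # 4
print(shared_prefix_len(before, mutated))      # 0
```

The append-only buffer keeps the whole previous buffer as a reusable prefix; the system-prompt mutation keeps none of it.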

Error recovery: circuit breakers and snapshot-rollback

Compaction can fail in five ways: the summarizer errors out (network, rate limit, 5xx); it returns malformed JSON; it returns valid JSON but with empty load-bearing fields (the worst silent failure); its input exceeds its own context window (the wedged-session bug from the opening); or the summarized buffer still exceeds the foreground model’s limit. Each needs a typed recovery path.
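“Typed” can be as simple as an enum and a policy table. The mapping below is one plausible assignment assembled from the recovery strategies this section goes on to describe; the names Failure, RECOVERY, and recover, and the action strings, are hypothetical.

```python
from enum import Enum, auto


class Failure(Enum):
    SUMMARIZER_ERROR = auto()     # network, rate limit, 5xx
    MALFORMED_JSON = auto()       # output does not parse
    EMPTY_FIELDS = auto()         # valid JSON, load-bearing fields empty
    SUMMARIZER_OVERFLOW = auto()  # buffer does not fit the summarizer's window
    STILL_TOO_LARGE = auto()      # compacted buffer exceeds the foreground limit


# Assumed policy: one recovery action per failure type.
RECOVERY = {
    Failure.SUMMARIZER_ERROR: "rollback_then_retry_next_turn",
    Failure.MALFORMED_JSON: "rollback_then_retry_next_turn",
    Failure.EMPTY_FIELDS: "rollback_then_retry_next_turn",
    Failure.SUMMARIZER_OVERFLOW: "chunk_and_merge",
    Failure.STILL_TOO_LARGE: "shrink_tail_and_recompact",
}


def recover(failure: Failure, breaker_open: bool) -> str:
    """Choose a recovery action; an open breaker always forces truncation."""
    if breaker_open:
        return "lossy_truncate"
    return RECOVERY[failure]
```

Keeping the table exhaustive over the enum means a new failure type cannot be added without someone deciding what the harness does about it.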

Snapshot-and-rollback for atomic compaction. Before mutating the live buffer, take a snapshot — a deep copy of the message array and the running summary. Run the summarizer against the snapshot, validate the output (JSON parse + schema check + load-bearing-field presence), and only on success atomically swap the new buffer in. On any failure, roll back to the snapshot. The snapshot is the same primitive long-horizon checkpointing uses, applied at a finer grain. The cost is small; the safety benefit is large — compaction never leaves the session half-rewritten.

Circuit breaker around compaction failures. Maintain a per-session counter of consecutive failures. Trip at N=3 (a defensible default; lower if throughput-bounded, higher if the summarizer is flaky). When tripped, do not retry compaction — fall back to a degraded strategy and emit a structured failure. The breaker’s purpose is to break the infinite-failure loop: a session that fails once is recoverable; a session that fails repeatedly while burning the full timeout window each time is a service outage. After M turns of cool-down, reset.

Lossy truncation as the last-resort fallback. When the breaker is tripped or the summarizer is unreachable, fall back to unconditional head-tail truncation: keep the system prompt, keep the most recent K turns, drop everything in the middle. This is the default eviction policy from short-term memory, used here as the fallback when the smarter policy fails. The agent loses semantic context but the session stays alive — and a degraded session is recoverable; a crashed one is not. The fallback should be loud: log the event, expose a UI indicator, mark the failure for downstream review.

The wedged-buffer escape hatch. For the case where the summarizer itself can’t fit the buffer in its context: either keep a secondary summarizer with a larger context window, or run a chunked-and-merge pass that splits the buffer in half, summarizes each half independently, then summarizes the summaries. The chunked-and-merge pattern is the same shape as map-reduce summarization and is the right recovery primitive when the foreground model’s context exceeds the summarizer’s.
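A single split in half is not always enough: each half may still exceed the summarizer’s window. A recursive variant, sketched here with the summarizer and the fit check injected as callables (both hypothetical, so the logic can be tested without a model call), keeps splitting until every chunk fits and then merges upward.

```python
from typing import Callable

Messages = list[dict]


def summarize_recursive(
    messages: Messages,
    summarize: Callable[[Messages], dict],  # assumed: one summarizer call
    fits: Callable[[Messages], bool],       # assumed: fits the summarizer window?
    depth: int = 0,
    max_depth: int = 8,
) -> dict:
    """Split until each chunk fits, summarize the leaves, then merge upward."""
    if fits(messages):
        return summarize(messages)
    if len(messages) <= 1 or depth >= max_depth:
        raise ValueError("buffer cannot be reduced to fit the summarizer window")
    mid = len(messages) // 2
    left = summarize_recursive(messages[:mid], summarize, fits, depth + 1, max_depth)
    right = summarize_recursive(messages[mid:], summarize, fits, depth + 1, max_depth)
    merged = [
        {"role": "user", "content": f"PARTIAL SUMMARY (first half): {left}"},
        {"role": "user", "content": f"PARTIAL SUMMARY (second half): {right}"},
    ]
    # The merged pair is small, but it still goes through the same fit check.
    return summarize_recursive(merged, summarize, fits, depth + 1, max_depth)
```

The depth bound is a guard against a summarizer whose output does not shrink; without it, a non-fitting merge step could recurse indefinitely.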

Append-only memory journals: the architectural alternative

A radically different design ports the database log-vs-LSM-tree decision: instead of compacting the live buffer, run the live buffer at near-zero retention and write everything important to an external append-only memory journal, queried at retrieval time rather than loaded wholesale.

Mechanically: every turn, before the model call, the harness extracts decisions, files, errors, or load-bearing facts and appends them to a per-session journal (JSONL file, Postgres table, vector store — substrate-agnostic). The live buffer stays short — last 10-20 turns — by aggressive truncation. When the model needs earlier context, it queries the journal via a recall tool the harness exposes. The journal grows linearly; the live buffer is bounded by design.
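A minimal sketch of the journal on the simplest substrate, a JSONL file, with naive keyword recall standing in for whatever retrieval a real system would use. The Journal class, its method names, and the entry fields are all hypothetical.

```python
import json
from pathlib import Path


class Journal:
    """Append-only JSONL journal with naive keyword recall (illustrative only)."""

    def __init__(self, path: str):
        self.path = Path(path)

    def append(self, turn: int, kind: str, text: str) -> None:
        # Entries are only ever appended; nothing is rewritten in place.
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps({"turn": turn, "kind": kind, "text": text}) + "\n")

    def recall(self, query: str, limit: int = 5) -> list[dict]:
        """Return the most recent entries containing every query term."""
        if not self.path.exists():
            return []
        terms = query.lower().split()
        with self.path.open(encoding="utf-8") as f:
            entries = [json.loads(line) for line in f if line.strip()]
        hits = [e for e in entries if all(t in e["text"].lower() for t in terms)]
        return hits[-limit:]
```

The harness would expose recall to the model as a tool; swapping the keyword match for semantic search changes retrieval quality but not the append-only contract.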

The journal model has three properties worth naming. It replaces the orchestration-failure surface with a retrieval-quality surface: no compaction to fail, but a recall query whose quality determines whether the agent finds the relevant fact. It makes the trade-off explicit: the operator can inspect, query, and audit the journal — versus a summary block whose contents are at the summarizer’s discretion. And it plays well with sleep-time compute: the journal is exactly the artifact an offline consolidation pass needs.

The pattern shows up under several names in 2026 production systems. Doug Turnbull’s “give your coding agent a journal” is the cleanest articulation — an agent maintains a journal file in the working directory, one entry per significant action, queried when it needs to remember. OpenCode’s append-only journal blocks productize the idea with semantic search. LangGraph’s checkpoint-and-store split implements the architectural distinction at the framework level. The unifying claim: aggressive truncation plus a queryable external journal beats summarization-in-place for any task where audit and reproducibility matter more than the gist surviving in prose.

The journal wins for coding agents (audit trail needs to survive verbatim), compliance workflows (regulators want exact actions, not paraphrases), long-horizon research (the journal is the research log), and multi-session agents (journals are naturally cross-session). Compaction still wins for open-ended conversational agents where the gist is what matters, latency-sensitive interactive agents where the journal round-trip dominates, and single-session ephemeral workflows. The patterns are complementary; the production answer for most agents is both — a journal for durable state, compaction for the live buffer’s working set.

A preemptive, cache-aware compactor in Python

Realizes the full orchestration shape: preemptive triggering, snapshot-and-rollback, circuit-breaker recovery, cache-aware mutation. Uses the Anthropic SDK.

python
# pip install "anthropic>=0.89.0"
import copy
import json
import logging
from dataclasses import dataclass

from anthropic import Anthropic

log = logging.getLogger("compactor")
client = Anthropic()

SUMMARIZER_MODEL = "claude-haiku-4-5"
SUMMARY_PROMPT = """You are compacting an agent conversation. Read the messages and emit a JSON object with these keys exactly. Never paraphrase file paths, error codes, function names, or version numbers — copy them verbatim.

{"session_intent": "", "files_touched": [], "decisions": [], "pending_questions": [], "next_steps": []}

MESSAGES:
{messages}

Respond with the JSON object only.
"""

@dataclass
class CompactionConfig:
    soft_watermark: float = 0.70   # preemptive trigger
    hard_watermark: float = 0.95   # reactive fallback
    tail_keep: int = 10            # messages preserved verbatim after compaction
    max_failures: int = 3          # circuit-breaker trip threshold
    cooldown_turns: int = 5        # cool-down before breaker resets
    summarizer_max_tokens: int = 1024
    output_headroom_tokens: int = 4096  # reserved for the foreground reply

@dataclass
class CompactorState:
    consecutive_failures: int = 0
    breaker_tripped_at_turn: int | None = None
    last_summary: dict | None = None
    last_summary_at_turn: int = 0
    turn: int = 0

def estimate_tokens(messages: list[dict]) -> int:
    # ~4 chars per token; replace with the provider's tokenizer in production.
    return sum(len(json.dumps(m)) for m in messages) // 4

def project_next_turn(
    messages: list[dict],
    pending_tool_result_estimate: int,
    cfg: CompactionConfig,
) -> int:
    return (
        estimate_tokens(messages)
        + pending_tool_result_estimate
        + cfg.output_headroom_tokens
    )

class CompactorError(Exception): ...
class WedgedBufferError(CompactorError): ...
class BreakerTrippedError(CompactorError): ...

def _run_summarizer(messages: list[dict]) -> dict:
    # Validate output shape; raise on any malformed result so the caller can
    # snapshot-rollback rather than commit a broken summary.
    body = json.dumps(messages, ensure_ascii=False)
    resp = client.messages.create(
        model=SUMMARIZER_MODEL,
        max_tokens=1024,
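        # str.replace, not str.format: the prompt contains literal JSON braces,
        # which str.format would try to interpret as replacement fields.
        messages=[{"role": "user", "content": SUMMARY_PROMPT.replace("{messages}", body)}],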
    )
    text = resp.content[0].text.strip()
    if text.startswith("```"):
        text = text.split("```", 2)[1].lstrip("json\n").rstrip("`\n")
    obj = json.loads(text)  # malformed JSON raises here
    required = {"session_intent", "files_touched", "decisions",
                "pending_questions", "next_steps"}
    if not required.issubset(obj.keys()):
        raise CompactorError(f"summarizer missing keys: {required - obj.keys()}")
    # Empty-load-bearing-field check: a summary with no decisions and no files
    # touched after a 50-turn coding session is almost certainly a silent failure.
    if not (obj["decisions"] or obj["files_touched"]) and len(messages) > 20:
        raise CompactorError("summarizer returned empty load-bearing fields")
    return obj

def _chunked_and_merge(messages: list[dict]) -> dict:
    # Wedged-buffer escape hatch: split, summarize halves, summarize summaries.
    mid = len(messages) // 2
    left = _run_summarizer(messages[:mid])
    right = _run_summarizer(messages[mid:])
    merged_input = [
        {"role": "user", "content": "PARTIAL SUMMARY (first half):\n" + json.dumps(left)},
        {"role": "user", "content": "PARTIAL SUMMARY (second half):\n" + json.dumps(right)},
    ]
    return _run_summarizer(merged_input)

def _lossy_truncate(messages: list[dict], tail_keep: int) -> list[dict]:
    # Last-resort fallback when the circuit breaker is tripped or summarizer fails.
    system = [m for m in messages if m.get("role") == "system"]
    tail = [m for m in messages if m.get("role") != "system"][-tail_keep:]
    return system + [{"role": "user", "content": "[earlier history truncated; "
                                                 "compactor unavailable]"}] + tail

class Compactor:
    def __init__(self, cfg: CompactionConfig | None = None,
                 model_context_limit: int = 200_000):
        self.cfg = cfg or CompactionConfig()
        self.state = CompactorState()
        self.limit = model_context_limit

    def maybe_compact(
        self,
        messages: list[dict],
        pending_tool_result_estimate: int = 0,
    ) -> list[dict]:
        self.state.turn += 1
        projected = project_next_turn(messages, pending_tool_result_estimate, self.cfg)

        # Breaker cool-down: reset after the configured number of turns.
        if (self.state.breaker_tripped_at_turn is not None and
                self.state.turn - self.state.breaker_tripped_at_turn
                >= self.cfg.cooldown_turns):
            log.info("compactor.breaker_reset turn=%d", self.state.turn)
            self.state.breaker_tripped_at_turn = None
            self.state.consecutive_failures = 0

        if projected < self.limit * self.cfg.soft_watermark:
            return messages  # no compaction needed

        # If the breaker is tripped, go straight to lossy truncation.
        if self.state.breaker_tripped_at_turn is not None:
            log.warning("compactor.breaker_open fallback=truncate turn=%d",
                        self.state.turn)
            return _lossy_truncate(messages, self.cfg.tail_keep)

        # Snapshot the live buffer before any mutation.
        snapshot = copy.deepcopy(messages)
        try:
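            # Wedged-buffer escape. Note: this compares against the foreground
            # limit; if the summarizer's window is smaller, compare against that.
            if estimate_tokens(messages) > self.limit: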
                summary = _chunked_and_merge(messages[:-self.cfg.tail_keep])
            else:
                summary = _run_summarizer(messages[:-self.cfg.tail_keep])
        except Exception as e:
            self.state.consecutive_failures += 1
            log.error("compactor.failure n=%d err=%s",
                      self.state.consecutive_failures, e)
            if self.state.consecutive_failures >= self.cfg.max_failures:
                self.state.breaker_tripped_at_turn = self.state.turn
                log.error("compactor.breaker_tripped turn=%d", self.state.turn)
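            # Atomic rollback: the live buffer was never mutated. Below the hard
            # watermark there is still room to retry next turn; at or above it,
            # degrade to lossy truncation so the next call fits.
            if projected < self.limit * self.cfg.hard_watermark:
                return snapshot
            return _lossy_truncate(snapshot, self.cfg.tail_keep)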

        # Success: build the new buffer atomically.
        self.state.consecutive_failures = 0
        self.state.last_summary = summary
        self.state.last_summary_at_turn = self.state.turn
        system = [m for m in messages if m.get("role") == "system"]
        tail = messages[-self.cfg.tail_keep:]
        summary_block = {
            "role": "user",
            "content": "PRIOR SESSION SUMMARY (immutable until next compaction):\n"
                       + json.dumps(summary, indent=2),
        }
        return system + [summary_block] + tail

The interesting parts aren’t the summarizer call — that’s the context-compression article’s domain — but the orchestration. The snapshot-and-rollback cannot leave the buffer half-rewritten. The breaker prevents the infinite-failure loop. The lossy-truncation fallback keeps the session moving. The chunked-and-merge escape hatch handles the wedged case. The preemptive trigger gates on projected, not on current buffer size, so head-room is reserved before the call.

Same shape in TypeScript with the Vercel AI SDK

The Vercel AI SDK’s prepareStep is the documented hook for conversation compaction in AI SDK 5; it runs before each model call and can rewrite the messages array. The orchestration shape ports over without modification.

typescript
// npm install ai @ai-sdk/anthropic zod
import { generateObject, type ModelMessage } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { z } from "zod";

const SummarySchema = z.object({
  session_intent: z.string(),
  files_touched: z.array(z.string()),
  decisions: z.array(z.string()),
  pending_questions: z.array(z.string()),
  next_steps: z.array(z.string()),
});
type Summary = z.infer<typeof SummarySchema>;

type CompactionConfig = {
  softWatermark: number;
  hardWatermark: number;
  tailKeep: number;
  maxFailures: number;
  cooldownTurns: number;
  outputHeadroomTokens: number;
};

const DEFAULT_CONFIG: CompactionConfig = {
  softWatermark: 0.7,
  hardWatermark: 0.95,
  tailKeep: 10,
  maxFailures: 3,
  cooldownTurns: 5,
  outputHeadroomTokens: 4096,
};

const SUMMARIZER = anthropic("claude-haiku-4-5");

const estimateTokens = (messages: ModelMessage[]): number =>
  Math.ceil(messages.reduce((acc, m) => acc + JSON.stringify(m).length, 0) / 4);

const lossyTruncate = (
  messages: ModelMessage[],
  tailKeep: number,
): ModelMessage[] => {
  const sys = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  return [
    ...sys,
    {
      role: "user",
      content: "[earlier history truncated; compactor unavailable]",
    },
    ...rest.slice(-tailKeep),
  ];
};

async function runSummarizer(messages: ModelMessage[]): Promise<Summary> {
  const { object } = await generateObject({
    model: SUMMARIZER,
    schema: SummarySchema,
    prompt: `Compact this agent conversation. Never paraphrase file paths, error codes, function names, or version numbers — copy them verbatim.\n\nMESSAGES:\n${JSON.stringify(messages)}`,
  });
  // Empty-load-bearing-field guard: a long session with no decisions and no
  // files touched is almost certainly a silent failure.
  if (
    messages.length > 20 &&
    object.decisions.length === 0 &&
    object.files_touched.length === 0
  ) {
    throw new Error("summarizer returned empty load-bearing fields");
  }
  return object;
}

export class Compactor {
  private failures = 0;
  private trippedAtTurn: number | null = null;
  private turn = 0;
  private lastSummary: Summary | null = null;

  constructor(
    private readonly modelContextLimit: number,
    private readonly cfg: CompactionConfig = DEFAULT_CONFIG,
  ) {}

  async maybeCompact(
    messages: ModelMessage[],
    pendingToolResultEstimate = 0,
  ): Promise<ModelMessage[]> {
    this.turn += 1;

    if (
      this.trippedAtTurn !== null &&
      this.turn - this.trippedAtTurn >= this.cfg.cooldownTurns
    ) {
      this.trippedAtTurn = null;
      this.failures = 0;
    }

    const projected =
      estimateTokens(messages) +
      pendingToolResultEstimate +
      this.cfg.outputHeadroomTokens;
    if (projected < this.modelContextLimit * this.cfg.softWatermark) {
      return messages;
    }

    if (this.trippedAtTurn !== null) {
      return lossyTruncate(messages, this.cfg.tailKeep);
    }

    // Snapshot via structuredClone; mutation only on success.
    const snapshot = structuredClone(messages);
    let summary: Summary;
    try {
      summary = await runSummarizer(messages.slice(0, -this.cfg.tailKeep));
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.cfg.maxFailures) {
        this.trippedAtTurn = this.turn;
      }
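      // Rollback: the buffer was never mutated. Below the hard watermark there
      // is room to retry next turn; at or above it, degrade to truncation.
      if (projected < this.modelContextLimit * this.cfg.hardWatermark) {
        return snapshot;
      }
      return lossyTruncate(snapshot, this.cfg.tailKeep);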
    }

    this.failures = 0;
    this.lastSummary = summary;
    const sys = messages.filter((m) => m.role === "system");
    const tail = messages.slice(-this.cfg.tailKeep);
    const summaryBlock: ModelMessage = {
      role: "user",
      content:
        "PRIOR SESSION SUMMARY (immutable until next compaction):\n" +
        JSON.stringify(summary, null, 2),
    };
    return [...sys, summaryBlock, ...tail];
  }
}

// Usage with prepareStep — the SDK hook that fires before each model call.
// const compactor = new Compactor(200_000);
// const result = await streamText({
//   model: anthropic("claude-opus-4-7"),
//   messages,
//   prepareStep: async ({ messages: stepMessages }) => ({
//     messages: await compactor.maybeCompact(stepMessages),
//   }),
// });

The orchestration is provider-agnostic: the trigger lives in the harness, the summarizer is one model call like any other, the breaker and snapshot are pure state machines. Only the SDK integration differs — prepareStep in the Vercel AI SDK, manual buffer reconstruction in the raw Anthropic SDK.

Trade-offs, failure modes, and gotchas

Don’t compact during a tool-use cycle. A rewrite between a tool_use block and its matching tool_result leaves the API in an inconsistent state — modern providers reject the malformed sequence. Compaction fires between completed turns, never mid-turn; if the trigger fires while a result is pending, defer. Same tool-call/tool-result pairing invariant the short-term memory article named for truncation.
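The deferral check is a few lines if messages follow the Anthropic content-block shape, where an assistant turn carries tool_use blocks with an id and a later user turn carries tool_result blocks with a matching tool_use_id. The helper names are hypothetical.

```python
def pending_tool_use_ids(messages: list[dict]) -> set[str]:
    """IDs of tool_use blocks that have no matching tool_result yet."""
    pending: set[str] = set()
    for m in messages:
        content = m.get("content")
        if not isinstance(content, list):
            continue  # plain-string content carries no tool blocks
        for block in content:
            if block.get("type") == "tool_use":
                pending.add(block["id"])
            elif block.get("type") == "tool_result":
                pending.discard(block["tool_use_id"])
    return pending


def safe_to_compact(messages: list[dict]) -> bool:
    """True only when every tool call in the buffer has its result."""
    return not pending_tool_use_ids(messages)
```

A harness would call safe_to_compact before acting on any trigger and, when it returns False, carry the trigger over to the next completed turn.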

Tokenizer drift (the opening bug). Token counters lag the production tokenizer. If the harness ships its own, reserve a 5-10% safety margin below the stated context limit. Cheap defense: use the provider’s tokenizer API. Expensive defense: count server-side via usage blocks. Never treat a third-party tiktoken or claude-tokenizer library as exact; production tokenizers update.
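One way to combine the two defenses is to calibrate the local estimate against the input-token count the server reports after each call, and keep a margin on top. This is a sketch: TokenCalibrator is hypothetical, and the 5% default margin is an assumption.

```python
import math


class TokenCalibrator:
    """Scale local estimates by the worst under-count seen against server usage."""

    def __init__(self, margin: float = 0.05):
        self.ratio = 1.0      # max(actual / estimated) observed so far
        self.margin = margin  # extra safety factor on top of the observed drift

    def observe(self, estimated: int, actual_input_tokens: int) -> None:
        """Record one call: local estimate vs. the server-reported input count."""
        if estimated > 0:
            self.ratio = max(self.ratio, actual_input_tokens / estimated)

    def corrected(self, estimated: int) -> int:
        """Pessimistic token count to compare against the context limit."""
        return math.ceil(estimated * self.ratio * (1 + self.margin))
```

The ratio only ever ratchets upward, so a single call where the local counter happened to over-count does not erase the evidence of an earlier under-count.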

Summarizer-context smaller than foreground. Common config mistake: foreground is Opus with 1M-token context, summarizer is Haiku with 200K. When the buffer crosses 200K, the summarizer fails because its own context is exhausted. Fix: pick a summarizer with at least the foreground model’s context, or ship the chunked-and-merge fallback as first-class. The OpenAI Agents SDK’s OpenAIResponsesCompactionSession sidesteps this by compacting server-side; client-side compactors don’t get that luxury.

The re-reading loop. The summary-paraphrasing bug, named in depth in the context-compression article, shows up here as the agent calling Read on a file it already edited because the summary paraphrased the path. Beyond the schema-enforced verbatim-identifier rule, the orchestration defense is to retain a parallel structured-index store (last file modified, last error seen, last decision) outside the compacted buffer and re-inject the relevant entries as pinned context. The journal pattern provides this naturally; the compaction-only pattern adds it explicitly.

Repeated compaction on every turn. If the soft watermark fires but compaction only achieves a small ratio, the next turn still projects over the watermark and compaction fires again. The session burns calls on back-to-back compactions. Fix: set a post-compaction target (compress until the buffer is ≤50% of the soft watermark) rather than running compaction once on crossing. Compress to the floor, not just below the ceiling.
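Compressing to the floor can be decided before the summarizer runs, by choosing how much verbatim tail fits under the target. The sketch below assumes the summary has a known token budget and that fixed content (system prompt) is counted separately; pick_tail_keep and its parameters are hypothetical.

```python
from typing import Callable

Messages = list[dict]


def pick_tail_keep(
    messages: Messages,
    estimate: Callable[[Messages], int],  # assumed token estimator
    target_tokens: int,                   # e.g. 50% of the soft watermark
    summary_budget_tokens: int,           # upper bound on the summary block
    fixed_tokens: int = 0,                # system prompt and other pinned content
    max_tail: int = 10,
) -> int:
    """Largest tail (in messages) that keeps the compacted buffer under target."""
    for k in range(min(max_tail, len(messages)), 0, -1):
        total = fixed_tokens + summary_budget_tokens + estimate(messages[-k:])
        if total <= target_tokens:
            return k
    return 0  # even one message is too large; the caller must truncate or chunk
```

Choosing the tail size up front costs one summarizer call per compaction instead of a loop of compact-and-measure passes.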

Compaction during streaming. If the watermark is crossed mid-stream, do not interrupt to compact. Let the stream complete, then compact before the next turn. Interrupting a stream to mutate the buffer is the same bug family as compacting mid-tool-use.

Goodhart on trigger rate. Optimizing for “compactions fired per session” without measuring failed-compaction cost tempts the operator to lower the breaker threshold and fall back to truncation aggressively. Compactions-fired drops; session quality silently degrades because more sessions run on truncated context. Track post-compaction probe recovery jointly — does the agent still answer “what file did we edit?” correctly.

Anthropic’s compaction beta does most of this for you. When you can use compact_20260112, the server-managed checkpoint eliminates client-side boundary risk, the breaker is implicit, and the cache-aware rewrite is handled in the API. The client-side patterns in this article are for provider portability, proxies that don’t expose the beta, or trigger logic more sophisticated than threshold-crossing. Use the server-side primitive when you can; build the harness primitive when you can’t.

  • Summarization and Context Compression — the sibling piece. This article handles when and how to fire compaction in a long session; the compression article handles what the compaction operation actually does — recursive summarization, structured note-taking, verbatim compaction, opaque compression, and the quality-loss diagnostics.
  • Anatomy of an Agent Harness — the runtime layer the compactor lives inside. Duty 5 (error recovery) and duty 4 (cache management) are the two duties this article specializes; the harness anatomy article is the integration view.
  • Prompt Caching: Reusing the KV Cache Across Calls — the cost lever the cache-aware-compaction pattern is built around. Compaction is one of the few operations that can destroy a cache hit rate overnight; the prompt-caching article is the upstream context for why the cache-awareness discipline pays off.
  • Long-Horizon Task Reliability — the broader recovery framework. The snapshot-and-rollback pattern, the circuit breaker, and the degraded-fallback discipline are all expressions of the same long-horizon-reliability primitives at a finer grain.

Further reading