jatin.blog ~ $
$ cat ai-engineering/working-memory-scratchpads.md

Working Memory: Scratchpads, Blackboards, and Agent Notebooks

Working memory for agents: scratchpads, blackboards, notebooks, and dataflow state — the in-context surface that sits above the conversation buffer.

Jatin Bansal@blog:~/ai-engineering$ open working-memory-scratchpads

A long-running coding agent hits turn 90 on a refactor that touches forty files. Its conversation buffer is healthy — the short-term memory eviction policy keeps the head and tail and trims the middle, the system prompt is pinned, headroom is reserved. The agent still gets the wrong answer. The reason is not what the buffer remembers; it’s what it never wrote down. The list of files already migrated, the open questions the agent flagged on turn 30, the failed approach from turn 45 — none of that was in a structured scratchpad the harness could re-inject on turn 90. The model re-derived state from the messy chat log, mis-counted what was done, edited two files twice and missed three others. The conversation buffer was fine. The agent had no working memory.

Opening bridge

Yesterday’s piece on the conversation buffer treated short-term memory as the in-context byte stream the harness sends to the model. That layer is necessary but it is not the entire in-context tier. The other half — the part the harness writes to explicitly, separately from the chat log — is working memory. The memory stack overview named it as the second piece of the in-context layer (“the scratchpad, blackboards, and agent notebooks for state mid-task”) and the cognitive taxonomy pinned the definition to CoALA’s working memory tier. Today’s piece is the deep dive: the substrates that working memory actually runs on in 2026, how each interacts with the conversation buffer, and the failure modes that show up when you confuse the two.

Definition

Working memory is an explicit, structured surface the agent reads from and writes to during a single task, distinct from the raw conversation buffer, that the harness re-serializes into the prompt on every turn it is read. Three properties separate it from short-term memory. First, it is structured — typed fields, a key/value store, a typed graph node, or a file on disk — rather than an opaque list of chat messages. Second, the writes are deliberate — the agent invokes a tool or emits a structured directive to update it, not as a side effect of saying something in conversation. Third, the lifetime is the task, not the turn: working memory is intended to outlast eviction of the messages that produced it, but it is not the durable long-term store; when the task ends the harness can drop it or promote selected items to episodic or semantic memory.

What working memory is not, in this taxonomy. It is not the conversation buffer (that’s short-term memory — implicit, message-shaped, FIFO-evicted). It is not long-term episodic storage (that’s the vector store of past sessions). It is not the model’s hidden activations (those evaporate at the end of each forward pass). It is the agent’s notebook for the current task, materialized as something the harness can edit, audit, and inject as a coherent block on every model call.

Intuition

The mental model that pays off is the typed local variables of a long-running function call. The conversation buffer is stdin/stdout of the session — what got said. Working memory is the function’s locals — the variables the function is actively using to compute its result, never serialized to the user but always available to the function body. When the function decides “I’ll need this later,” it stores it in a local; later in the function it reads from that local. The conversation transcript is the IO log; the working memory is the program state.

Three concrete shapes the locals take in modern agent harnesses:

  1. The chain-of-thought scratchpad — an unstructured text buffer the model writes to with prose (“Plan: 1. read all files. 2. find usages. 3. update each.”). The simplest form, prompted into existence inside the ReAct thought channel.
  2. The typed state object — a structured record with explicit fields ({plan: string[], completed_files: string[], open_questions: string[]}) that the harness mutates from tool calls or model directives.
  3. The external notebook — a file or set of files outside the prompt that the agent writes to via a memory or file_edit tool and reads from on demand. The notebook crosses session boundaries when persisted, blurring into long-term memory; the working-memory framing covers the within-task usage.

A fourth shape that comes back later in the subtree: the shared blackboard that multiple agents read and write concurrently — the only working-memory substrate where concurrency is a first-class concern.

The distributed-systems parallel: a dataflow graph

Three honest parallels worth naming.

A scratchpad is the agent’s call stack. The conversation buffer is the IO log; the scratchpad is the program’s runtime state — the locals, the partial plan, the in-flight intermediate values. A function call without locals would have to recompute everything on every line; a long-running agent loop without working memory does the same. The classic “the agent re-derived state from the chat log” bug is the agent recomputing locals every turn because the harness gave it nowhere persistent to store them.

Typed working memory is a dataflow graph, not a queue. LangGraph’s State schema is explicitly typed: each field has a reducer (operator.add for append, plain assignment for overwrite). When a node returns {notes: ["..."]} the reducer appends; when a node returns {plan: "..."} the reducer overwrites. The State is the graph’s mutable working memory and every node sees the latest value via the reducer-applied merge. The pattern is the same one Apache Beam or Flink use: typed channels with explicit accumulator semantics, ergonomic for nodes that need both append-only history and overwrite-on-each-step scratch. The agent loop maps onto this exactly — messages is an append-only channel, plan is an overwrite-on-revision channel, notes is an append-only side channel, and the model reads all three on every step.

A shared blackboard is a persistent message broker for agents. The Hearsay-II speech-understanding system (1971–1976) introduced the blackboard architecture — a global shared working memory where independent knowledge sources posted hypotheses, read each other’s hypotheses, and triggered new computation when relevant entries appeared. The pattern reads like a long-lived bulletin board with a scheduler that wakes whichever specialist’s preconditions match the current blackboard state. The 2025 paper Exploring Advanced LLM Multi-Agent Systems Based on Blackboard Architecture (Lu & Sasaki, 2025) ports the pattern directly to LLM agents — central controller posts the problem, agents volunteer based on blackboard contents, results land back on the board until consensus. The follow-up LLM-Based Multi-Agent Blackboard System for Information Discovery in Data Science (2025) reports 13–57% relative gains over baselines on data-science discovery. The relevance here: a blackboard is just working memory shared across multiple agents, and the consistency problems it surfaces — write conflicts, stale reads, write skew — are the multi-agent split-brain problems we have already named, surfaced at the working-memory layer.

Substrate 1: The chain-of-thought scratchpad

The simplest working-memory substrate. The model writes structured prose into a thought channel (the <thinking> block in extended-thinking mode, the Thought: line in classic ReAct, or an unstructured “scratch” assistant turn) and the harness either keeps that text in the buffer or extracts it into a side channel for the next turn.

When this is right. Short-horizon tasks, single-shot reasoning chains, anywhere the cost of structured state would exceed the cost of letting the model recompute. The original ReAct paper (Yao et al., 2022) uses exactly this — the thought/action/observation sequence is the scratchpad, replayed in full to the model on every step.

When this is wrong. Anything that runs past 20 turns. The scratchpad grows linearly in the chat log, falls into the eviction-middle of the short-term memory buffer, and the lost-in-the-middle effect kicks in. By turn 60 the model has the scratchpad textually but isn’t attending to most of it. The structured-state substrate below is the upgrade.

Anthropic’s “think” tool is the modern incarnation. Anthropic’s think-tool post (March 2025) describes it as “a designated space to stop and think about whether it has all the information it needs to move forward” during response generation, not before. The think tool is a chain-of-thought scratchpad with a tool boundary — the model invokes it explicitly, the contents land in a tool_use block, and the harness can choose to keep or evict it on the next turn. τ-Bench airline-domain numbers in the post show a 54% relative improvement when paired with optimized prompting. Treat it as a structured CoT scratchpad with explicit eviction control rather than something exotic.

Substrate 2: The typed state object (the dataflow-graph approach)

The harness exposes a typed state schema. The agent (model + tools) reads and writes named fields. The harness re-serializes the schema into the prompt on every model call, optionally elided to the fields the current node needs.

This is the default in LangGraph, OpenAI’s Agents SDK RunContextWrapper, and Letta’s core memory blocks.

Python: LangGraph state with overwrite vs append channels

python
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
# pip install "langgraph>=0.2" "langchain-anthropic>=0.2" "anthropic>=0.40"
import operator
from typing import Annotated, TypedDict
from langchain_core.messages import AnyMessage, HumanMessage, SystemMessage
from langchain_anthropic import ChatAnthropic
from langgraph.graph import StateGraph, START, END

# Two channels with different reducers — the dataflow-graph idea made concrete.
# `messages` accumulates (operator.add appends list-to-list).
# `plan`, `completed`, `notes` overwrite on each write; this is the working-memory tier
# the harness keeps explicit, separate from the chat buffer.
class AgentState(TypedDict):
    messages: Annotated[list[AnyMessage], operator.add]
    plan: list[str]              # overwrite — the model rewrites the plan as it learns
    completed: list[str]         # overwrite — the model emits the full list each time
    notes: Annotated[list[str], operator.add]  # append — facts pinned across turns

llm = ChatAnthropic(model="claude-opus-4-7", max_tokens=2048)

def working_memory_prompt(state: AgentState) -> SystemMessage:
    """Render the working-memory block that gets injected on every model call.
    This is the bit that distinguishes working memory from short-term memory: it's
    a deliberate, structured slab the harness assembles, not whatever the model
    happened to say in the last 20 turns."""
    lines = ["## Working memory (your private scratchpad — re-injected every turn)"]
    if state.get("plan"):
        lines.append("### Current plan")
        lines.extend(f"- {step}" for step in state["plan"])
    if state.get("completed"):
        lines.append("### Completed steps")
        lines.extend(f"- {step}" for step in state["completed"])
    if state.get("notes"):
        lines.append("### Pinned notes")
        lines.extend(f"- {note}" for note in state["notes"])
    return SystemMessage(content="\n".join(lines))

def model_node(state: AgentState) -> dict:
    """Standard LLM step. The working-memory block is prepended to the chat
    history; the model can update working memory by emitting a structured
    directive in its next tool call (omitted here for brevity)."""
    wm_block = working_memory_prompt(state)
    response = llm.invoke([wm_block] + state["messages"])
    return {"messages": [response]}

graph = StateGraph(AgentState)
graph.add_node("model", model_node)
graph.add_edge(START, "model")
graph.add_edge("model", END)
compiled = graph.compile()

# Initial state — the working-memory fields are explicit, separate from `messages`.
result = compiled.invoke({
    "messages": [HumanMessage("Refactor utils.py to remove the global state.")],
    "plan": [
        "Identify every reader of the global.",
        "Introduce a Config dataclass.",
        "Thread Config through call sites.",
        "Delete the global.",
    ],
    "completed": [],
    "notes": ["The global is named CONFIG and is set in utils.py:14."],
})

The point is the separation. messages is the conversation buffer (short-term memory). plan, completed, notes are working memory — typed, explicit, with per-field reducers chosen for the access pattern. On turn 90 the chat buffer can be aggressively trimmed; the working-memory block is re-injected in full and the agent never loses the plan. This is also how you avoid the token-accumulation bug in recursive agent loops: an overwrite reducer on plan keeps the field bounded even after 200 revisions, where an append reducer would grow linearly.

TypeScript: OpenAI Agents SDK with a typed context

The OpenAI Agents SDK (TypeScript) exposes the same idea through a typed context object accessible from tools and lifecycle hooks.

typescript
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
// pnpm add @openai/agents zod
import { Agent, Runner, tool } from "@openai/agents";
import { z } from "zod";

// The working-memory shape lives in TypeScript, not in the prompt.
interface WorkingMemory {
  plan: string[];
  completed: string[];
  notes: string[];
}

const initialMemory: WorkingMemory = {
  plan: [
    "Identify every reader of the global.",
    "Introduce a Config dataclass.",
    "Thread Config through call sites.",
    "Delete the global.",
  ],
  completed: [],
  notes: ["The global is named CONFIG and is set in utils.ts:14."],
};

// Tools receive the typed context and can mutate working memory directly.
const markCompleted = tool({
  name: "mark_completed",
  description: "Mark a plan step as done; moves it from plan to completed.",
  parameters: z.object({ step: z.string() }),
  async execute({ step }, runContext) {
    const mem = runContext.context as WorkingMemory;
    mem.plan = mem.plan.filter((s) => s !== step);
    mem.completed.push(step);
    return `Marked completed: ${step}`;
  },
});

const addNote = tool({
  name: "add_note",
  description: "Pin a fact in working memory.",
  parameters: z.object({ note: z.string() }),
  async execute({ note }, runContext) {
    (runContext.context as WorkingMemory).notes.push(note);
    return `Pinned: ${note}`;
  },
});

// The agent's instructions are dynamic: the working-memory block is rendered
// from the typed context every turn. This is the key move — the prompt always
// shows the latest working memory, no matter how trimmed the conversation
// buffer is.
const refactorAgent = new Agent<WorkingMemory>({
  name: "refactor-agent",
  model: "gpt-5",
  tools: [markCompleted, addNote],
  instructions: (runContext, agent) => {
    const mem = runContext.context;
    return [
      "You refactor TypeScript codebases. Use the working-memory tools to track progress.",
      "",
      "## Working memory (your private scratchpad)",
      "### Plan",
      ...mem.plan.map((s) => `- ${s}`),
      "### Completed",
      ...(mem.completed.length ? mem.completed.map((s) => `- ${s}`) : ["- (none yet)"]),
      "### Pinned notes",
      ...mem.notes.map((n) => `- ${n}`),
    ].join("\n");
  },
});

const result = await Runner.run(refactorAgent, "Continue the refactor.", {
  context: initialMemory,
});

The instructions function is the working-memory renderer. The typed WorkingMemory is the actual state. The two are coupled by the agent’s contract: tools mutate the typed state, the instructions render it back into the prompt on every turn. This is the typed-dataflow-graph substrate in OpenAI-Agents-SDK terms. The official cookbook Context Engineering for Personalization is the production-grade reference for the pattern.

Substrate 3: The external notebook (a memory tool)

The model invokes a tool to write to a file or key-value store outside the prompt. Reads happen via the same tool on demand. The notebook persists across turns and, with a backing store, across sessions — at which point it starts to overlap with long-term memory (the next article in the subtree).

Anthropic’s memory tool, released as the memory_20250818 server-side tool type, is the canonical reference. The tool exposes view, create, str_replace, insert, delete, and rename commands operating on a client-side /memories directory. The harness implements the file operations locally; the model invokes them via tool calls. The auto-injected system-prompt instruction tells the model to always view the memory directory before doing anything else and to record progress as it goes — explicitly framing the tool as a working-memory substrate that survives both context-window eviction and full-session resets.

python
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
# pip install "anthropic>=0.40"
# Minimal notebook substrate using Anthropic's memory tool.
from pathlib import Path
import anthropic

MEMORY_ROOT = Path("./agent_memory").resolve()
MEMORY_ROOT.mkdir(exist_ok=True)

def handle_memory_tool(tool_input: dict) -> str:
    """Client-side handler for the memory_20250818 tool. Path-traversal-safe."""
    cmd = tool_input["command"]
    # Validate every path stays under MEMORY_ROOT — the Anthropic docs flag
    # this as the single most important safeguard.
    def safe_path(raw: str) -> Path:
        p = (MEMORY_ROOT / raw.lstrip("/")).resolve()
        if not p.is_relative_to(MEMORY_ROOT):
            raise ValueError(f"Path escapes memory root: {raw}")
        return p

    if cmd == "view":
        p = safe_path(tool_input["path"])
        if p.is_dir():
            entries = "\n".join(f"{f.stat().st_size}\t/{f.relative_to(MEMORY_ROOT)}"
                                for f in sorted(p.iterdir()))
            return f"Directory listing for {tool_input['path']}:\n{entries}"
        return p.read_text()
    if cmd == "create":
        safe_path(tool_input["path"]).write_text(tool_input["file_text"])
        return f"File created at {tool_input['path']}"
    if cmd == "str_replace":
        p = safe_path(tool_input["path"])
        text = p.read_text()
        if tool_input["old_str"] not in text:
            return f"No replacement: old_str not found in {tool_input['path']}"
        p.write_text(text.replace(tool_input["old_str"], tool_input["new_str"], 1))
        return "The memory file has been edited."
    # Implement insert/delete/rename similarly in production.
    raise ValueError(f"Unknown memory command: {cmd}")

client = anthropic.Anthropic()

# The tool registration is one line; Anthropic injects the working-memory
# protocol into the system prompt automatically.
def run_agent_turn(messages: list[dict]) -> dict:
    return client.messages.create(
        model="claude-opus-4-7",
        max_tokens=2048,
        messages=messages,
        tools=[{"type": "memory_20250818", "name": "memory"}],
    )

# The harness loop dispatches memory tool calls back to the local handler.
# Loop omitted for brevity — see anthropic-sdk-python/examples/memory/basic.py
# for a full runnable version.

The trade with the notebook substrate is explicitness for latency. Every read costs a tool round-trip. The typed-state-object substrate is free to read (it’s in the prompt every turn); the notebook is paid-per-read. The right call depends on working set size: if the working memory fits in a 2KB block, render it every turn and skip the tool; if it grows past tens of KB and most of it is irrelevant to most steps, the notebook lets the agent page in only what it needs — the same JIT-vs-AOT context engineering decision, applied at the working-memory layer instead of the retrieval layer.

Letta’s core memory blocks implement the same idea with a different boundary: a fixed set of small, addressable working-memory blocks (persona, human, scratchpad) that the agent edits through core_memory_append and core_memory_replace tools, with the contents always visible in the prompt. Same pattern, different ergonomics — Letta’s blocks are always in-context; Anthropic’s /memories directory is in-context only on demand.

Substrate 4: The shared blackboard

When working memory is shared across multiple agents, the substrate becomes a blackboard. The 1971 Hearsay-II architecture is the canonical reference: a global typed store, multiple knowledge sources reading and writing concurrently, a scheduler activating the next knowledge source based on board state.

The 2025 LLM blackboard papers replay the same pattern with LLMs as the knowledge sources: a central controller posts the problem onto a shared board, sub-agents inspect the board and self-nominate based on what’s relevant to their role, results land back on the board, the loop continues until the controller decides consensus has been reached. The reported gains over baselines come from two places — agents self-selecting based on board state (no central router), and shared visibility cutting redundant work across agents.

The implementation surface is what you’d expect from a distributed shared-memory system: every read needs a stable snapshot; every write needs conflict resolution (last-writer-wins is the floor; CRDT-style merges are the ceiling); the split-brain failure mode from the multi-agent orchestration article is the same disease here, applied to the working-memory layer instead of the messaging layer. The deeper treatment of shared agent memory lands later in the subtree under multi-agent shared memory; for now the takeaway is that the blackboard is the multi-writer specialization of the typed-state-object substrate, with concurrency control bolted on.

Trade-offs, failure modes, and gotchas

The double-injection bug. The most common production failure with structured working memory. The model’s previous turn included a thought-process scratchpad (“Plan: …”). The harness also renders a structured working-memory block into the system prompt. Now the model sees the plan twice — once verbatim from its own previous output (still in the chat buffer), once from the harness’s rendered block. The model gets confused about which is authoritative and the two diverge across turns. The fix: when the harness manages working memory, it should strip or summarize the model’s raw scratchpad output from the buffer (or never let it land in the buffer in the first place — the typed-state pattern with tool-call-only writes does this automatically).

The unbounded scratchpad bug. Working memory with no eviction policy is just a slower context-length crash. An append reducer on notes with no pruning grows linearly; by turn 200 the working-memory block consumes more tokens than the chat buffer. The fix is a maintenance pass — either a model-driven compact_notes tool that the agent invokes when it sees the list getting long, or a harness-level periodic summarization. The conversation compaction article (coming later in the Agents subtree) covers the harness side; the working-memory equivalent is the same problem at a finer grain.

The state-vs-buffer race. When the typed state and the conversation buffer disagree (the buffer says the agent completed step 3, the state.completed list doesn’t have it), which wins? Always make one of them the source of truth and treat the other as derived. The typed state is almost always the right source of truth for working memory; the buffer is the IO log. If the model says “I completed step 3” but never invoked mark_completed, the harness should not retroactively update state — the missing tool call is the signal.

The “I’ll just dump JSON in the system prompt” anti-pattern. Tempting because it’s the lowest-effort way to get a structured scratchpad. Two failure modes. First, the JSON ends up displayed to the model as text; the model attends to it like prose, including all the syntactic noise. Second, when the structure grows, the JSON gets long and unscannable. The typed-state approach renders the working memory as markdown headed lists (as in the LangGraph example above) precisely because the model attends to short, headed sections much better than to dense JSON.

The persistence-vs-recomputation question. Some working-memory items are cheap to recompute (the agent can re-read a file); some are expensive (the agent solved a tricky off-by-one and the reasoning trace took 30 seconds). Persist the expensive ones, let the cheap ones evict. The skill-library treatment in Voyager (Wang et al., 2023) is the same pattern at a different time-scale — Voyager persists successful code and re-uses it across tasks, which is procedural memory; working memory is the within-task version of “remember the expensive intermediate.”

The blackboard concurrency gotcha. A multi-agent system with a shared blackboard but no write coordination will produce write-skew bugs that read like model hallucinations. Agent A reads the board and sees an open question. Agent B reads the same board, sees the same open question, both answer it independently, both post answers, the next reader sees two answers and is confused about which is current. The fix is the same set of techniques that work on any distributed mutable state: optimistic concurrency with version stamps, or pessimistic locking with a coordinator, or CRDT-style merges. The 2025 blackboard papers above all use a coordinator-mediated approach for exactly this reason.

The cache-invalidation trap. Rendering working memory into the system prompt every turn is the natural pattern — but it also invalidates the prompt cache on every turn because the system prompt content changes. Two mitigations: (1) put the working-memory block after the static portion of the system prompt and after any cache breakpoints so only the tail of the prompt is cache-cold, or (2) keep the rendered working memory in a dedicated user message just before the latest turn (some providers cache up to the last breakpoint regardless of where it sits). The Anthropic memory tool side-steps this entirely — the memory contents only enter the prompt via tool results, not via the static system prompt, so the system prompt stays cache-hot.

Further reading

  • Long-Term Memory: Vector-Backed Episodic Storage — the direct sequel. Working memory is in-context and task-scoped; long-term memory is the durable, vector-indexed store that survives the session boundary and gets retrieved on demand. The promotion path from working memory at task end into long-term episodic storage is where this article hands off.
  • Short-Term Memory: Managing the Conversation Buffer — the sibling article. Working memory sits above the conversation buffer; reading both together is how the in-context tier becomes legible. The buffer is the IO log; the scratchpad is the program state.
  • The Cognitive Taxonomy: Semantic, Episodic, Procedural — the upstream piece. Working memory is the in-context tier of the four-type taxonomy; this article specialized it into runnable substrates.
  • The Memory Stack: A Map of AI Memory — the parent article. Working memory is one of the two halves of the in-context layer; the storage layer is where the next several articles head.