
The Cognitive Taxonomy: Semantic, Episodic, Procedural

A close read of the four cognitive memory types — working, episodic, semantic, procedural — and the CPU cache hierarchy each one maps onto.

Jatin Bansal

A coding agent ships on Monday with a single in-context buffer and a vector store called “memory.” On Wednesday a senior engineer asks why the bot keeps re-deriving the same five-step deployment ritual it has executed forty times this week. On Thursday a PM asks why the bot re-asks every new user whether their codebase uses pnpm or npm. On Friday a junior engineer asks why the bot just told the new joiner that the team uses Postgres, when the team migrated to Cassandra three months ago. Three complaints, three different memory failures, all routed to the same vector store because the team lacked the vocabulary to tell them apart. The Wednesday complaint is a missing procedural memory. The Thursday complaint is a missing semantic store. The Friday complaint is a missing conflict-resolution policy on the semantic store. Without separate names for these layers, every disagreement on the team becomes “we need better memory” and every fix is the wrong fix. The cognitive taxonomy (working, episodic, semantic, procedural) is how you stop talking past each other.

Opening bridge

Yesterday’s piece, The Memory Stack: A Map of AI Memory, drew the full map: the in-context vs storage line, the four cognitive types, the read/write/maintain axes. Today’s piece zooms in on the four types themselves. Each one has its own write pattern, its own read pattern, its own substrate, and its own failure mode, and a tight set of systems analogies, anchored in the CPU cache hierarchy, makes the choices legible. Working memory is L1. Episodic memory is the write-ahead log. Semantic memory is the keyed fact store. Procedural memory is the JIT-compiled-routine cache. If you internalize those four parallels, the rest of the Memory subtree clicks into place; if you don’t, every later article reads like jargon.

Where the taxonomy comes from

The four-type frame is not a 2023 invention. It is a cognitive-science distillation that the LLM literature inherited largely intact, and knowing the lineage helps you reason about the edge cases.

Endel Tulving’s 1972 episodic-vs-semantic distinction drew the first sharp line: episodic memory is autobiographical and context-bound (it has a when and a where); semantic memory is factual and context-free. The cup of coffee you had on Tuesday morning at the corner café is episodic; “coffee contains caffeine” is semantic. Tulving’s later five-system taxonomy added procedural memory, the perceptual representation system (priming), and primary (working) memory, but the episodic/semantic line remained the load-bearing one.

Larry Squire’s 1980s declarative-vs-non-declarative taxonomy added the second cut: declarative memory (episodic + semantic) is what you can say; non-declarative memory (procedural + priming) is what you can do. Procedural memory is “knowing how”: bike riding, touch typing, the muscle memory of the deployment script you don’t have to think about any more. The brain-systems evidence is strong: amnesic patients can lose the ability to form new declarative memories (they can’t remember anything new from yesterday) while still learning new motor skills, which points to procedural memory running on largely separate brain systems.

The CoALA paper — Cognitive Architectures for Language Agents (Sumers, Yao, Narasimhan, Griffiths, 2023) — ported this vocabulary into LLM engineering by collapsing Tulving’s five into four and giving each a precise computational role: working memory for the current call’s scratchpad, episodic memory for past observations indexed by time, semantic memory for distilled facts about the world, procedural memory for cached action sequences. The split matters because each cell has different storage characteristics — and choosing the wrong substrate for a given cell is the most common memory design mistake in production agents. The recent SOAR architecture overview (Laird, 2022) makes the same split explicit on the symbolic-AI side and has done so since the 1980s; the LLM world is rediscovering what cognitive architectures already had.

The distributed-systems parallel: the CPU cache hierarchy

Before walking through each type, here is the parallel that holds the whole frame together. The memory hierarchy of a modern machine, from the CPU caches down to the on-disk log, has four conceptually distinct cells, and they map one-to-one onto CoALA’s four memory types.

| Memory type | Cache parallel | Capacity | Latency | Persistence | Write rate |
| --- | --- | --- | --- | --- | --- |
| Working memory | L1 data cache | Tiny (a few KB) | Sub-nanosecond | Vanishes on context flush | Constant during a task |
| Episodic memory | Disk / WAL | Effectively unbounded | Slow (a retrieval round-trip) | Permanent unless GC’d | Append-only, every turn |
| Semantic memory | L2/L3 keyed store | Bounded | Fast lookup if keyed; slow if searched | Permanent | Slow, distillation-gated |
| Procedural memory | L1 instruction cache / JIT-compiled-code cache | Small | Fast retrieval, expensive to populate | Permanent | Very slow, success-gated |

The CPU cache hierarchy builds its design around the trade-off between speed, size, and the cost of populating each level. L1 is 32–64 KB and roughly four cycles away; L3 is tens of megabytes and forty-plus cycles away; the L1 instruction cache is separate from the L1 data cache because instructions and data have different access patterns and benefit from different prefetch policies. Memory systems for agents have the same shape for almost the same reasons, and the separate-instruction-cache observation is the single most under-appreciated parallel in agent design. The model’s reasoning is data; the agent’s cached procedures are instructions; they want different homes.

Working memory: the L1 data cache

What it is. The scratchpad the agent uses during the current task: the partial plan being assembled, the intermediate tool results not yet committed elsewhere, the reasoning trace the model is iterating on. Lives in-context by default and is gone the moment the context window flushes. CoALA’s working memory is essentially “the bytes the model is staring at right now that aren’t part of long-term storage” — the running variables of the agent’s mental program.

Substrate. Almost always the prompt itself. In a ReAct agent the working memory is the thought/action/observation history; in a plan-and-execute agent it is the structured Plan object plus the executor’s running state. Some frameworks (LangGraph’s State, OpenAI Agents SDK’s session state, Letta’s core memory blocks) make working memory an explicit object in the harness rather than implicit in the prompt — but it still gets serialized into the prompt before every model call. The boundary is the same as the in-context vs storage boundary from yesterday: working memory is the in-context side. The working memory deep dive later in this subtree walks through each substrate (chain-of-thought scratchpad, typed dataflow graph, external notebook, multi-agent blackboard) with runnable code.

Write pattern. Constant during a task. Every tool result, every model output, every observation lands in working memory by default. The harness’s job is to keep working memory below the token budget; the context engineering article covered the JIT-vs-AOT mechanics that decide what stays in the working set and what gets paged out.

Read pattern. Implicit. The model reads working memory because the model reads the prompt; there is no separate retrieval step. Latency is the latency of the next attention pass.

Failure mode. Eviction without persistence. The classic anti-pattern is letting the harness silently truncate the oldest messages and assuming nothing important was in them. The deployment script the agent figured out in turn 12 gets evicted by turn 80 and the agent re-derives it from scratch in turn 81 — which is the Wednesday complaint from the hook (partially; procedural memory is the more direct fix, see below). The fix is not “make working memory bigger.” The fix is to commit important working-memory items into a more durable tier — episodic or procedural — before eviction happens. Working memory’s job is speed, not persistence; using it as the storage tier is using L1 as your hard drive.
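
The commit-before-evict idea can be sketched in a few lines. This is a minimal illustration, not any framework's API: the `is_important` predicate, the `looks_like_procedure` rule, and the plain list standing in for the durable tier are all assumptions made for the example.

```python
from typing import Callable

Message = dict[str, str]


def evict_with_commit(
    history: list[Message],
    budget: int,
    is_important: Callable[[Message], bool],
    durable: list[Message],
) -> list[Message]:
    """Trim history to the newest `budget` messages, persisting important evictees first."""
    cut = max(0, len(history) - budget)
    evicted, kept = history[:cut], history[cut:]
    # Commit before dropping: anything flagged important survives in the durable tier.
    durable.extend(m for m in evicted if is_important(m))
    return kept


def looks_like_procedure(message: Message) -> bool:
    """Hypothetical importance rule: treat 'a -> b -> c' text as a worked-out procedure."""
    return "->" in message["content"]
```

In a real harness the predicate would more likely be a classifier or a model-emitted flag; the point is only that the durable write happens before the truncation, not after.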

The next article in this subtree, short-term memory, goes deep on the conversation buffer specifically — truncation policies, sliding windows, message-level vs token-level eviction, headroom budgeting. The one after that, working memory: scratchpads, blackboards, and agent notebooks, covers the structured side of the in-context tier as a richer abstraction.

Episodic memory: the write-ahead log

What it is. “What happened, when, in what context.” An append-only, time-ordered record of past interactions, observations, and outcomes. The user’s question from Tuesday is an episode; the tool result you logged at 03:14 is an episode; the assistant turn where the agent committed a decision is an episode. Each episode carries enough context to be replayable in isolation — at minimum: the actor (user/assistant/tool), the timestamp, the content, and the session ID. Better systems also store an importance score and a coarse-grained “what kind of thing is this” tag for filtered retrieval. The long-term memory article is the deep dive on this tier as it actually ships in production — episode boundaries, write gates, the recency-weighted read path, and the framework choices that go with it.

Substrate. Almost universally a vector store with structured metadata: the text gets embedded for similarity search, and the metadata (timestamp, session, role, importance) supports filtering and recency-weighted ranking. The vector databases article covered the substrate choices; for episodic memory the salient knobs are filtered-ANN support and the cost of metadata-heavy queries. Mem0 keeps its episodic store in Qdrant or Chroma; Zep uses a hybrid Postgres + Neo4j layout where the temporal graph carries the metadata; Letta calls this layer “recall memory” and backs it with the conversation message buffer plus an embedded archive.

Write pattern. Append-only, every meaningful turn. The closest distributed-systems analogue is the write-ahead log: monotonic, ordered, replayable, never updated in place. Updates are achieved by appending a correction episode, not by rewriting the original — the same pattern as an event-sourced ledger, and for the same reason: the historical record is the source of truth, and rewriting history makes audit and rollback impossible. The write itself is usually cheap: embed the text, insert the row, done. The expensive part is the filtering — deciding which turns are worth recording at all. A naive “store every turn” policy works for a while and then falls over around the 10k-episode mark, when retrieval signal-to-noise collapses.
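
To make the append-a-correction pattern concrete, here is a small in-memory sketch. The `EpisodeLog` class and its `corrects` field are hypothetical names for illustration; a production store would persist entries and carry timestamps and embeddings as well.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Episode:
    seq: int                        # monotonic position in the log
    role: str
    text: str
    corrects: Optional[int] = None  # seq of the episode this one supersedes


class EpisodeLog:
    """Append-only log: entries are never edited or removed."""

    def __init__(self) -> None:
        self._entries: list[Episode] = []

    def append(self, role: str, text: str, corrects: Optional[int] = None) -> Episode:
        if corrects is not None and not 0 <= corrects < len(self._entries):
            raise ValueError("correction must point at an existing episode")
        episode = Episode(len(self._entries), role, text, corrects)
        self._entries.append(episode)
        return episode

    def history(self) -> tuple[Episode, ...]:
        """The full record, including superseded entries, for audit and replay."""
        return tuple(self._entries)

    def current_view(self) -> list[Episode]:
        """Entries that no later correction has superseded."""
        superseded = {e.corrects for e in self._entries if e.corrects is not None}
        return [e for e in self._entries if e.seq not in superseded]
```

Reads that feed the prompt would use `current_view()`, while reflection and audit jobs would use `history()`, so the original claim stays replayable even after it has been corrected.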

Read pattern. Recency-weighted similarity search, almost always. The canonical formulation is the Generative Agents memory-retrieval score (Park et al., 2023): score = α·recency + β·importance + γ·similarity, each term normalized to [0, 1], with exponential time decay on recency and an LLM-rated importance score assigned at write time. The signals are orthogonal — pure cosine retrieves on “what’s textually close,” recency biases toward what just happened, importance biases toward what the agent flagged as load-bearing. A later article in this subtree works through the formula in detail; for now the principle: episodic retrieval is not RAG retrieval, even when the substrate is the same vector store, because the read signals include time and salience.
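
The weighted-sum score above can be sketched as follows. This is an illustration under stated assumptions rather than the paper's code: each episode is a dict with hypothetical `ts`, `importance`, and precomputed `similarity` fields, and the 0.995 hourly decay is an illustrative default.

```python
def min_max(values: list[float]) -> list[float]:
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def score_episodes(
    episodes: list[dict],
    now: float,
    alpha: float = 1.0,
    beta: float = 1.0,
    gamma: float = 1.0,
    decay_per_hour: float = 0.995,
) -> list[tuple[float, dict]]:
    """Rank episodes by alpha*recency + beta*importance + gamma*similarity."""
    if not episodes:
        return []
    # Exponential time decay: older episodes get a smaller raw recency value.
    recency = min_max([decay_per_hour ** ((now - e["ts"]) / 3600) for e in episodes])
    importance = min_max([e["importance"] for e in episodes])
    similarity = min_max([e["similarity"] for e in episodes])
    scored = [
        (alpha * r + beta * i + gamma * s, e)
        for r, i, s, e in zip(recency, importance, similarity, episodes)
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Setting `gamma=0` turns this into a pure recency-and-importance ranking and setting `alpha=beta=0` recovers plain similarity search, which is a quick way to see how much each signal contributes on a given corpus.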

Failure mode. Two big ones. First, write amplification: storing every utterance bloats the corpus, kills retrieval precision, and grows infrastructure costs proportionally. The fix is an explicit write policy: a classifier (heuristic or learned) that decides which turns earn an episode. Second, temporal staleness: a fact mentioned in March about the team using Postgres is still in the episodic store in May after the team migrated to Cassandra, and recency-weighted retrieval ranks the newer episode first, but the older one can still land in the prompt and bias the answer. This is the Friday complaint from the hook; the fix is not in episodic memory at all but in the semantic memory’s conflict-resolution policy, which we’ll get to in the next section.
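
A heuristic write gate might look like the sketch below. The keyword lists, the `should_record` name, and the choice to always log tool results are assumptions for illustration; a learned classifier would replace the regexes in practice.

```python
import re

# Hypothetical cues that a turn carries something durable.
DURABLE_HINTS = re.compile(
    r"\b(always|never|prefer|we use|migrated|deadline|decided)\b", re.IGNORECASE
)
# Hypothetical small-talk turns that are not worth an episode.
SMALL_TALK = re.compile(r"^(hi|hello|thanks|thank you|ok|okay|got it)[.!]?$", re.IGNORECASE)


def should_record(role: str, text: str, min_chars: int = 12) -> bool:
    """Decide whether a turn earns an episode in the episodic store."""
    stripped = text.strip()
    if role == "tool":
        return True   # assumption: tool results are always logged
    if SMALL_TALK.match(stripped):
        return False  # acknowledgements add noise, not signal
    if DURABLE_HINTS.search(stripped):
        return True   # short but durable statements still get stored
    return len(stripped) >= min_chars
```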

The write-ahead-log parallel matters operationally. A good episodic store has the same properties as a good WAL: bounded write latency, ordered reads, replay-ability, retention policies that align with the workload’s recovery requirements. Treating it like a normal database — with updates and deletes happening in place — destroys the replay-ability that makes the store useful for reflection and audit later.

Semantic memory: the keyed fact store

What it is. “What is true about the world that this agent should know.” Distilled, context-free facts extracted from episodes (or seeded from outside the loop). “The user prefers dark mode.” “The API key rotates every 90 days.” “The team uses Cassandra as of March 2026.” Semantic memory is what makes the agent feel like it knows the user/system/domain rather than constantly relearning them.

Substrate. Variable, and the choice matters more than for episodic memory. Three common shapes:

  • Key-value or document store for explicitly keyed facts: user_preferences.theme = "dark", infra.database = "cassandra". Read pattern is direct lookup, which is the fast path. Mem0’s “facts” layer and Letta’s core-memory blocks (the user/persona/task blocks pinned in-context) are this shape. Best for facts the agent will read on every call.
  • Vector store with structured tags for facts whose keys are not known in advance: “things the user has expressed an opinion about,” “things about the deployment environment.” Read pattern is similarity search; latency is a retrieval round-trip. Best when the corpus of facts is large enough that pinning all of them in-context isn’t tractable.
  • Knowledge graph when relationships between facts matter: who reports to whom, which service depends on which, which preference was stated by which family member. Graphiti and Zep’s graph backend take this approach. The graph-memory article goes deep on when graphs beat vectors; the short version is “when your facts have entity-relation structure, when temporal point-in-time queries matter, or when you need to traverse multi-hop relations.”

Write pattern. Slow, distillation-gated. A new semantic fact gets written when the agent (or a write-time classifier) decides an episode contains a durable, generalizable claim worth lifting out. The write is usually a model call: take the recent episode, return a JSON object of the form {"key": "...", "value": "...", "confidence": 0.8}. The Mem0 paper (Chhikara et al., 2025) shows empirically that aggressive write-time filtering is where most of the recall accuracy comes from, and that the naive “extract every fact you can” policy actively hurts downstream retrieval because the corpus fills with low-value claims.
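
Because the extraction arrives as model-generated JSON, it is worth validating before anything touches the store. A minimal sketch, assuming the key/value/confidence shape shown above; the `parse_fact` name and the 0.7 confidence floor are illustrative choices.

```python
import json
from typing import Optional


def parse_fact(raw: str, min_confidence: float = 0.7) -> Optional[dict]:
    """Validate a model-emitted fact; return None if it should not be written."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    key, value, conf = obj.get("key"), obj.get("value"), obj.get("confidence")
    if not isinstance(key, str) or not key.strip():
        return None
    if not isinstance(value, str) or not value.strip():
        return None
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(conf, bool) or not isinstance(conf, (int, float)):
        return None
    if not 0.0 <= conf <= 1.0 or conf < min_confidence:
        return None
    # Normalize the key so "Infra.Database" and "infra.database" collide on purpose.
    return {"key": key.strip().lower(), "value": value.strip(), "confidence": float(conf)}
```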

Read pattern. Two-mode. Keyed facts are read on every relevant call (the agent’s name, the user’s known preferences, the active task: what Letta pins as “core memory”). Searched facts are retrieved on demand by similarity. The two modes are not interchangeable: a fact that needs to be present on every call should be pinned, not searched, because the recall@k of similarity search is not guaranteed to reach 1.0. A common production mistake is putting the user’s name in the searchable store and then wondering why the agent occasionally addresses them by the wrong name.

Failure mode. Three.

  1. Stale facts. “The team uses Postgres” written in March is still present in May after the migration. The Friday-hook complaint. The fix is a conflict-resolution policy at write time: when a new fact for an existing key arrives, the system has to decide whether to overwrite, append-with-timestamp, or flag for human review. Most production systems are too permissive here; they let the latest write win silently, which means a single hallucinated extraction can corrupt a long-lived fact. A more defensible default is timestamped versions with a confidence threshold for overwrite — the new value has to clear a confidence bar, and the old value is retained as a historical version.
  2. Over-extraction. Storing every observation as a “fact” pollutes the store with episodes that should have been logged in episodic memory. A good write policy classifies the extraction type: is this a durable fact (semantic), an event (episodic), or both? Most things are both, but the primary home matters because reads against the wrong store retrieve the wrong shape of result.
  3. The cold-start problem. A fresh agent has no semantic memory; it needs to bootstrap from somewhere. Three common starting points: seed from a static knowledge base (the customer’s product documentation), seed from the user’s existing profile (their CRM record), or accept the cold-start cost and let the agent learn from interactions. The choice depends on the use case; the trap is pretending the cold-start problem doesn’t exist.
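
The timestamped-versions default from the first failure mode can be sketched as a small in-memory store. The `VersionedFacts` class, its 0.8 threshold, and the review queue are assumptions for illustration, not a description of how any particular memory framework resolves conflicts.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FactVersion:
    value: str
    confidence: float
    ts: float


class VersionedFacts:
    """Keyed facts where a conflicting write must clear a confidence bar."""

    def __init__(self, overwrite_threshold: float = 0.8) -> None:
        self.overwrite_threshold = overwrite_threshold
        self._versions: dict[str, list[FactVersion]] = {}
        self.review_queue: list[tuple[str, FactVersion]] = []

    def write(self, key: str, value: str, confidence: float,
              ts: Optional[float] = None) -> str:
        version = FactVersion(value, confidence, time.time() if ts is None else ts)
        versions = self._versions.setdefault(key, [])
        if not versions:
            versions.append(version)
            return "created"
        if versions[-1].value == value:
            return "unchanged"
        if confidence >= self.overwrite_threshold:
            versions.append(version)  # the old value stays behind as history
            return "overwritten"
        # Low-confidence conflict: keep the current value, park the claim for review.
        self.review_queue.append((key, version))
        return "flagged"

    def current(self, key: str) -> Optional[str]:
        versions = self._versions.get(key)
        return versions[-1].value if versions else None

    def history(self, key: str) -> list[FactVersion]:
        return list(self._versions.get(key, []))
```

With this shape, a single low-confidence extraction claiming the team moved databases lands in `review_queue` instead of silently replacing a long-lived fact, and a confident one still leaves the Postgres version readable for point-in-time questions.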

Procedural memory: the JIT-compiled-routine cache

What it is. “How to do X.” Cached action sequences that succeed for a class of tasks. “To deploy this service, do checkout → build → migrate → push → verify.” “To onboard a new user, do welcome-message → preference-collection → first-task-suggestion.” Procedural memory is the agent’s learned skills, separated from both the conversation history (episodic) and the world facts (semantic) because they have a fundamentally different lifecycle: they’re written rarely, after success has been observed; they’re read frequently, every time a task with the right shape appears.

Substrate. Almost always a separate index — distinct from episodic and semantic — keyed by task description embeddings, valued by the code/prompt/plan that succeeds for that task. Voyager (Wang et al., 2023), the Minecraft lifelong-learning agent, is the cleanest production-shaped example: its skill library is JavaScript functions indexed by the embedding of their natural-language descriptions, and on each new task the top-5 most-relevant skills get retrieved and injected into the prompt. The result is striking: Voyager unlocks Minecraft tech-tree milestones up to 15.3× faster than prior agents, and the skill library generalizes to fresh Minecraft worlds. The mechanism is exactly the JIT-compiled-routine cache from a virtual machine: compile the hot path once, retrieve it next time, skip the cost of re-deriving it.

In CoALA’s framing, procedural memory can also include the agent’s own prompt template — the system prompt is itself procedural knowledge about how to behave. This is the bit that mostly lives in the LLM’s parameters (post-training), but a layer of overridable procedural memory on top of the parameters is what lets a single base model power many specialized agents without retraining. The SOAR architecture’s “chunking” mechanism — its primary learning mechanism, compiling successful problem-solving subgoals into reusable production rules — is the cognitive-architecture precursor; agent skill libraries are the LLM-era port.

Write pattern. Slow and success-gated. A procedure is only worth caching after you’ve seen it succeed — once is a coincidence, twice is suggestive, three times is a pattern. A defensible write policy is to record candidate procedures on success (with the task description, the action sequence, and a usage counter starting at 1) and promote them to the retrieval-eligible tier only after they’ve been used or re-derived N times. Voyager actually skips the “see it three times” check and writes every successful skill; this works in their setting because Minecraft tasks are deterministic enough that one success is strong signal. In a fuzzier domain (customer-support workflows, code-review heuristics) you want a higher bar.
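
A sketch of the record-then-promote policy, under simplifying assumptions: candidates are matched by exact description and step sequence (a real store would match by embedding similarity), and the `ProcedureStore` name and `promote_after=3` default are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    description: str
    steps: tuple[str, ...]
    successes: int = 1  # usage counter starts at 1 on the first observed success


class ProcedureStore:
    """Candidates become retrieval-eligible only after repeated success."""

    def __init__(self, promote_after: int = 3) -> None:
        self.promote_after = promote_after
        self._by_key: dict[tuple[str, tuple[str, ...]], Candidate] = {}

    def record_success(self, description: str, steps: list[str]) -> Candidate:
        key = (description.strip().lower(), tuple(steps))
        candidate = self._by_key.get(key)
        if candidate is None:
            candidate = Candidate(description, tuple(steps))
            self._by_key[key] = candidate
        else:
            candidate.successes += 1
        return candidate

    def retrieval_eligible(self) -> list[Candidate]:
        return [c for c in self._by_key.values() if c.successes >= self.promote_after]
```

Setting `promote_after=1` reproduces the write-on-first-success behaviour described for Voyager; fuzzier domains would raise it.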

Read pattern. Fast retrieval at the recognition moment — when the agent receives a new task, the harness runs a similarity search against the procedural store before the model starts thinking, and injects the top matches as candidate plans. This is the JIT inlining step. The latency of the retrieval is amortized against the much larger latency of the model re-deriving the procedure from scratch.

Failure mode. Two.

  1. Brittleness to task drift. A cached procedure is a snapshot of “what worked last time,” and the environment may have changed. A deployment script that worked in March may fail in May because the build system was upgraded. The fix is graceful fallback: when a retrieved procedure fails, the agent should drop back to first-principles reasoning and (if successful) update the procedural store with the new variant. This is the agent equivalent of cache invalidation, and it has the same hard problem at the core: knowing when the cached version has gone stale.
  2. Over-generalization. A procedure that worked for one user’s deployment doesn’t necessarily work for another’s. The fix is good keying — the embedding the procedure is indexed by needs to capture the task shape, not just the surface words. “Deploy the user-service” and “deploy the billing-service” should retrieve similar procedures; “deploy the user-service to prod” and “deploy the user-service to staging” should retrieve different ones. The keying scheme is the design decision that makes or breaks the procedural store.
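
The graceful-fallback fix from the first failure mode might be wired up as below. The `execute`, `derive`, and `save` callables are hypothetical stand-ins for the tool runner, the model's first-principles planning, and the procedural-store write.

```python
from typing import Callable, Optional, Sequence

Steps = Sequence[str]


def run_with_fallback(
    task: str,
    cached: Optional[Steps],
    execute: Callable[[Steps], bool],
    derive: Callable[[str], Steps],
    save: Callable[[str, Steps], None],
) -> tuple[bool, str]:
    """Try the cached procedure first; on failure, re-derive and refresh the cache."""
    if cached is not None and execute(cached):
        return True, "cached"
    fresh = derive(task)
    if execute(fresh):
        save(task, fresh)  # replace the stale variant with the one that just worked
        return True, "derived"
    return False, "failed"
```

The sketch only detects staleness when a run actually fails, which is the cheap half of cache invalidation; spotting a procedure that still runs but no longer does the right thing needs a separate verification step.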

The Wednesday complaint from the hook — “why does the bot keep re-deriving the deployment ritual?” — is purely a missing procedural-memory layer. Episodic memory has the deployment events; semantic memory has the facts about what the deployment touches; neither of those is the same as “the sequence of tool calls that succeeded last time and should run again.” That sequence is procedural, and it wants its own home.

How the four types interact in practice

The clean theoretical separation hides a messier production reality: most agent systems back two or three of the four types with the same physical vector store, and only differentiate them by metadata tags. Mem0 stores episodic and semantic in the same Qdrant collection with type tags. Letta separates core memory (the in-context part of working memory) from recall memory (episodic) from archival memory (semantic + sometimes procedural). Voyager keeps procedural memory in a separate skill index. None of these systems perfectly reflect the cognitive taxonomy; all of them at least name the types they support, which is the actual point. The taxonomy is for vocabulary and design, not necessarily for separate physical stores.

The interactions matter more than the storage layout. Three pairs to keep straight:

  • Episodic feeds semantic via reflection. Higher-order semantic claims get extracted from accumulated episodes. The agent reads a hundred episodes about the user’s interactions and writes a semantic fact “this user is technical, prefers terse responses, dislikes emoji.” Reflection is the maintenance step that turns episodic raw material into semantic distillate. A later article in this subtree covers reflection specifically; for now: an agent that has episodic memory but no reflection has a memory that grows but doesn’t learn.
  • Episodic feeds procedural via success-attribution. When an agent succeeds at a task, the harness can scan the episodic log of the just-completed task, extract the action sequence, and write it to the procedural store. This is the cache-warming step for procedural memory and it’s typically run as a post-task hook, not in-line.
  • Semantic and procedural are read together at task start. When a new task arrives, the agent typically reads relevant semantic facts (“this user prefers terse responses”) and relevant procedural skills (“here’s how you did this kind of task last time”) in the same retrieval pass, even if they live in separate stores. The two together are the agent’s context for the task; episodic memory is consulted later, only if the agent decides it needs to.
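
The success-attribution hook from the second bullet can be sketched as a post-task scan of the episodic log. The episode fields used here (`task_id`, `ts`, `role`, `action`, `outcome`) are assumptions for the example, not a schema any framework defines.

```python
def extract_procedure(episodes: list[dict], task_id: str) -> list[str]:
    """Return the ordered tool actions of one task, but only if the task succeeded."""
    task_episodes = sorted(
        (e for e in episodes if e["task_id"] == task_id),
        key=lambda e: e["ts"],
    )
    # Success gate: no procedure is extracted from failed or unknown tasks.
    if not task_episodes or task_episodes[-1].get("outcome") != "success":
        return []
    return [e["action"] for e in task_episodes if e.get("role") == "tool" and "action" in e]
```

Run as a post-task hook, the returned list is what would be handed to the procedural store's write function, with the task description as the key.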

The clean version of the read order at task start is: pin core semantic facts → retrieve top-K procedural matches → retrieve relevant episodic context if needed → run. Most failed memory designs get this read order wrong by either pulling everything every turn (token explosion) or pulling nothing and falling back on the model’s parametric knowledge (the Thursday complaint from the hook — re-asking the package manager question because the answer isn’t in any memory tier, just lost in last week’s vanished conversation buffer).
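
That read order can be sketched as a single assembly function. The search callables and the `needs_episodes` gate are hypothetical hooks; in practice they would wrap the procedural and episodic stores and some cheap check on the task text.

```python
from typing import Callable

Search = Callable[[str, int], list[str]]


def assemble_context(
    task: str,
    pinned_facts: dict[str, str],
    search_procedures: Search,
    search_episodes: Search,
    needs_episodes: Callable[[str], bool],
    top_k: int = 3,
) -> str:
    """Build the task-start context: pinned facts, then procedures, then episodes."""
    sections = []
    # 1. Pinned semantic facts: always present, no retrieval step.
    facts = [f"- {key}: {value}" for key, value in pinned_facts.items()]
    sections.append("Facts:\n" + ("\n".join(facts) or "(none)"))
    # 2. Top-K procedural matches for this task.
    procedures = search_procedures(task, top_k)
    sections.append("Procedures:\n" + ("\n".join(f"- {p}" for p in procedures) or "(none)"))
    # 3. Episodic context only when the gate asks for it, to avoid token explosion.
    if needs_episodes(task):
        episodes = search_episodes(task, top_k)
        sections.append("Episodes:\n" + ("\n".join(f"- {e}" for e in episodes) or "(none)"))
    return "\n\n".join(sections)
```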

Code: all four tiers in Python

The smallest interesting build: an agent with explicit working, episodic, semantic, and procedural memory. The example uses the Anthropic SDK and Chroma so you can see what each tier’s read/write code actually looks like. Install: pip install anthropic chromadb.

python
import time
import json
from anthropic import Anthropic
import chromadb

client = Anthropic()
chroma = chromadb.Client()

# Storage tier (long-term, cross-session)
episodic = chroma.get_or_create_collection("episodic")
semantic: dict[str, dict] = {}  # key -> {value, confidence, ts}
procedural = chroma.get_or_create_collection("procedural")

SYSTEM_TEMPLATE = """You are a personal assistant with structured memory.

## Pinned facts (semantic, always in context)
{facts}

## Cached procedures relevant to the current task
{procedures}

## Recent relevant episodes
{episodes}

Write directives: if the user shares a durable fact, start your reply with
a single JSON line like {{"semantic": {{"key": "...", "value": "...", "confidence": 0.9}}}}.
If you successfully execute a multi-step task, end your reply with
{{"procedure": {{"description": "...", "steps": ["...", "..."]}}}}.
"""

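# Working memory: the per-task scratchpad (the messages list itself)
# This lives in-context; the harness keeps it bounded.
def working_memory_for(history: list[dict], budget: int = 8) -> list[dict]:
    """Sliding window over the conversation. L1-style eviction."""
    window = history[-budget:]
    # The Messages API expects the first message to come from the user, so
    # drop any assistant turn left dangling at the front of the window.
    while window and window[0]["role"] != "user":
        window = window[1:]
    return window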

# Episodic write: append-only, every turn
def write_episode(session: str, role: str, text: str, importance: float = 0.5):
    episodic.add(
        documents=[text],
        metadatas=[{"session": session, "role": role,
                    "ts": time.time(), "importance": importance}],
        ids=[f"{session}-{int(time.time() * 1e6)}"],
    )

# Episodic read: recency × importance × similarity
def read_episodes(query: str, k: int = 5) -> str:
    if episodic.count() == 0:
        return "(none)"  # nothing logged yet (e.g. the first turn)
    hits = episodic.query(query_texts=[query],
                          n_results=min(k * 2, episodic.count()))
    now = time.time()
    scored = []
    for doc, meta, dist in zip(hits["documents"][0], hits["metadatas"][0],
                               hits["distances"][0]):
        age_days = (now - meta["ts"]) / 86_400
        recency = max(0.1, 1.0 - age_days / 30)
        importance = meta.get("importance", 0.5)
        similarity = 1.0 / (1.0 + dist)  # map Chroma's distance into (0, 1]
        # Chroma returns similarity-ranked already; we re-rank by full score.
        scored.append((similarity * recency * importance, doc))
    top = sorted(scored, reverse=True)[:k]
    return "\n".join(f"- {d}" for _, d in top) or "(none)"

# Semantic write: keyed, conflict-aware
def write_fact(key: str, value: str, confidence: float):
    existing = semantic.get(key)
    if existing and existing["confidence"] > confidence:
        return  # don't overwrite a higher-confidence fact
    semantic[key] = {"value": value, "confidence": confidence, "ts": time.time()}

# Semantic read: dump pinned facts; in production, retrieve subset
def read_facts() -> str:
    if not semantic:
        return "(none)"
    return "\n".join(f"- {k}: {v['value']}" for k, v in semantic.items())

# Procedural write: index a successful skill
def write_procedure(description: str, steps: list[str]):
    procedural.add(
        documents=[description],
        metadatas=[{"steps": json.dumps(steps), "ts": time.time(), "uses": 1}],
        ids=[f"proc-{int(time.time() * 1e6)}"],
    )

# Procedural read: similarity search on the *task description*
def read_procedures(task: str, k: int = 3) -> str:
    if procedural.count() == 0:
        return "(none)"
    hits = procedural.query(query_texts=[task], n_results=min(k, procedural.count()))
    lines = []
    for doc, meta in zip(hits["documents"][0], hits["metadatas"][0]):
        steps = json.loads(meta["steps"])
        lines.append(f"- {doc}: {' -> '.join(steps)}")
    return "\n".join(lines)

def parse_writes(reply: str) -> str:
    """Strip leading/trailing JSON write directives and execute them."""
    lines = reply.splitlines()
    clean = []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("{") and stripped.endswith("}"):
            try:
                obj = json.loads(stripped)
                if "semantic" in obj:
                    f = obj["semantic"]
                    write_fact(f["key"], f["value"], f.get("confidence", 0.7))
                    continue
                if "procedure" in obj:
                    p = obj["procedure"]
                    write_procedure(p["description"], p["steps"])
                    continue
            except json.JSONDecodeError:
                pass
        clean.append(line)
    return "\n".join(clean).strip()

def turn(session: str, history: list[dict], user_msg: str) -> str:
    # Read pass: facts (always), procedures (task-keyed), episodes (similarity)
    system = SYSTEM_TEMPLATE.format(
        facts=read_facts(),
        procedures=read_procedures(user_msg),
        episodes=read_episodes(user_msg),
    )

    write_episode(session, "user", user_msg, importance=0.6)
    history.append({"role": "user", "content": user_msg})

    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        system=system,
        messages=working_memory_for(history),
    )
    raw = "".join(b.text for b in resp.content if b.type == "text")
    reply = parse_writes(raw)

    history.append({"role": "assistant", "content": reply})
    write_episode(session, "assistant", reply, importance=0.5)
    return reply

# Demo
history: list[dict] = []
print(turn("u-42", history, "I'm vegetarian. Plan a Tokyo trip for next month."))
print(turn("u-42", history, "What lunch spots should I check out?"))

Three things worth noticing. First, every tier has its own read function and its own write function, and they’re called at different times in the turn: semantic facts get read on every turn (cheap, pinned), procedures get read keyed by the task (mid-cost, narrow), episodes get read by similarity (most expensive, broadest signal). Second, the writes are conditional on a model-emitted directive — only durable facts become semantic, only successful procedures get cached. The naive alternative (“store everything”) is what breaks the design. Third, working memory is just history[-budget:] — the harness’s job is to keep the in-context message list bounded, and the durable tiers are how anything important survives the truncation.

The code is deliberately framework-light so the structure is visible. The next subtree articles will replace pieces with Mem0, Letta, or LangGraph stores, but the layering will stay the same.

Code: all four tiers in TypeScript

The TypeScript version uses the Vercel AI SDK for the model call and a hand-rolled in-memory store for each tier so the layering stays explicit. Install: npm install ai @ai-sdk/anthropic zod.

typescript
import { anthropic } from "@ai-sdk/anthropic";
import { generateText } from "ai";

// ---------- storage tiers (process-local; replace with real backends in prod) ----------
type Episode = { session: string; role: string; text: string; ts: number; importance: number };
type Fact = { value: string; confidence: number; ts: number };
type Procedure = { description: string; steps: string[]; ts: number; uses: number };

const episodic: Episode[] = [];
const semantic = new Map<string, Fact>();
const procedural: Procedure[] = [];

// toy similarity: shared-word overlap. Replace with an embedding-model call in prod.
const sim = (a: string, b: string): number => {
  const sa = new Set(a.toLowerCase().split(/\s+/));
  const sb = new Set(b.toLowerCase().split(/\s+/));
  const overlap = [...sa].filter((w) => sb.has(w)).length;
  return overlap / Math.max(sa.size, 1);
};

// ---------- per-tier read/write ----------
const writeEpisode = (session: string, role: string, text: string, importance = 0.5) =>
  episodic.push({ session, role, text, ts: Date.now(), importance });

const readEpisodes = (query: string, k = 5): string => {
  const now = Date.now();
  const scored = episodic.map((e) => {
    const ageDays = (now - e.ts) / 86_400_000;
    const recency = Math.max(0.1, 1 - ageDays / 30);
    return { e, score: sim(query, e.text) * recency * e.importance };
  });
  const top = scored.sort((a, b) => b.score - a.score).slice(0, k);
  return top.length ? top.map((s) => `- ${s.e.text}`).join("\n") : "(none)";
};

const writeFact = (key: string, value: string, confidence: number) => {
  const existing = semantic.get(key);
  if (existing && existing.confidence > confidence) return;
  semantic.set(key, { value, confidence, ts: Date.now() });
};

const readFacts = (): string =>
  semantic.size
    ? [...semantic.entries()].map(([k, v]) => `- ${k}: ${v.value}`).join("\n")
    : "(none)";

const writeProcedure = (description: string, steps: string[]) =>
  procedural.push({ description, steps, ts: Date.now(), uses: 1 });

const readProcedures = (task: string, k = 3): string => {
  if (!procedural.length) return "(none)";
  const top = procedural
    .map((p) => ({ p, score: sim(task, p.description) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
  return top.map(({ p }) => `- ${p.description}: ${p.steps.join(" -> ")}`).join("\n");
};

// working memory: sliding window over the conversation.
// Generic so the narrow "user" | "assistant" role type survives into generateText.
const workingMemory = <T>(history: T[], budget = 8): T[] => history.slice(-budget);

// ---------- turn ----------
const parseWrites = (reply: string): string => {
  const lines = reply.split("\n");
  const clean: string[] = [];
  for (const line of lines) {
    const t = line.trim();
    if (t.startsWith("{") && t.endsWith("}")) {
      try {
        const obj = JSON.parse(t);
        if (obj.semantic) {
          writeFact(obj.semantic.key, obj.semantic.value, obj.semantic.confidence ?? 0.7);
          continue;
        }
        if (obj.procedure) {
          writeProcedure(obj.procedure.description, obj.procedure.steps);
          continue;
        }
      } catch { /* not a write directive */ }
    }
    clean.push(line);
  }
  return clean.join("\n").trim();
};

const buildSystem = (userMsg: string) => `
You are a personal assistant with structured memory.

## Pinned facts
${readFacts()}

## Cached procedures relevant to the current task
${readProcedures(userMsg)}

## Recent relevant episodes
${readEpisodes(userMsg)}

Emit {"semantic": {"key": "...", "value": "...", "confidence": 0.9}} on durable facts.
Emit {"procedure": {"description": "...", "steps": ["..."]}} after a successful multi-step task.
`;

export const turn = async (
  session: string,
  history: { role: "user" | "assistant"; content: string }[],
  userMsg: string,
): Promise<string> => {
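  // Build the prompt before logging the new message (same order as the Python
  // version), so the episodic read does not just return the message being sent.
  const system = buildSystem(userMsg);

  writeEpisode(session, "user", userMsg, 0.6);
  history.push({ role: "user", content: userMsg });

  const { text } = await generateText({
    model: anthropic("claude-opus-4-7"),
    system,
    messages: workingMemory(history),
  });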

  const reply = parseWrites(text);
  history.push({ role: "assistant", content: reply });
  writeEpisode(session, "assistant", reply, 0.5);
  return reply;
};

// Demo
const history: { role: "user" | "assistant"; content: string }[] = [];
await turn("u-42", history, "I'm vegetarian. Plan a Tokyo trip for next month.");
await turn("u-42", history, "What lunch spots should I check out?");

Same four-tier shape. The Vercel AI SDK gives you provider-portable model calls; the storage layer is hand-rolled to keep the layering visible. In a production build, the in-memory stores get replaced (Postgres for facts, Qdrant/Chroma for episodes and procedures, a Redis/SQL session store for the sliding-window working memory) but the interface surface — four read functions, four write functions, distinct trigger conditions — stays put.

Trade-offs and gotchas

The biggest design mistake is collapsing two tiers into one substrate without giving them distinct read/write paths. Storing semantic facts as “just another episode” with a type=fact tag is fine as a storage decision; using the same retrieval code path for both is the mistake. Episodic memory wants recency-weighted similarity over a large noisy corpus; semantic memory wants direct keyed lookup over a small curated set. The substrate can be the same vector store with metadata filters; the code paths must be different. Mem0 nominally separates them, but in their early versions the retrieval was unified and the recall-vs-precision trade-off was noticeably worse for facts than for episodes; their fact-extraction-then-keyed-write redesign is what tightened it.
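
To make the "same substrate, different code paths" point concrete, here is a minimal Python sketch. It assumes a single in-process list standing in for the shared store; the `Record` type, its `kind` tag, and the word-overlap `overlap` function are hypothetical stand-ins for a real vector store, its metadata filter, and an embedding model.

```python
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class Record:
    kind: str            # hypothetical type tag: "fact" or "episode"
    key: Optional[str]   # set for facts only
    text: str
    ts: float            # write time, seconds since the epoch

def overlap(query: str, text: str) -> float:
    """Toy similarity: fraction of query words that also appear in the text."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)

def read_fact(store: list[Record], key: str) -> Optional[str]:
    """Semantic path: exact keyed lookup, newest write wins, no similarity."""
    hits = [r for r in store if r.kind == "fact" and r.key == key]
    return max(hits, key=lambda r: r.ts).text if hits else None

def read_episodes(store: list[Record], query: str, now: float, k: int = 3) -> list[str]:
    """Episodic path: similarity weighted by a linear 30-day recency decay."""
    scored = []
    for r in store:
        if r.kind != "episode":
            continue
        age_days = (now - r.ts) / 86_400
        recency = max(0.1, 1 - age_days / 30)
        score = overlap(query, r.text) * recency
        if score > 0:
            scored.append((score, r.text))
    return [text for _, text in sorted(scored, reverse=True)[:k]]

# One store, two record kinds, two read paths.
now = time.time()
store = [
    Record("fact", "db", "Postgres", now - 100 * 86_400),
    Record("fact", "db", "Cassandra", now - 5 * 86_400),
    Record("episode", None, "we debated the Postgres migration plan", now - 60 * 86_400),
    Record("episode", None, "the Cassandra migration finished cleanly", now - 2 * 86_400),
]
print(read_fact(store, "db"))                         # Cassandra
print(read_episodes(store, "migration status", now))  # newer episode ranks first
```

The keyed path never touches similarity scoring, so a stale but similar-sounding episode cannot outrank the current fact. Newest-write-wins is only one possible conflict policy; the confidence comparison in `writeFact` above is another.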

Procedural memory is the most under-implemented tier in production. Most “memory” products on the market today have episodic and semantic but no procedural — they cache observations and facts but not skills. The reason is partly that procedural memory is the hardest to design well: the keying scheme (what makes two tasks “similar enough” to share a cached procedure?), the cache-invalidation policy (when does a cached procedure go stale?), and the success-attribution heuristic (which sequence of actions actually caused the success?) are all hard. Voyager works because Minecraft is deterministic enough that these questions have easy answers; production agents in fuzzier domains have to do real work to answer them. But the payoff is large: a missing procedural tier is the difference between “the agent re-thinks every task from scratch” and “the agent gets faster over time.”
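
The three design questions above can be sketched in a few lines of Python. This is a deliberately crude illustration, not how Voyager or any shipping product does it: `task_key`, `ProcedureCache`, the stopword list, and the failure-rate thresholds are all hypothetical, and the attribution rule (credit or blame the whole step sequence) is the simplest one available.

```python
from dataclasses import dataclass
from typing import Optional

STOPWORDS = {"the", "a", "an", "to", "for", "please"}

def task_key(description: str) -> str:
    """Keying: collapse a task description to a sorted bag of content words."""
    words = {w for w in description.lower().split() if w not in STOPWORDS}
    return " ".join(sorted(words))

@dataclass
class CachedProcedure:
    steps: list[str]
    successes: int = 0
    failures: int = 0

class ProcedureCache:
    def __init__(self, max_failure_rate: float = 0.5, min_runs: int = 3) -> None:
        self.entries: dict[str, CachedProcedure] = {}
        self.max_failure_rate = max_failure_rate
        self.min_runs = min_runs

    def lookup(self, description: str) -> Optional[list[str]]:
        entry = self.entries.get(task_key(description))
        return entry.steps if entry else None

    def record(self, description: str, steps: list[str], succeeded: bool) -> None:
        """Attribution: credit or blame the whole step sequence for the outcome."""
        key = task_key(description)
        entry = self.entries.get(key)
        if entry is None:
            if succeeded:  # only a successful run creates a cache entry
                self.entries[key] = CachedProcedure(steps=steps, successes=1)
            return
        if succeeded:
            entry.successes += 1
        else:
            entry.failures += 1
        runs = entry.successes + entry.failures
        # Invalidation: evict once the observed failure rate crosses the threshold.
        if runs >= self.min_runs and entry.failures / runs > self.max_failure_rate:
            del self.entries[key]
```

A caller would `record` the outcome after every multi-step task and `lookup` before planning the next one. In a fuzzier domain the exact-match key would give way to embedding similarity, and whole-sequence attribution would need to be replaced by something that can tell which steps actually mattered.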

Working memory has nothing to do with persistence and the name is a constant source of confusion. “Working memory” in CoALA is the in-context scratchpad — the bytes the model is staring at right now. “Working memory” in everyday usage often means “stuff I want to remember for a few hours but not forever.” Pin which definition you’re using before the design conversation. The CoALA one is the one this article and the rest of the subtree use; the everyday one is what some product PMs mean when they say “working memory” and it’s actually a short-lived semantic memory in this taxonomy.

Don’t confuse procedural memory with the tool selection at scale problem. Tool selection is about choosing which single tool to call from a large catalog right now. Procedural memory is about retrieving a multi-step recipe of tool calls that succeeded last time. They both involve embedding-based retrieval; they answer different questions. A common architectural confusion is treating cached procedures as “really long tool descriptions” and routing them through the tool-selection layer — the embeddings end up keyed wrong (procedures are keyed by task, tools are keyed by capability) and recall@k tanks.

Read-order discipline matters more than substrate choice. A team that gets the four substrates “right” but reads everything on every turn will burn tokens and dilute attention until the system regresses. A team with a single Postgres table and a disciplined read order — pinned semantic facts always, procedural top-K keyed by task, episodes only on cache miss — will outperform. The taxonomy matters because it tells you when to read each tier; the substrate is downstream of that decision.
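
One way to encode that read order is a single assembly function that decides which tiers get read at all. The sketch below is in Python and makes several assumptions: `assemble_context`, the `min_score` threshold, and the word-overlap scorer are hypothetical, and "cache miss" is interpreted as "no cached procedure scored above the threshold."

```python
from typing import Callable

def overlap(query: str, text: str) -> float:
    """Toy similarity: fraction of query words that also appear in the text."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)

def assemble_context(
    task: str,
    facts: dict[str, str],
    procedures: dict[str, list[str]],
    search_episodes: Callable[[str], list[str]],
    k: int = 3,
    min_score: float = 0.5,
) -> dict[str, list[str]]:
    # 1. Pinned semantic facts: always read, cheap and small.
    context = {"facts": [f"{key}: {value}" for key, value in facts.items()]}

    # 2. Procedures: top-k keyed by the task, kept only above the threshold.
    scored = sorted(((overlap(task, desc), desc) for desc in procedures), reverse=True)
    hits = [desc for score, desc in scored[:k] if score >= min_score]
    context["procedures"] = [f"{d}: {' -> '.join(procedures[d])}" for d in hits]

    # 3. Episodes: the expensive read, paid for only on a procedural miss.
    context["episodes"] = [] if hits else search_episodes(task)
    return context
```

Passing the episodic search in as a callable keeps the expensive read lazy: when a cached procedure matches, the episode store is never queried.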

The four-type frame is descriptive, not prescriptive. Some workloads don’t need all four. A single-session, intra-conversation assistant needs only working memory. A customer-support bot that resets between tickets needs working + a small amount of semantic (the user’s account record). A long-running personal-assistant agent needs all four. The taxonomy’s value is to make the missing tier visible — when you skip a tier, you should know which one you skipped and why, not pretend the question doesn’t exist.

Further reading

  • The Memory Stack: A Map of AI Memory — the parent article. Today’s piece zoomed in on the four-type axis; that one keeps the full map (in-context vs storage, write/read/maintain, the OS-memory-hierarchy parallel) and is the right place to start if you came in mid-subtree.
  • Memory Write Policies: What’s Worth Remembering — the write-axis deep dive that operationalizes the write-policy frame this article keeps invoking for each cognitive type. Episodic, semantic, and procedural memory all funnel through the same four-stage pipeline; the differences are in the prompt, the dedup key, and the tier the result lands in.
  • Working Memory: Scratchpads, Blackboards, and Agent Notebooks — the structured side of the in-context tier. Where short-term memory is the conversation buffer, working memory is the explicit, typed scratchpad the harness maintains for the agent: dataflow-graph state, external notebook tools, and shared blackboards.
  • Long-Term Memory: Vector-Backed Episodic Storage — the production-grade deep dive on the episodic tier. Episode boundaries, the WAL parallel, the recency-weighted read path, and how Mem0, Letta, and LangGraph stores each draw the line between episodic and semantic storage.
  • Hierarchical Memory: Working / Episodic / Semantic Tiers — the orthogonal framing to today’s piece. The four cognitive types are what kind of information; the tiered hierarchy (core/recall/archival) is where in the cost gradient each piece lives. MemGPT’s OS-paging model is what makes the cache-hierarchy parallel operational.