$ cat ai-engineering/memory-stack-overview.md

The Memory Stack: A Map of AI Memory

A map of AI agent memory: in-context vs storage, the four cognitive types, the write/read/maintain axes, and why memory isn't RAG with a longer leash.

Jatin Bansal@blog:~/ai-engineering$ open memory-stack-overview

A customer-support agent ships on a Monday. By Friday it has answered 4,000 tickets and a product manager asks the obvious question: “If I email it again next week, will it remember me?” The honest answer is no — the agent has a context window, not memory. Every conversation starts from the same blank slate; every preference the user shared on Tuesday is gone by Thursday. The fix is not “make the context window bigger.” A 1M-token window with no persistence is still amnesia in slow motion. The fix is to build a memory stack — a deliberate layering of in-context working memory, durable storage, write policies, read policies, and maintenance — and to recognize that this is a different engineering surface from RAG, even though it overlaps with RAG at one of its layers. This article is the map of that stack; the rest of the Memory subtree fills it in.

Opening bridge

The last four articles built out the Agents subtree from the ground up — the agent loop, planning vs reactive control flow, multi-agent orchestration, and tool selection at scale. Every one of those pieces assumed that the conversation history is the agent’s memory: the loop replays prior turns, the supervisor’s context carries forward, the tool catalog gets re-prefilled each call. That assumption breaks the moment the agent has to remember anything across a session boundary, across a context-window flush, or across an outage. Today’s piece names what was missing — the memory stack as a first-class design surface — and sets up the next ~20 articles that work through each layer in detail. The context engineering article is the closest neighbor: it covered which tokens go into a single call’s window; this piece covers where state lives between calls.

What memory actually is, at the right level of abstraction

A working definition that survives engineering use: memory is any state the agent reads on input, writes on output, and maintains across the gaps between calls. Three verbs, all load-bearing. Read is the retrieval pass that turns persistent state into in-context tokens. Write is the deliberate decision to commit something from the current call into persistent state. Maintain is everything that happens between reads and writes — consolidation, compaction, deletion, deduplication, conflict resolution.
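
The three verbs can be pinned down as an interface. What follows is a minimal sketch, not a prescribed API: the `Memory` protocol and the `ListMemory` class are hypothetical names introduced here for illustration, and the word-overlap read is a stand-in for real retrieval.

```python
from dataclasses import dataclass, field
from typing import Protocol


class Memory(Protocol):
    """Hypothetical interface: one method per verb in the definition."""

    def read(self, query: str) -> list[str]: ...
    def write(self, item: str) -> None: ...
    def maintain(self) -> None: ...


@dataclass
class ListMemory:
    """Toy implementation backed by a plain list (illustrative only)."""

    items: list[str] = field(default_factory=list)

    def read(self, query: str) -> list[str]:
        # Read: turn persistent state into strings that can be placed in the prompt.
        words = set(query.lower().split())
        return [i for i in self.items if words & set(i.lower().split())]

    def write(self, item: str) -> None:
        # Write: an explicit commit from the current call into persistent state.
        self.items.append(item)

    def maintain(self) -> None:
        # Maintain: runs between calls; here it only removes exact duplicates,
        # keeping the first occurrence and preserving order.
        self.items = list(dict.fromkeys(self.items))


mem: Memory = ListMemory()
mem.write("user prefers dark mode")
mem.write("user prefers dark mode")
mem.maintain()
print(mem.read("which mode does the user like?"))
```

Any real layer will have a richer signature, but if one of the three methods has no sensible body, that layer is missing a verb.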

What memory is not, at this level of abstraction: it is not “the model’s parameters.” Parametric knowledge — what the model learned during pretraining — is a substrate the memory system can call, but it isn’t itself memory in the engineering sense, because you can’t write to it during a conversation. Fine-tuning is a slow, batched write path that lives outside the operational loop. The memory we’re talking about is the part you build, deploy, and own.

The clearest formal frame is CoALA, Cognitive Architectures for Language Agents (Sumers, Yao et al., 2023), the Princeton paper that ported cognitive-science vocabulary into LLM engineering and named the four memory types every agent eventually grows: working, episodic, semantic, procedural. The shape of the contemporary stack hasn’t moved much since CoALA; what has changed is the depth of each layer’s tooling and the empirical evaluations. The recent Memory in the Age of AI Agents survey (Hu et al., December 2025) catalogs the field at 47 authors’ worth of breadth and proposes a finer-grained taxonomy (factual / experiential / working from the function angle; token-level / parametric / latent from the form angle). For an engineering audience, CoALA’s four-type frame is still the right one to start from; the survey is where you go when you need vocabulary for the edge cases.

The first axis: in-context vs storage

Every memory cell in the system lives on one side of a hard line. In-context memory is the bytes in the next call’s prompt window. Storage memory is everything else — files, databases, vectors, graphs, the disk in your laptop, the row in your Postgres table. The boundary is the API call to the model.

This boundary matters because every other dimension of the memory stack reduces to “how do we move bytes across it.” The retrieval pass moves storage→in-context. The write pass moves in-context→storage. Compaction shrinks the in-context side without touching storage; consolidation rewrites storage without touching the in-context side. Once you see the boundary, the architecture of every memory framework on the market becomes legible — Mem0, Letta, Zep, LangGraph’s stores, OpenAI’s Sessions, MemGPT — they’re all moving bytes across the same line, with different policies for what to move, when, and how.

In-context memory behaves like registers: no extra access latency (the model is reading those tokens anyway), always available, capped by the context window, and discarded at the call boundary by default. Storage behaves like disk: durable and effectively unbounded, but everything you load costs a retrieval round-trip plus in-context token budget.
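
The cost of crossing the boundary can be made concrete with a budgeted loader. This is a sketch under stated assumptions: `estimate_tokens` uses a rough four-characters-per-token rule instead of a real tokenizer, and `load_into_context` is a hypothetical helper that assumes its candidates arrive already ranked best-first.

```python
def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: roughly 4 characters per token (assumption)."""
    return max(1, len(text) // 4)


def load_into_context(candidates: list[str], budget_tokens: int) -> list[str]:
    """Move storage entries across the boundary until the budget is spent.

    `candidates` is assumed to be ordered best-first by whatever read policy
    produced it.
    """
    loaded, used = [], 0
    for entry in candidates:
        cost = estimate_tokens(entry)
        if used + cost > budget_tokens:
            break  # stop at the first entry that does not fit
        loaded.append(entry)
        used += cost
    return loaded


storage = [
    "user is vegetarian",
    "user is planning a trip to Tokyo next month",
    "user asked about lunch spots on Tuesday",
]
print(load_into_context(storage, budget_tokens=15))
```

Storage can hold all three entries indefinitely; the budget decides how many of them the model actually sees on this call.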

The second axis: the four cognitive types

The cognitive-science taxonomy that has stuck — and that the next article in this subtree unpacks in detail — is the four-type frame from CoALA and the classical psychology lineage it draws from. Each type has a distinct write pattern, a distinct read pattern, and a distinct failure mode.

Working memory. The scratchpad the agent uses during the current task: intermediate reasoning, partial plans, tool results not yet committed elsewhere. Lives in-context by default. Equivalent to the call stack of a running program. The working-memory article goes deep on the substrates — scratchpads, typed state objects, external notebooks, blackboards — and how each interacts with the conversation buffer.
Episodic memory. “What happened” — a record of past interactions, observations, and outcomes, indexed by time and context. The user’s message from last Tuesday is an episode; the tool result you logged at 3:14am is an episode. Writes are appends; reads are recall queries. The closest distributed-systems analogue is a write-ahead log — append-only, ordered, replay-able.
Semantic memory. “What is true” — generalized facts extracted from episodes (or seeded from external knowledge). “The user prefers dark mode” is semantic; “the API key rotates every 90 days” is semantic. Writes are distillations from episodes (or direct facts from outside the loop); reads are factual lookups. The analogue is a key-value store or a knowledge graph; the read pattern matters more than the write pattern, because semantic facts are read constantly.
Procedural memory. “How to do X” — cached skill, learned patterns of action that succeed for a class of tasks. “To onboard a customer, do A then B then C” is procedural. Writes are slow (a procedure is only worth caching after you’ve seen it succeed a few times); reads happen at the moment the agent recognizes a familiar task shape. The analogue is a compiled-code cache: the JIT compiler caches hot paths because dispatching them is faster than recompiling. Procedural memory caches successful action sequences because reasoning them out from scratch is expensive.

The four types are not orthogonal in implementation — most production systems back episodic and semantic with the same vector store and just tag entries differently. They are orthogonal in purpose, and getting them confused is a common source of design pain. Storing every tool result as a “fact” in semantic memory bloats the knowledge base with episodes that should have been logged elsewhere. Storing distilled preferences as raw episodes makes them hard to find because they look identical to all the other observations the agent recorded.
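
One way to share a physical store while keeping the purposes separate is to make the type a required tag on every entry. The sketch below is illustrative: `MemoryType`, `Entry`, and `TaggedStore` are hypothetical names, and working memory is left out because it lives in-context by default.

```python
import time
from dataclasses import dataclass, field
from enum import Enum


class MemoryType(Enum):
    EPISODIC = "episodic"      # what happened
    SEMANTIC = "semantic"      # what is true
    PROCEDURAL = "procedural"  # how to do X


@dataclass
class Entry:
    kind: MemoryType
    text: str
    ts: float = field(default_factory=time.time)


class TaggedStore:
    """One physical store, with the cognitive type kept as a tag on each entry."""

    def __init__(self) -> None:
        self._entries: list[Entry] = []

    def add(self, kind: MemoryType, text: str) -> None:
        self._entries.append(Entry(kind, text))

    def by_kind(self, kind: MemoryType) -> list[str]:
        # Filtering on the tag keeps facts from drowning among raw observations.
        return [e.text for e in self._entries if e.kind is kind]


store = TaggedStore()
store.add(MemoryType.EPISODIC, "user asked to switch the UI theme to dark")
store.add(MemoryType.SEMANTIC, "user prefers dark mode")
store.add(MemoryType.PROCEDURAL, "onboarding: do A, then B, then C")
print(store.by_kind(MemoryType.SEMANTIC))
```

Because `add` cannot be called without a `kind`, the "is this an episode or a fact?" decision is forced at write time rather than discovered at read time.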

The third axis: read, write, maintain

The verbs are where the engineering hides. Every memory layer must answer all three independently.

Write policies answer “what’s worth remembering?” The naive answer (“everything”) is the same trap as logging at DEBUG in production: storage isn’t free, the write itself usually requires a model call to distill, and retrieval gets noisier as the corpus fills with junk. A defensible write policy is a classifier — explicit (a model call that returns a should_remember: boolean), heuristic (rules about which message types are stored), or learned. The memory write policies article is the deep dive on the four-stage write pipeline (triage, extract, dedupe, persist) and the journal-and-checkpoint pattern that lets the read path stay fast as the corpus grows; today’s frame: a memory system without an explicit write policy will silently devolve into “store every assistant turn” and the retrieval quality will collapse around the 1k-episode mark. The write-amplification parallel from storage engineering is exact — every byte you write is a byte you’ll have to read, compact, and pay for later.
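
The four stages named above (triage, extract, dedupe, persist) can be sketched with rules alone. Everything here is an assumption made for illustration: the marker list, the sentence splitter, and the dict-backed store are placeholders, and a production pipeline would usually put a model call behind `extract`.

```python
import re

# Hypothetical rules: phrases that hint at a durable, first-person fact.
DURABLE_MARKERS = ("i am", "i'm", "i prefer", "my ", "i always", "i never")


def triage(message: str) -> bool:
    """Stage 1: cheap gate. Is there any sign of a durable fact?"""
    lowered = message.lower()
    return any(marker in lowered for marker in DURABLE_MARKERS)


def extract(message: str) -> list[str]:
    """Stage 2: split into sentences and keep the ones that pass triage."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", message) if s.strip()]
    return [s for s in sentences if triage(s)]


def normalize(fact: str) -> str:
    """Lowercase and collapse whitespace so near-identical facts share a key."""
    return re.sub(r"\s+", " ", fact.lower()).strip()


def write_pipeline(message: str, store: dict[str, str]) -> list[str]:
    """Run triage -> extract -> dedupe -> persist; return the facts actually written."""
    if not triage(message):
        return []
    written = []
    for fact in extract(message):
        key = normalize(fact)
        if key in store:   # Stage 3: dedupe on the normalized text
            continue
        store[key] = fact  # Stage 4: persist
        written.append(fact)
    return written


facts: dict[str, str] = {}
write_pipeline("Thanks, that worked!", facts)
write_pipeline("I'm vegetarian. What's the weather like?", facts)
write_pipeline("I'm  vegetarian!", facts)
print(list(facts.values()))
```

Three messages go in and one fact comes out: the first fails triage, the second yields one durable sentence, and the third is a duplicate.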

Read policies answer “what’s relevant now?” The naive answer (“cosine-similarity top-K”) is the same RAG default that the retrieval cascade subtree spent eight articles improving on. Memory retrieval has the same surface but more signals to work with: recency (the episode from yesterday is usually more relevant than one from six months ago), importance (the episode you flagged at write time as load-bearing should outrank the one you flagged as routine), and use-frequency (the fact the agent has retrieved 50 times is probably worth re-ranking up). The Generative Agents paper (Park et al., 2023) introduced the formulation that has become canonical: score = α_recency · recency + α_importance · importance + α_relevance · relevance, a weighted sum in which each term is first min-max normalized to [0, 1] and relevance is the embedding similarity to the query. Multiplying the terms instead, as the TypeScript example later in this article does, is a common simplification. A later article on memory retrieval policies will work the formula in detail; today’s frame: a memory system that uses only similarity will retrieve like RAG and forget like a goldfish.
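
For reference, the scoring in the Generative Agents paper is additive: each of the three signals is min-max normalized across the candidate set and then combined as a weighted sum. The sketch below follows that additive form under a few assumptions. The `Candidate` fields and the `rank` function are names introduced here, the inputs are made up, and the default decay of 0.995 per hour mirrors the factor reported in the paper but should be treated as a tunable.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    hours_since_access: float  # recency input
    importance: float          # e.g. a 1-10 rating assigned at write time
    similarity: float          # e.g. cosine similarity to the query


def min_max(values: list[float]) -> list[float]:
    """Scale values into [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank(cands: list[Candidate], decay: float = 0.995,
         w_recency: float = 1.0, w_importance: float = 1.0,
         w_similarity: float = 1.0) -> list[Candidate]:
    """Rank by a weighted sum of normalized recency, importance, and similarity."""
    if not cands:
        return []
    recency = min_max([decay ** c.hours_since_access for c in cands])
    importance = min_max([c.importance for c in cands])
    similarity = min_max([c.similarity for c in cands])
    scores = [
        w_recency * r + w_importance * i + w_similarity * s
        for r, i, s in zip(recency, importance, similarity)
    ]
    order = sorted(range(len(cands)), key=lambda idx: scores[idx], reverse=True)
    return [cands[idx] for idx in order]


cands = [
    Candidate("user is vegetarian", hours_since_access=24 * 30, importance=9, similarity=0.70),
    Candidate("user said hello", hours_since_access=1, importance=1, similarity=0.30),
    Candidate("user booked Tokyo flights", hours_since_access=24 * 2, importance=6, similarity=0.40),
]
for c in rank(cands):
    print(c.text)
```

With these inputs the month-old dietary fact still ranks first, because importance and similarity outweigh its lost recency, and the fresh but trivial greeting ranks last.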

Maintenance answers “what happens between calls?” — compaction, consolidation, reflection, deletion, conflict resolution, embedding-drift handling. This is the layer least often built and most often missed. Without maintenance, the memory grows monotonically, stale facts pile up against fresh facts of the same kind (“user prefers light mode” recorded in March, “user prefers dark mode” recorded in May, no resolution), and the recall quality decays. The closest distributed-systems analogue is database garbage collection and log compaction: the system can run fine for a long time without it and then catastrophically not. Reflection, sleep-time compute, and conflict/forgetting — three later articles in this subtree — are each a slice of maintenance.
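
The light-mode versus dark-mode example is the smallest useful maintenance job. A minimal sketch, assuming facts are keyed and timestamped: the `Fact` shape and the `resolve_conflicts` helper are hypothetical, and newest-wins is only one of several reasonable policies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    key: str    # e.g. "ui_theme"
    value: str  # e.g. "dark"
    ts: float   # when the fact was recorded (epoch seconds)


def resolve_conflicts(facts: list[Fact]) -> tuple[dict[str, Fact], list[Fact]]:
    """Keep the newest fact per key; return (current, superseded)."""
    current: dict[str, Fact] = {}
    superseded: list[Fact] = []
    for fact in sorted(facts, key=lambda f: f.ts):
        previous = current.get(fact.key)
        if previous is not None and previous.value != fact.value:
            superseded.append(previous)  # keep history instead of deleting silently
        current[fact.key] = fact
    return current, superseded


log = [
    Fact("ui_theme", "light", ts=1.0),  # the March preference
    Fact("ui_theme", "dark", ts=2.0),   # the May preference
    Fact("diet", "vegetarian", ts=1.5),
]
current, superseded = resolve_conflicts(log)
print(current["ui_theme"].value, [f.value for f in superseded])
```

Returning the superseded facts rather than dropping them keeps an audit trail, which matters once provenance questions come up.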

The distributed-systems parallel

The full stack maps onto the memory hierarchy of a multi-process operating system, with one twist that matters.

The context window is L1 cache — fastest access, smallest, evicted at the end of every call.
The session store (short-term memory — the conversation buffer, scratchpad, KV state of the current task) is L2 / DRAM — survives within a session, doesn’t survive a restart. LangGraph calls this short-term memory and backs it with a checkpointer; OpenAI calls it a Session and backs it with Redis, SQLAlchemy, MongoDB, or Dapr.
The episodic store is disk — durable, vector-indexed, an append-only log of observations the agent has recorded.
The semantic store is the filesystem’s metadata layer — extracted, normalized, queryable facts about entities the agent cares about. In Zep and Graphiti this is a temporal knowledge graph; in Mem0 it’s a vector store with structured tags; in Letta this is “archival memory.”
The procedural store is the compiled-code cache — recorded successful action sequences that the agent can replay when the task shape matches. The SOAR cognitive architecture called this “chunking”; modern agent systems are starting to call it “skill libraries.”

MemGPT (Packer et al., 2023) is the most direct version of this parallel: the paper explicitly frames its design as “LLMs as operating systems,” with virtual context management that pages between fast in-context memory and slow external storage the same way a kernel pages between RAM and disk. Letta is the production descendant of MemGPT and still bills itself as an agent runtime where memory tiers are first-class concepts. The hierarchical-memory article later in this subtree walks through the three-tier (core/recall/archival) model in detail. The OS framing isn’t a metaphor; the engineering trade-offs map one-for-one.
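
The paging idea can be shown in a few lines. This is a toy illustration of the concept, not MemGPT's or Letta's implementation: `PagedContext` is a hypothetical class, it caps the buffer by message count rather than tokens, and `page_in` uses substring search where a real system would use retrieval.

```python
from collections import deque


class PagedContext:
    """In-context buffer with a hard message cap; overflow is paged out to storage."""

    def __init__(self, max_messages: int) -> None:
        self.max_messages = max_messages
        self.in_context: deque[str] = deque()
        self.recall_store: list[str] = []  # stand-in for external storage

    def append(self, message: str) -> None:
        self.in_context.append(message)
        while len(self.in_context) > self.max_messages:
            # Page out: evict the oldest message to storage instead of dropping it.
            self.recall_store.append(self.in_context.popleft())

    def page_in(self, keyword: str) -> list[str]:
        """Return stored messages mentioning `keyword` (toy search)."""
        return [m for m in self.recall_store if keyword.lower() in m.lower()]


ctx = PagedContext(max_messages=2)
for msg in ["I'm vegetarian", "I'm going to Tokyo", "What's the weather?"]:
    ctx.append(msg)
print(list(ctx.in_context), ctx.page_in("vegetarian"))
```

The evicted message is gone from the prompt but not from the system, which is the whole difference between a context window and a memory tier.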

The twist that matters: the model’s retrieval is lossy and stochastic, not deterministic. A CPU asking for a page from disk gets the same bytes every time. An agent asking its memory for “what does this user usually want” gets a top-K vector search result that depends on the query embedding, the corpus state, and the cutoff threshold. The lossy retrieval is the same difference that made tool selection at scale different from service discovery — and it shows up in memory too, harder. The fix is the same: don’t expect retrieval to do what indexing should do; treat memory like RAG with extra signals, not like a database lookup.

Memory vs RAG — the distinction that keeps getting collapsed

A reasonable engineer will read all of this and ask: “isn’t memory just RAG with state?” The honest answer is no, and the easiest way to see why is to enumerate the differences.

RAG indexes a static corpus; memory indexes a growing stream. A RAG index is built once (or periodically) over a known body of documents. A memory store has a write happening on every relevant turn, and the write distribution shifts as the conversation evolves. The retrieval pass over a memory store has to weigh recency in a way RAG doesn’t, because the marginal document being added today is more likely to be relevant than the one from a year ago.

RAG retrieves to ground a single response; memory retrieves to maintain identity across responses. A RAG hit on the right Wikipedia article makes the next sentence correct. A memory hit on the right user-preference fact makes the next thousand responses feel like they’re talking to the same agent. The metric of success is different: RAG measures recall@k and faithfulness on a per-query basis; memory measures multi-session consistency, the user’s perception of being known, the agent’s ability to avoid re-asking what it was told yesterday. Most of the canonical memory benchmarks — LongMemEval (Wu et al., 2024), LoCoMo (Maharana et al., 2024) — measure this multi-session-recall axis explicitly. Standard RAG eval suites don’t.
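
The multi-session axis is easy to probe even before adopting a benchmark. The harness below is a rough sketch and does not reproduce how LongMemEval or LoCoMo score: the `Agent` signature, the substring check, and the two toy agents are all assumptions made for illustration.

```python
from typing import Callable

Agent = Callable[[str, str], str]  # (session_id, user_message) -> reply


def multi_session_recall(agent: Agent, told: list[str],
                         probes: list[tuple[str, str]]) -> float:
    """Tell facts in one session, probe in another, return the fraction recalled."""
    for fact in told:
        agent("session-1", fact)
    hits = 0
    for question, expected in probes:
        reply = agent("session-2", question)
        hits += expected.lower() in reply.lower()
    return hits / len(probes) if probes else 0.0


def make_agent(persistent: bool) -> Agent:
    """Toy agents for the harness: one shares notes across sessions, one does not."""
    shared: list[str] = []
    per_session: dict[str, list[str]] = {}

    def agent(session_id: str, message: str) -> str:
        notes = shared if persistent else per_session.setdefault(session_id, [])
        reply = "Noted. I know: " + "; ".join(notes)
        notes.append(message)
        return reply

    return agent


told = ["I'm vegetarian", "I live in Lisbon"]
probes = [("Any dinner ideas?", "vegetarian"), ("What's near me?", "Lisbon")]
print(multi_session_recall(make_agent(True), told, probes),
      multi_session_recall(make_agent(False), told, probes))
```

A per-query RAG metric would not separate these two agents, since both answer every question; only the cross-session probe does.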

RAG retrieval happens against a corpus the system didn’t author; memory retrieval happens against a corpus the system itself wrote. This sounds minor and is the source of half the failure modes. When the agent wrote the corpus, the agent has to distill before it writes (so the corpus doesn’t bloat with every utterance), reflect across writes (so the corpus develops higher-order claims), and resolve conflicts (so contradictory writes don’t poison retrieval). RAG inherits its corpus quality; memory builds its corpus quality.

Memory has a maintenance burden RAG doesn’t. A RAG index doesn’t need reflection, doesn’t need salience scoring, doesn’t need to handle the user contradicting themselves. A memory system needs all of these and the subtle thing is that they’re not optional: a memory system without them works fine for the first 100 sessions and then degrades.

That said: RAG is a component of memory. The episodic and semantic stores typically use the same retrieval cascade (vector search, hybrid, reranking, maybe query transformation) that pure-RAG systems use. The retrieval stack from this curriculum’s first half is the engine; the memory stack adds the layers above it (write policy, maintenance, multi-store coordination) that make it work for the agent use case. Calling memory “just RAG” is like calling a database “just a B-tree”: technically a B-tree is in there, but the system that wraps it is what makes it useful.

Where each piece fits — a tour of the next 20 articles

The Memory subtree will work through the stack in order. The map below is the table of contents for the next ~3 weeks of writing; it’s worth knowing the shape because each future article will assume you’ve internalized this one.

The conceptual frame. This article, then the cognitive-taxonomy article that goes deeper on the four memory types and the cache-hierarchy parallel.
The in-context layer. Short-term memory (the conversation buffer), working memory (scratchpads, blackboards, agent notebooks for state mid-task).
The storage layer. Long-term memory backed by vectors; knowledge graphs as the structured-memory alternative; hierarchical memory for the tiered version of all of it.
The write path. Write policies, episode segmentation and salience, reflection — turning observations into distilled, scored, higher-order entries.
The compaction path. Summarization and context compression, sleep-time compute for batch consolidation — the maintenance layer the OS analogy made explicit.
The read path. Retrieval policies (recency/importance/similarity formulas), temporal reasoning and provenance, conflict and forgetting and embedding drift.
The cross-cutting concerns. Procedural memory, cross-session identity, multi-agent shared memory, privacy and multi-tenancy.
Evaluation and production. Memory benchmarks and custom evals; production frameworks (MemGPT/Letta, Mem0, Zep, Graphiti) with a comparison matrix.

If a layer of this stack isn’t in your current system, you’ve made an implicit decision to do without it. Sometimes that’s the right call (a single-turn customer-support bot that resets on every session genuinely doesn’t need episodic memory). Sometimes it’s a bug that will surface six months from now (the “why doesn’t the agent remember me?” PM question).

Code: a minimal three-tier memory in Python

The smallest interesting build is a three-tier memory — in-context buffer, episodic vector store, semantic key-value store — with explicit read/write policies. The code below is deliberately framework-light so you can see what each layer is doing; later articles will replace pieces with Mem0, Letta, Zep, or LangGraph stores. Install: pip install anthropic chromadb. Uses the Anthropic SDK for the model and Chroma as the local vector store.

python

import time
import json
from anthropic import Anthropic
import chromadb

client = Anthropic()
chroma = chromadb.Client()  # in-memory only; chromadb.PersistentClient(path="./memory") would survive restarts
episodic = chroma.get_or_create_collection("episodic")
semantic: dict[str, str] = {}  # key-value: simplest semantic store

SYSTEM = (
    "You are a personal assistant with memory. When the user shares a durable "
    "preference, fact about themselves, or ongoing project, return a JSON line "
    "of the form {\"remember\": {\"key\": <slug>, \"value\": <fact>}} BEFORE your "
    "reply text. Only emit such a line when the fact is durable; do not store "
    "every utterance."
)

def write_episodic(session_id: str, role: str, text: str):
    """Append-only episodic log. Every meaningful turn goes here."""
    episodic.add(
        documents=[text],
        metadatas=[{"session": session_id, "role": role, "ts": time.time()}],
        ids=[f"{session_id}-{int(time.time()*1000)}"],
    )

def write_semantic(key: str, value: str):
    """Distilled, durable facts only. The model decides what qualifies."""
    semantic[key] = value

def read_memory(query: str, k: int = 5) -> str:
    """Compose the memory block injected before the user message."""
    # episodic: fetch top-k by similarity, then re-rank by similarity x recency
    scored = []
    n = min(k, episodic.count())  # avoid asking for more results than exist
    if n > 0:
        hits = episodic.query(query_texts=[query], n_results=n)
        now = time.time()
        for doc, meta, dist in zip(
            hits["documents"][0], hits["metadatas"][0], hits["distances"][0]
        ):
            age_days = (now - meta["ts"]) / 86400
            recency = max(0.1, 1.0 - age_days / 30)  # linear decay over ~30 days, floored at 0.1
            similarity = 1.0 / (1.0 + dist)  # map distance (smaller = closer) into (0, 1]
            scored.append((similarity * recency, doc))
    episodes = "\n".join(f"- {d}" for _, d in sorted(scored, reverse=True))

    # semantic: dump every known fact (small store; in prod, retrieve subset)
    facts = "\n".join(f"- {key}: {value}" for key, value in semantic.items())

    return f"## Known facts about user\n{facts or '(none)'}\n\n## Recent episodes\n{episodes or '(none)'}"

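def parse_write(text: str) -> tuple[str, dict | None]:
    """Strip the optional {"remember": ...} line and return (reply, write).

    `write` is the {"key": ..., "value": ...} dict, or None when the first
    line is not a well-formed directive.
    """
    lines = text.split("\n", 1)
    if lines[0].strip().startswith("{") and "remember" in lines[0]:
        try:
            write = json.loads(lines[0])["remember"]
            if not isinstance(write, dict) or "key" not in write or "value" not in write:
                return text, None
            return (lines[1] if len(lines) > 1 else ""), write
        except (json.JSONDecodeError, KeyError, TypeError):
            return text, None
    return text, None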

def turn(session_id: str, user_msg: str) -> str:
    memory_block = read_memory(user_msg)
    write_episodic(session_id, "user", user_msg)

    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        system=SYSTEM + "\n\n" + memory_block,
        messages=[{"role": "user", "content": user_msg}],
    )
    assistant_raw = "".join(b.text for b in resp.content if b.type == "text")
    reply, write = parse_write(assistant_raw)

    if write:
        write_semantic(write["key"], write["value"])
    write_episodic(session_id, "assistant", reply)
    return reply

# Day 1
print(turn("user-42", "I'm vegetarian and I'm planning a trip to Tokyo next month."))
# Day 7, new session
print(turn("user-42", "Can you suggest some places for lunch?"))

Three things to notice. First, write is an explicit decision, not a side effect: the model emits a {"remember": ...} line only when it judges the fact durable, and the parser separates that from the reply. A real system replaces this prompt-based convention with a calibrated classifier or a structured-output schema; the principle is the same. Second, read is not raw nearest-neighbour search: the similarity top-K is re-ranked with a recency weight, a stripped-down relative of the Generative Agents scoring that lets fresh episodes outrank stale ones. Third, the semantic and episodic stores are physically different objects (the semantic dict, the episodic Chroma collection) because they have different write policies, read patterns, and growth rates. Conflating them is the most common mistake.

What this code does not do, deliberately: no reflection (no batch consolidation of episodes into higher-order claims), no compaction (the dict grows forever), no conflict resolution (writing the same key twice silently overwrites), no provenance (you can’t trace a stored fact back to the episode it came from). Each of those is a future article in this subtree.

Code: a minimal three-tier memory in TypeScript

The TypeScript version uses LangGraph for the short-term checkpointer and a hand-rolled in-memory long-term store, to make the layering explicit. Install: npm install @langchain/langgraph @langchain/anthropic @langchain/core zod.

typescript

import { ChatAnthropic } from "@langchain/anthropic";
import {
  StateGraph,
  MemorySaver,
  Annotation,
  END,
} from "@langchain/langgraph";
import { HumanMessage, AIMessage, SystemMessage } from "@langchain/core/messages";

const model = new ChatAnthropic({ model: "claude-opus-4-7" });

// Storage tier (long-term, cross-session)
type Episode = { sessionId: string; role: string; text: string; ts: number };
const episodic: Episode[] = [];
const semantic = new Map<string, string>();

const recencyWeighted = (query: string, k: number): Episode[] => {
  // toy similarity: shared-word overlap; replace with a real embedding call in prod
  const qWords = new Set(query.toLowerCase().split(/\s+/));
  const now = Date.now();
  return episodic
    .map((e) => {
      const eWords = new Set(e.text.toLowerCase().split(/\s+/));
      const sim = [...qWords].filter((w) => eWords.has(w)).length / Math.max(qWords.size, 1);
      const ageDays = (now - e.ts) / 86_400_000;
      const recency = Math.max(0.1, 1.0 - ageDays / 30);
      return { e, score: sim * recency };
    })
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(({ e }) => e);
};

const readMemory = (query: string): string => {
  const facts = [...semantic.entries()].map(([k, v]) => `- ${k}: ${v}`).join("\n");
  const episodes = recencyWeighted(query, 5).map((e) => `- ${e.text}`).join("\n");
  return `## Known facts\n${facts || "(none)"}\n\n## Recent episodes\n${episodes || "(none)"}`;
};

// Short-term tier (in-context, per-session) — LangGraph checkpointer handles this
const StateAnnotation = Annotation.Root({
  messages: Annotation<(HumanMessage | AIMessage | SystemMessage)[]>({
    reducer: (a, b) => [...a, ...b],
    default: () => [],
  }),
  sessionId: Annotation<string>,
});

const callModel = async (state: typeof StateAnnotation.State) => {
  const userMsg = state.messages.at(-1)!.content as string;
  const memBlock = readMemory(userMsg);
  const sys = new SystemMessage(
    `You are a personal assistant with memory. When the user shares a durable preference or fact, emit a JSON line {"remember": {"key":"...","value":"..."}} BEFORE your reply.\n\n${memBlock}`,
  );

  episodic.push({ sessionId: state.sessionId, role: "user", text: userMsg, ts: Date.now() });
  const resp = await model.invoke([sys, ...state.messages]);
  const text = resp.content as string;

  // parse the optional write directive and strip it from the stored reply
  const [firstLine, ...rest] = text.split("\n");
  let reply = text;
  if (firstLine.trim().startsWith("{") && firstLine.includes("remember")) {
    try {
      const { remember } = JSON.parse(firstLine);
      if (remember?.key && remember?.value) {
        semantic.set(remember.key, remember.value);
        reply = rest.join("\n").trimStart();
      }
    } catch { /* ignore malformed write directive; keep the raw text as the reply */ }
  }
  episodic.push({ sessionId: state.sessionId, role: "assistant", text: reply, ts: Date.now() });
  return { messages: [resp] };
};

const graph = new StateGraph(StateAnnotation)
  .addNode("model", callModel)
  .addEdge("__start__", "model")
  .addEdge("model", END)
  .compile({ checkpointer: new MemorySaver() });

// Day 1, session "thread-42"
await graph.invoke(
  { messages: [new HumanMessage("I'm vegetarian and planning Tokyo next month.")], sessionId: "user-42" },
  { configurable: { thread_id: "thread-42" } },
);
// Day 7, fresh thread — short-term memory is gone, but long-term remains
await graph.invoke(
  { messages: [new HumanMessage("Where should I get lunch tomorrow?")], sessionId: "user-42" },
  { configurable: { thread_id: "thread-43" } },
);

Same three-tier shape. LangGraph’s MemorySaver checkpointer handles the short-term tier: its state is scoped to a thread_id, so a new thread starts without it, and because MemorySaver keeps checkpoints in process memory, they also disappear on restart. The episodic array and semantic map are the long-term tier and are shared across threads. In production, LangGraph’s Postgres checkpointer and BaseStore replace the in-memory placeholders, but the layering is the same. The TypeScript version uses a toy word-overlap similarity to keep the example free of an embedding-model call; the Python version’s Chroma is the closer-to-production shape.

Trade-offs and gotchas

The biggest mistake is building memory before the use case demands it. A single-turn assistant doesn’t need a memory stack; it needs a good prompt. A multi-turn assistant inside one session needs short-term memory only. Episodic and semantic stores earn their keep when the use case is cross-session and the value of remembering exceeds the cost of building the stack and the failure-mode tail. Most teams I’ve seen reach for Mem0 or Letta as their first move and would have been better served by getting RAG right first; most teams I’ve seen avoid memory entirely have a product that feels lobotomized.

The second biggest mistake is letting memory grow without a write policy. Storage is cheap; retrieval over noisy storage is expensive. Every byte written is a byte that competes for top-K slots at read time. A memory system without an explicit “is this worth remembering?” gate degrades silently and slowly. The Mem0 paper (Chhikara et al., 2025) makes this point empirically: their fact-extraction pipeline filters aggressively at write time and that’s where most of their accuracy lift comes from.

Eventual consistency is the default, and you usually don’t want strong consistency anyway. A multi-agent system writing to a shared memory store will not give you read-your-writes guarantees unless you build them. For most agent workloads this is fine — the model is lossy about retrieval anyway, and a 100ms write-to-read lag is invisible. For the cases where you need strong consistency (financial transactions, audit logs), you should not be using vector-backed memory for those facts; you should be using a regular database with the memory layer as a cache in front of it.

Embedding drift is a slow, invisible failure mode. If you upgrade the embedding model that backs your episodic store, every previously-stored vector becomes stale relative to new query vectors. The retrieval will degrade with no error log, no alert, and no obvious symptom — just slowly worse recall. Pin your embedding model version in your build, re-embed the entire store on any change, and treat embedding-model upgrades the same way you’d treat a schema migration. The embedding drift gotcha from the text-embeddings article is the underlying mechanic; memory is where it bites hardest because the corpus is constantly growing.
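
Treating an embedding upgrade like a schema migration starts with recording which model produced each vector. A minimal sketch, assuming a list-backed store: `StoredVector`, `stale_entries`, `migrate`, and the version strings are hypothetical, and `toy_embed_v2` is a placeholder for a real embedding call.

```python
from dataclasses import dataclass
from typing import Callable

Embedder = Callable[[str], list[float]]


@dataclass
class StoredVector:
    text: str
    vector: list[float]
    model_version: str  # which embedding model produced `vector`


def stale_entries(store: list[StoredVector], current_version: str) -> list[StoredVector]:
    """Entries embedded by any model other than the pinned one."""
    return [e for e in store if e.model_version != current_version]


def migrate(store: list[StoredVector], embed: Embedder, current_version: str) -> int:
    """Re-embed stale entries in place; return how many were migrated."""
    stale = stale_entries(store, current_version)
    for entry in stale:
        entry.vector = embed(entry.text)
        entry.model_version = current_version
    return len(stale)


def toy_embed_v2(text: str) -> list[float]:
    """Hypothetical stand-in for a real embedding call."""
    return [float(len(text)), float(text.count(" "))]


store = [
    StoredVector("user is vegetarian", [0.1, 0.2], "embed-v1"),
    StoredVector("user lives in Lisbon", [20.0, 3.0], "embed-v2"),
]
print(migrate(store, toy_embed_v2, "embed-v2"), stale_entries(store, "embed-v2"))
```

A non-empty `stale_entries` result is the alert that drift otherwise never raises: it can gate deploys or trigger the re-embedding job.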

“Memory” is overloaded vocabulary; pin which type before arguing about it. Half the disagreements about memory architecture I’ve seen on agent teams are people using “memory” to mean different things — one person means “the session buffer,” another means “the user-profile facts,” a third means “everything the agent has ever observed.” Get the four-type frame on the whiteboard and ask which type each disagreement is about; the disagreement usually evaporates.

The model’s context window is not a substitute for a memory layer. A 1M-token window does not give you memory; it gives you a bigger working set. Without persistence, every byte in the window is gone on the next call. The context engineering article made the working-set argument; the memory framing is the natural next step — the working set is what fits, memory is what survives.

What to read next

The Cognitive Taxonomy: Semantic, Episodic, Procedural — the direct sequel. This article named the four types; that one goes deep on each: working memory as L1, episodic as the write-ahead log, semantic as the keyed fact store, procedural as the JIT-compiled-routine cache. Runnable code for all four tiers.
Memory Write Policies: What’s Worth Remembering — the write-axis deep dive. The four-stage pipeline (triage, extract, dedupe, persist), the journal-and-checkpoint pattern that lets the read path stay fast as the corpus grows, and the hot-path-vs-deferred-vs-background trade-off that decides when the write pipeline runs relative to the user-facing turn.
Working Memory: Scratchpads, Blackboards, and Agent Notebooks — the second half of the in-context tier. Where short-term memory is the conversation buffer, working memory is the explicit, structured scratchpad the harness maintains separately: typed state objects, external notebooks, and shared blackboards for multi-agent systems.
Long-Term Memory: Vector-Backed Episodic Storage — the first storage-tier article. Where the in-context tier discards on call boundary, the storage tier persists across sessions; this piece works through episode boundaries, the WAL parallel, recency-weighted retrieval, and the production frameworks (Mem0, Letta, LangGraph stores) that productionize it.