jatin.blog ~ $
$ cat ai-engineering/hierarchical-memory.md

Hierarchical Memory: Working / Episodic / Semantic Tiers

Hierarchical memory: MemGPT/Letta's three-tier OS-paging model, what lives in core/recall/archival, and the promotion-demotion policies that bind them.

Jatin Bansal@blog:~/ai-engineering$ open hierarchical-memory

A long-running coding agent has every individual memory tier wired up correctly. The conversation buffer trims with head-tail eviction. The working-memory scratchpad tracks the plan and completed steps. The vector-backed long-term store holds three months of past episodes. The knowledge graph carries entities and bi-temporal validity. Every tier works in isolation. The agent still ships a regression: on turn 200 of a refactor it confidently asks the user a question whose answer is in the archival store, in the recall log, and on the user’s pinned profile card — three separate tiers, three separate copies of the same fact, none of them in the prompt the model just got. Each tier did its job. The system as a whole has no policy for which tier to read first, what to promote into the prompt, or what to demote out of it when budget tightens. That policy is what hierarchical memory is. This article is the deep dive on the architecture that makes the tiers cooperate — the MemGPT-style three-layer model, the OS-paging parallel that justifies the design, and the promotion/demotion policies that decide which bytes earn their place in context.

Opening bridge

The last three pieces walked the storage side of the memory stack. The long-term memory article built the vector-backed episodic substrate; the knowledge-graphs piece added the structural-index counterpart for entity-and-relation queries. Both are substrates — they answer the question “where does this fact physically live?” Hierarchical memory answers a different question: given the substrates, what’s the access policy? The memory stack overview named the in-context / storage boundary as the line every memory operation moves bytes across; the cognitive taxonomy split the storage side into episodic, semantic, and procedural. Today’s piece pulls those two framings together by treating the working / episodic / semantic split as a paged virtual address space — a small fast tier pinned in-context, a medium tier that pages in on demand, and a large cold tier the agent has to explicitly query. The model is the MemGPT paper (Packer et al., 2023), which is now the architectural reference for Letta and the design pattern most production memory frameworks borrow from.

Definition

Hierarchical memory is a tiered architecture where each tier has a fixed role in the read path, a defined promotion-and-demotion policy that moves bytes between tiers, and a cost profile that justifies why some content lives in one tier rather than another. Three properties make it specifically hierarchical rather than merely multi-store. First, the tiers are ordered by access cost: the fastest tier (in-context) is the smallest and most expensive per token; the slowest tier (cold storage) is the largest and cheapest per token. Second, every byte has a home tier and a defined path between tiers: promotion (cold → warm → hot) when access patterns demand it, demotion (hot → warm → cold) when budget pressure or staleness demands it. Third, the agent has first-class operations, usually tools, to move bytes between tiers, and the harness exposes those operations as a memory API rather than letting them happen as silent side effects.

What hierarchical memory is not. It is not a single store with metadata tags (that’s a flat store with categorization). It is not the same idea as a retrieval-augmented generation pipeline (RAG retrieves into context from a static corpus; hierarchical memory moves bytes between tiers of an agent’s own memory). It is not interchangeable with the cognitive taxonomy — the four cognitive types are about what kind of information (semantic vs episodic vs procedural vs working); hierarchical memory is about where in the cost hierarchy a given piece lives. The two framings are orthogonal: a semantic fact might live in the hot tier (the user’s name) or the cold tier (a rarely-needed config detail), and the placement decision is the hierarchical-memory question, not the cognitive-type question.

Intuition

The mental model that pays off is a virtual-memory subsystem with three physical layers and an explicit pager. The hot tier is the CPU’s L1 — a few KB the model sees on every call, always-resident, no retrieval cost. The warm tier is RAM — bigger, paged in by reference, cheap-to-read once retrieved but not free. The cold tier is disk — effectively unbounded, slow to query, never read on the fast path. The agent is the program, the harness is the kernel, and tool calls are the page-fault mechanism that promotes a cold-tier address into the hot tier on demand. Demotion is the inverse — the kernel evicts cold pages back to disk when RAM pressure mounts.

The reason this analogy is load-bearing rather than decorative: the engineering trade-offs that govern OS virtual memory are the same ones that govern agent memory. Why pin the page table in physical memory? Because every memory access goes through it; if it paged out, every access would itself page-fault and the system would thrash. The equivalent for an agent is the user identity, the current task description, and a small block of pinned facts. These get hit on every turn, and paying a retrieval round-trip for them would be the same kind of catastrophic thrashing. Why have a working set that’s smaller than physical memory? Because the cost of refilling the working set after a context switch is proportional to its size; an over-large working set spends most of its time being repopulated rather than used. The agent equivalent: an over-large in-context block burns tokens on every call for content the next turn won’t even read, and it dilutes the model’s attention, which makes the lost-in-the-middle effect worse.

Two concrete decisions force themselves on every hierarchical-memory design. The first is what gets pinned: what content is so universally relevant that it earns the always-resident tier? The second is what triggers promotion: does the agent self-page (it decides via a tool call that it needs to read archival), or does the harness pre-page (it decides, based on a query embedding, which archival rows to inject before the model runs)? The first is the working-set-size question; the second is the demand-paging-vs-prefetching question from operating systems. Both questions have defensible answers in either direction; getting them wrong is what produces either the over-pinned agent (slow and expensive) or the under-pinned agent (constantly thrashing on tool calls to look up the user’s name).

The distributed-systems parallel

Three parallels, each load-bearing.

The agent’s memory tiers are a CPU cache hierarchy. L1 is core/in-context memory: tiny, always resident, and read with no retrieval round-trip. L2/L3 is recall memory: medium-sized, paged in by reference, at roughly single-digit-millisecond lookup latency. Disk is archival memory: effectively unbounded, tens of milliseconds per lookup, accessed by explicit query. (These latencies describe the agent tiers, not the hardware; real CPU caches answer in nanoseconds.) The CPU cache hierarchy’s defining decisions (separate instruction and data caches, write-through vs write-back, inclusive vs exclusive caching) all have agent-memory analogues. The cognitive taxonomy article already pointed at the four-type-to-cache-tier mapping; hierarchical memory is the operational version of that mapping: the policies that decide when each tier gets read and written, not just what each tier is for.

MemGPT’s virtual context management is OS paging, almost literally. The MemGPT paper frames the LLM context window as physical memory and the persistent stores as disk; the system uses function calls as the page-fault mechanism to move information between the two. When the agent decides it needs a fact that isn’t in the prompt, it calls a recall_search or archival_search tool — the equivalent of a page-fault that traps to the kernel. The kernel (the harness) fetches the requested page (the matching memory entries) and writes them into the function-call return, which the model then sees on the next forward pass. The trap-and-fetch loop is identical in shape to a page fault on a modern OS; the only difference is that the unit of paging is a memory entry rather than a 4KB page. This is why the OS analogy is operational rather than aesthetic — the entire design vocabulary (page tables, working sets, TLB shootdowns, demand paging, write-back) ports over with very few modifications.
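The trap-and-fetch loop described above can be sketched in a few lines. This is a toy illustration, not MemGPT’s or Letta’s implementation: the `Pager` class and the `answer` helper are hypothetical names, and keyword overlap stands in for embedding search so the example needs no dependencies.

```python
from dataclasses import dataclass, field


@dataclass
class Pager:
    """Toy 'kernel': owns the cold store and services page faults."""
    archival: list[str] = field(default_factory=list)
    faults: int = 0  # how many times the agent trapped to the harness

    def archival_search(self, query: str, k: int = 2) -> list[str]:
        # Hypothetical tool handler. Keyword overlap is an assumed stand-in
        # for embedding similarity.
        self.faults += 1
        terms = set(query.lower().split())
        scored = [(len(terms & set(e.lower().split())), e) for e in self.archival]
        hits = [e for score, e in sorted(scored, key=lambda p: -p[0]) if score > 0]
        return hits[:k]


def answer(context: list[str], pager: Pager, needed: str) -> list[str]:
    """Return the context the model would see on its next forward pass."""
    if any(needed.lower() in line.lower() for line in context):
        return context  # hit: the fact is already resident
    # Miss: the 'page fault'. The fetched entries come back as a tool
    # result, which is how they become visible to the model.
    fetched = pager.archival_search(needed)
    return context + [f"[tool result] {e}" for e in fetched]


pager = Pager(archival=["user prefers the Alfama neighborhood", "flight lands 09:40"])
ctx = answer(["persona: travel assistant"], pager, "Alfama neighborhood")
```

The first call faults once and appends the fetched entry; calling `answer` again with the returned context is a hit and leaves `pager.faults` unchanged.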

Promotion and demotion policies are cache-replacement policies. When the in-context block fills up, something has to give. The same algorithm families that govern CPU cache replacement (LRU, LFU, ARC, CLOCK) apply to the in-context block — Letta’s core memory blocks are explicitly designed for agent-driven LRU-ish replacement (the agent decides what to overwrite when a new fact takes priority). The memory retrieval policies article later in this subtree goes deep on the read-time scoring formulas; hierarchical memory is where the replacement-policy version of the same problem lives. A future memory-conflict piece will close the loop on what happens when promotion races with demotion — the agent equivalent of the cache-coherence problem in multi-core systems.

The three-tier reference architecture

The reference design the rest of this article works against is the MemGPT paper’s two-tier model extended to three tiers by Letta’s productionization. The three tiers, top to bottom:

Core memory (the hot tier). A small, always-in-context block — typically a few hundred to a few thousand tokens — partitioned into named blocks. Letta’s defaults are persona (who the agent is), human (what the agent knows about the user), and a task block. Mem0’s equivalent is the pinned-facts layer. The block is always in the prompt, on every turn, and is the only tier the model can read without a tool call. The agent can self-edit it via core_memory_append and core_memory_replace tools, but the block is bounded — when it fills, the agent has to choose what to overwrite or what to demote to a slower tier.

Recall memory (the warm tier). Searchable conversation history, stored outside the prompt but queryable by tool call. Letta backs it with a database table of message turns; the agent invokes a recall_search tool (date filter or text filter) to retrieve matching turns into context for the current call. The retrieved turns are not persistent in the prompt — they’re injected as tool results for the current step and evicted naturally as the buffer rolls forward, the same way a paged-in disk page leaves the cache when the working set shifts. This is the tier that holds the long-term episodic store in its operational role — the per-turn or per-exchange log, indexed for similarity and recency.

Archival memory (the cold tier). Semantically searchable cold storage for facts, knowledge, summarized passages, and anything else that doesn’t need to be on the fast path. The agent invokes an archival_search tool (semantic similarity over an embedding index) to retrieve from this tier on demand. The archival tier is where reflection outputs land (distilled facts from a window of episodes), where imported documents go, and where the agent’s procedural-memory equivalent (cached recipes, successful action sequences) usually lives. Letta backs it with a vector store; Mem0 backs it with its primary Qdrant/Chroma collection plus the optional Mem0g graph extension.

The reason this is three tiers and not five (working, short-term, long-term episodic, long-term semantic, procedural) is that the working/short-term split is within the core tier (different blocks of the same in-context surface), and the episodic/semantic/procedural split is within the archival tier (different metadata tags on the same vector store). The tier count is about access path, not about content type — three access paths, four content types, the matrix cells get populated as the workload demands.

Promotion: how bytes get hotter

A byte is in the warm tier; the agent needs it on every turn. How does it get promoted to the hot tier? Two policies, both worth knowing.

Agent-driven promotion (the MemGPT default). The agent reads a fact via recall_search or archival_search, recognizes it as something the user will reference repeatedly this session, and emits a core_memory_append tool call to pin it into the core block. The promotion is deliberate and visible; the agent has agency over what gets hot. The trade-off is that the agent has to be prompted to do this — without explicit instructions in the system prompt (“if you find yourself repeatedly looking up the same fact, pin it to core memory”), the model won’t promote on its own, and you’ll end up with a recall-thrashing pattern where every turn calls recall_search for the same fact.

Harness-driven promotion (the prefetch pattern). The harness, before the model runs, retrieves a small number of high-scoring entries from the warm or cold tier and injects them into the prompt as if they were in core memory. The agent doesn’t have to ask; the kernel pre-pages. This is the pattern Mem0’s memory.search follows when invoked at turn start — the harness embeds the latest user query, retrieves top-K facts from the vector store, and renders them as a “## Relevant memories” block in the system prompt. The trade-off is the inverse: the agent never knows a byte is promoted (it just appears in context), which means promotion costs token budget on every turn whether the agent would have asked for it or not. This is exactly the demand-paging-vs-prefetching trade-off from OS design, and the same answer holds: hybrid wins — pre-page the cheap, high-confidence pages (the user’s name, the active task) and demand-page the speculative ones.
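A minimal sketch of the hybrid policy, assuming a confidence threshold decides what gets pre-paged. The function names, the `0.3` threshold, and the `max_prefetch` cap are assumptions for illustration; Jaccard token overlap stands in for the embedding similarity a real harness would use.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set overlap; an assumed stand-in for embedding cosine similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def prefetch(query: str, store: list[str], threshold: float = 0.3,
             max_prefetch: int = 3) -> list[str]:
    """Harness-side pre-paging: keep only high-confidence matches."""
    scored = sorted(((jaccard(query, e), e) for e in store), reverse=True)
    return [e for s, e in scored if s >= threshold][:max_prefetch]


def build_system_prompt(core: str, query: str, store: list[str]) -> str:
    hits = prefetch(query, store)
    if not hits:
        return core  # nothing confident enough; leave it to demand paging
    block = "\n".join(f"- {h}" for h in hits)
    return f"{core}\n\n## Relevant memories\n{block}"


store = [
    "user is vegetarian",
    "lisbon trip planned for july",
    "user owns a red bicycle",
]
prompt = build_system_prompt("persona: travel assistant",
                             "is the lisbon trip in july", store)
```

Only the Lisbon entry clears the threshold here, so it is the only one that costs tokens on this turn; the low-scoring entries stay in the store for the agent to demand-page if it needs them.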

Two write-time companions to promotion. Pinning is the explicit “this content is hot, do not demote without asking” annotation — the persona block in Letta, the user-profile facts pinned by Mem0’s classifier, the system-prompt fragments that an agent harness renders unconditionally. Hot-set learning is the inverse — the harness observes which warm-tier entries get promoted often and auto-promotes them ahead of demand, the same way a TLB amortizes the cost of repeated address translations. The hot-set is also what should seed a fresh session’s core block — the user’s most-promoted facts from the prior session are excellent priors for the new session’s hot tier.
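Hot-set learning can be approximated with a read counter. The sketch below is one possible shape, not a Letta or Mem0 feature: `HotSetTracker`, the `promote_after` threshold, and the entry ids are all hypothetical.

```python
from collections import Counter


class HotSetTracker:
    """Counts demand-paged reads and flags entries worth pre-promoting."""

    def __init__(self, promote_after: int = 3):
        self.promote_after = promote_after  # assumed threshold; tune per workload
        self.reads = Counter()

    def record_read(self, entry_id: str) -> bool:
        """Record one warm-tier read; True when the entry crosses the threshold."""
        self.reads[entry_id] += 1
        return self.reads[entry_id] == self.promote_after

    def seed_for_next_session(self, n: int = 2) -> list[str]:
        """Most-read entries, usable as priors for a fresh session's core block."""
        return [entry_id for entry_id, _ in self.reads.most_common(n)]


tracker = HotSetTracker()
events = ["diet", "hotel", "diet", "diet", "hotel"]
promoted = [e for e in events if tracker.record_read(e)]
```

`record_read` fires exactly once per entry, on the read that reaches the threshold, so the harness can promote without re-promoting on every later read.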

Demotion: how bytes get cooler

The inverse problem and the harder one. The hot tier is small; when a new fact wants in, an old fact has to leave. The choice of which old fact to demote is the cache-replacement problem applied to agent memory.

Three demotion policies, in order of how often each is right.

LRU within the hot tier. Demote the core-memory block that has been read or referenced the least recently. Cheap to implement, well-understood, the default for almost every cache. The complication for agents: “read” is hard to detect — the model attends to in-context content implicitly, so the harness doesn’t see a clear read signal. The proxy is to track which blocks the model cites in its output (or which blocks were referenced in the last K turns) and demote the rest.
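One way to implement the citation proxy is sketched below. It assumes each block carries a small set of keywords whose appearance in the model’s reply counts as a “read”; that keyword matching, along with the `CitationLRU` name, is an assumption of this example and a crude substitute for a real attribution signal.

```python
from typing import Optional


class CitationLRU:
    """LRU over core blocks, using 'mentioned in the output' as the read signal."""

    def __init__(self, blocks: dict[str, set[str]], pinned: set[str]):
        # blocks maps a block label to keywords that suggest the model used it.
        self.blocks = blocks
        self.pinned = pinned  # never eligible for demotion
        self.last_used = {label: 0 for label in blocks}
        self.turn = 0

    def observe(self, model_output: str) -> None:
        """Call once per turn with the model's reply."""
        self.turn += 1
        words = set(model_output.lower().split())
        for label, keywords in self.blocks.items():
            if words & keywords:
                self.last_used[label] = self.turn

    def victim(self) -> Optional[str]:
        """Least recently cited unpinned block, or None if everything is pinned."""
        candidates = [label for label in self.blocks if label not in self.pinned]
        return min(candidates, key=self.last_used.__getitem__, default=None)


lru = CitationLRU(
    blocks={
        "persona": {"assistant"},
        "diet": {"vegetarian", "peanut"},
        "hotel": {"hotel", "alfama"},
    },
    pinned={"persona"},
)
lru.observe("here are three vegetarian lunch spots")
lru.observe("those are all peanut free")
```

After these two turns the `hotel` block has never been cited, so it is the demotion candidate; a later reply that mentions the hotel would shift the candidate to `diet`.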

Agent-driven explicit demotion. The agent decides which core block is no longer relevant and emits a demotion tool call: either a dedicated core_memory_demote tool, or an archival_insert followed by a core_memory_clear. (Those names are illustrative rather than stock Letta tools; with Letta’s built-in tools the equivalent is an archival insert followed by a core_memory_replace that removes the text.) This is what Letta defaults to: the agent self-manages its core blocks because it has the best signal for what’s no longer task-relevant. The trade-off is again prompt-sensitivity: the agent needs explicit instructions and tool affordances, and a vague prompt produces an over-pinned core block that never evicts.

Importance-decay demotion. Each core block carries an importance score (assigned at write time, like the Generative Agents importance score), and demotion is by lowest-importance-first. This works well when the write-time classifier produces calibrated scores and badly when it doesn’t; the failure mode is over-confident classifiers that mark every fact as a 10/10 and the demotion becomes effectively random.
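A sketch of lowest-importance-first demotion, with a cheap guard for the everything-is-10/10 failure mode. The `Fact` fields, the age tie-break, and the `min_spread` threshold are assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import pstdev


@dataclass
class Fact:
    text: str
    importance: int  # 1-10, assigned by a write-time classifier (assumed)
    written_at: int  # logical timestamp; smaller means older


def demote_candidate(facts: list[Fact]) -> Fact:
    """Lowest importance first; among ties, the oldest fact goes first."""
    return min(facts, key=lambda f: (f.importance, f.written_at))


def scores_look_uncalibrated(facts: list[Fact], min_spread: float = 1.0) -> bool:
    """Flag a classifier whose scores barely vary (e.g. everything is 10/10)."""
    return pstdev([f.importance for f in facts]) < min_spread


facts = [
    Fact("allergic to peanuts", 10, 1),
    Fact("likes window seats", 3, 2),
    Fact("asked about weather", 3, 5),
]
flat = [Fact("a", 10, 1), Fact("b", 10, 2)]
```

When the spread check trips, the importance score carries no information and the policy is effectively ordering by age alone, which is a reasonable cue to fall back to LRU.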

The deeper problem with demotion is the loss-of-history failure mode. If a core block gets demoted to a slower tier, that tier has to actually store it; otherwise the demotion is silent data loss. Letta’s pattern is to insert into archival memory on demotion, preserving the fact as a searchable entry even when it’s no longer in-context. The harness invariant: demotion is never a deletion; it is a tier-shift. Deletion is a separate, explicit operation, and a future memory-conflict-and-forgetting article covers when explicit deletion is the right answer.

What lives in each tier (and why)

A pragmatic guide, derived from production patterns across Letta, Mem0, and the MemGPT paper:

Core (hot) tier. The user identity (name, role, account). The active task description. A few critical preferences (“user is vegetarian, allergic to peanuts”). The agent’s persona/instructions for this conversation. The last 3-5 turns of conversation. Anything the model reads on every call. Bounded by token budget — typically 1-4K tokens in production.

Recall (warm) tier. The full conversation log for the current session and recent past sessions, indexed for date- and text-search. Tool-call results from earlier in the session that the model might need to re-examine. The full episodic store from long-term memory — every meaningful turn, retrievable by similarity and recency. Bounded by the storage backend, not by tokens — millions of entries are fine.

Archival (cold) tier. Reflection outputs: distilled facts derived from windows of episodes. Imported documents the agent should be able to consult (user-uploaded PDFs, account history, knowledge-base articles). Cached procedures from procedural memory. Older sessions’ summaries. The knowledge graph from the graph memory article often sits at this tier — entities and relations queried via traversal, not pulled into context speculatively.

The placement decision for any given fact reduces to one question: how often will the model read this on the average turn? If the answer is “every turn,” it’s a hot-tier candidate. If “occasionally,” warm tier. If “rarely but it must be findable,” cold tier. The classifier that makes this decision can be a heuristic (anything tagged persona is hot, anything tagged event is warm, anything tagged fact_with_low_priority is cold) or a model call at write time; the right choice tracks the workload’s variance. A workload where the hot-tier content is stable across sessions (a personal assistant for one user) can use heuristics; a workload where every conversation has different “always-relevant” content (a customer-support bot serving thousands of accounts) is closer to needing a learned classifier.
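The heuristic version of that classifier is small enough to show in full. The tag vocabulary mirrors the examples in the paragraph above; the `place` function and the read-rate thresholds (`0.8`, `0.05`) are assumptions chosen for illustration, not measured values.

```python
from enum import Enum
from typing import Optional


class Tier(Enum):
    HOT = "core"
    WARM = "recall"
    COLD = "archival"


# Assumed tag vocabulary, taken from the examples in the text.
TAG_TO_TIER = {
    "persona": Tier.HOT,
    "event": Tier.WARM,
    "fact_with_low_priority": Tier.COLD,
}


def place(tag: str, reads_per_turn: Optional[float] = None) -> Tier:
    """Pick a home tier from an observed read rate if known, else from the tag."""
    if reads_per_turn is not None:
        if reads_per_turn >= 0.8:   # read on (almost) every turn
            return Tier.HOT
        if reads_per_turn >= 0.05:  # read occasionally
            return Tier.WARM
        return Tier.COLD            # rare, but must stay findable
    return TAG_TO_TIER.get(tag, Tier.WARM)  # unknown tags default to warm
```

An observed read rate overrides the tag, so `place("event", reads_per_turn=0.9)` lands in the hot tier even though events default to warm. That override is the seam where a learned classifier would plug in.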

Code: Python — a three-tier hierarchical memory against Letta

The smallest interesting build: a Letta-backed agent with core, recall, and archival memory, where the agent self-manages tier promotion via tool calls. Install: pip install letta-client and start a local Letta server (docker run -d -p 8283:8283 letta/letta:latest). Letta is the canonical productized version of MemGPT’s hierarchical architecture; rolling your own would reproduce roughly the same code.

python
# pip install "letta-client>=0.5"
from letta_client import Letta

client = Letta(base_url="http://localhost:8283")

# --------- agent creation: define the hot tier explicitly ---------
# Memory blocks are the core (hot) tier. Each block has a label, a value,
# and a size limit; the agent reads and self-edits them via tools.
agent = client.agents.create(
    name="travel-assistant",
    model="anthropic/claude-opus-4-7",
    embedding="openai/text-embedding-3-small",
    memory_blocks=[
        {
            "label": "persona",
            "value": "I am a travel-planning assistant. I remember the user's "
                     "dietary needs and travel constraints across sessions.",
            "limit": 1000,  # size cap for this block; Letta counts characters, not tokens
        },
        {
            "label": "human",
            "value": "(empty — populate as you learn about the user)",
            "limit": 2000,
        },
    ],
)

# --------- conversation: the agent self-manages the hierarchy ---------
# When the agent learns something durable, it calls core_memory_append
# (promotes to hot tier) or archival_memory_insert (writes to cold tier).
# When it needs a fact not in context, it calls archival_memory_search
# (cold tier) or conversation_search (warm tier). Exact tool names vary
# with Letta version and agent type.
def turn(user_msg: str) -> str:
    resp = client.agents.messages.create(
        agent_id=agent.id,
        messages=[{"role": "user", "content": user_msg}],
    )
    # Letta returns the full step trace; the last assistant message is the reply.
    return next(
        m.content for m in reversed(resp.messages)
        if m.message_type == "assistant_message"
    )

# Tuesday: the agent learns durable facts.
# Watch the agent self-promote to core memory (the human block).
print(turn(
    "I'm vegetarian, allergic to peanuts, and traveling with a toddler. "
    "Planning a 5-day Lisbon trip in July."
))

# Thursday: a fresh API call, the core memory persists across sessions.
# The agent has the user's profile in its hot tier without needing to ask.
print(turn("What restaurants should we try for lunch?"))

# Months later: a question about something only the recall/archival tier has.
# The agent demand-pages by calling conversation_search or archival_memory_search.
print(turn("When we talked about Lisbon, did I mention any specific neighborhoods?"))

Three things to notice. First, the memory blocks are explicit and bounded: the limit on each block is a hard cap on that part of the hot tier (Letta measures it in characters rather than tokens); the agent has to evict or summarize when it hits the cap, which is the cache-pressure signal that drives demotion. Second, the agent self-manages the hierarchy via tools: Letta auto-injects the memory-management tools (core_memory_append, core_memory_replace, archival_memory_insert, archival_memory_search, conversation_search; exact names vary with Letta version and agent type) and the system prompt explaining when to use each; the OS-paging operations are first-class to the agent rather than hidden inside the harness. Third, the persistence is automatic: the persona and human blocks survive across messages.create calls because Letta stores them on the agent, the same way an OS process’s pinned pages survive context switches. The hierarchy isn’t a single-session feature; it’s the structural property that makes the agent stateful across sessions.

For deeper customization (custom blocks beyond persona/human, sleep-time agents that consolidate the archival tier in the background, multi-agent shared blocks), the Letta memory management docs are the reference. The sleep-time agent pattern — a secondary agent that runs in the background to consolidate fragmented memories, deduplicate the archival tier, and promote frequently-accessed warm entries into the hot tier — is the production version of the hot-set-learning idea, and a future sleep-time compute article in this subtree will work through it in detail.

Code: TypeScript — a hand-rolled three-tier hierarchy against LangGraph + Postgres

The TypeScript version builds the three tiers manually against LangGraph stores so the tier mechanics are visible. The pattern would apply equally to Mem0 or Letta; the value of the hand-roll is to see what the framework hides. Install: npm install @langchain/langgraph @langchain/anthropic @langchain/openai @langchain/core.

typescript
import { ChatAnthropic } from "@langchain/anthropic";
import { OpenAIEmbeddings } from "@langchain/openai";
import { InMemoryStore } from "@langchain/langgraph";
import { HumanMessage, SystemMessage } from "@langchain/core/messages";
import { randomUUID } from "node:crypto";

const model = new ChatAnthropic({ model: "claude-opus-4-7" });
const embeddings = new OpenAIEmbeddings({ model: "text-embedding-3-small" });

// --------- the three tiers ---------
// Hot tier: bounded, in-context, always rendered. Hand-rolled as a map of
// named blocks with explicit token budgets.
type CoreBlock = { value: string; limit: number; updatedAt: number };
const coreMemory = new Map<string, Map<string, CoreBlock>>(); // user -> blocks

// Warm tier: full conversation log + tool results, retrievable by search.
// Backed by LangGraph's InMemoryStore in this example; use the Postgres
// store in production.
const recall = new InMemoryStore({
  index: { dims: 1536, embeddings },
});

// Cold tier: distilled facts, summaries, imported documents.
const archival = new InMemoryStore({
  index: { dims: 1536, embeddings },
});

// --------- core-tier rendering (always-in-context) ---------
const renderCore = (user: string): string => {
  const blocks = coreMemory.get(user);
  if (!blocks || blocks.size === 0) return "";
  const sections = [...blocks.entries()].map(
    ([label, b]) => `### ${label}\n${b.value}`,
  );
  return "## Core memory (hot tier — always in context)\n" + sections.join("\n\n");
};

// --------- promotion: warm/cold -> hot ---------
// Two tools the agent can call. The harness routes the tool calls to these.
const coreAppend = async (user: string, label: string, content: string): Promise<string> => {
  const blocks = coreMemory.get(user) ?? new Map<string, CoreBlock>();
  const existing = blocks.get(label);
  const next = existing ? existing.value + "\n" + content : content;
  // Bound check: this is the cache-pressure signal that triggers demotion.
  const limit = existing?.limit ?? 1000;
  if (existing && next.length > limit * 4 /* rough chars-per-token estimate */) {
    // Demote the block's existing content to archival and wait for the write
    // to land before overwriting, so a failed write cannot lose the data.
    await archivalInsert(user, `[demoted from core/${label}] ${existing.value}`);
    blocks.set(label, { value: content, limit, updatedAt: Date.now() });
  } else {
    blocks.set(label, { value: next, limit, updatedAt: Date.now() });
  }
  coreMemory.set(user, blocks);
  return `Pinned to core/${label}`;
};

// --------- demotion: hot -> cold ---------
const archivalInsert = async (user: string, content: string): Promise<string> => {
  await archival.put(["archival", user], randomUUID(), { content, ts: Date.now() });
  return "Archived";
};

// --------- warm-tier and cold-tier reads (paged in on demand) ---------
const recallSearch = async (user: string, query: string, k = 5): Promise<string[]> => {
  const hits = await recall.search(["recall", user], { query, limit: k });
  return hits.map((h) => (h.value as { content: string }).content);
};

const archivalSearch = async (user: string, query: string, k = 5): Promise<string[]> => {
  const hits = await archival.search(["archival", user], { query, limit: k });
  return hits.map((h) => (h.value as { content: string }).content);
};

// --------- the turn ---------
// The system prompt teaches the agent the hierarchy and how to use the tools.
// Anthropic's tool-calling format makes the promotion/demotion explicit.
const SYSTEM_TEMPLATE = (user: string) => `
You are a travel-planning assistant with hierarchical memory.

You have three memory tiers, ordered by access cost:
1. **Core memory (hot)**: always in your context. Read it freely.
2. **Recall memory (warm)**: full conversation log. Call recall_search to query.
3. **Archival memory (cold)**: durable facts, summaries. Call archival_search to query.

Promotion: when a fact is clearly durable and you'll need it every turn, call
core_memory_append to pin it. Don't pin trivia.

Demotion: when a core block fills, the harness auto-demotes the oldest content
to archival. Don't try to manage this manually.

${renderCore(user)}
`;

export const turn = async (user: string, userMsg: string): Promise<string> => {
  // Log the user turn to recall (the warm tier; full conversation history).
  await recall.put(["recall", user], randomUUID(), {
    content: `[user] ${userMsg}`,
    ts: Date.now(),
  });

  // Run the model with the rendered core in the system prompt.
  // In production, you'd bind tools here; for brevity the agent's tool-call
  // dispatch is omitted (it would route core_memory_append, recall_search,
  // archival_search, archival_insert calls to the functions above).
  const resp = await model.invoke([
    new SystemMessage(SYSTEM_TEMPLATE(user)),
    new HumanMessage(userMsg),
  ]);

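  // content may be a plain string or an array of content blocks.
  const reply = typeof resp.content === "string" ? resp.content : JSON.stringify(resp.content);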
  await recall.put(["recall", user], randomUUID(), {
    content: `[assistant] ${reply}`,
    ts: Date.now(),
  });
  return reply;
};

// Demo: the same pattern as Letta — agent learns, pins to core, persists.
await turn("u-42", "I'm vegetarian, allergic to peanuts, traveling with a toddler.");
// Agent should have called core_memory_append("human", "vegetarian, peanut-allergic, traveling with toddler")
await turn("u-42", "What lunch spots in Lisbon should we try?");
// Agent reads the core block in the system prompt; no tool call needed for the dietary facts.

The same three-tier shape, hand-rolled. The pieces that Letta does for you and that this code exposes: the renderCore step that always-injects the hot tier, the coreAppend step that handles the eviction-and-demote on cache pressure, the recallSearch/archivalSearch as explicit tool-shaped operations. A real production version would add the missing dispatch loop (the tool-result-handling loop covered in the tool use article and the agent loop article), proper tokenization for the limit check, and a sleep-time pass that consolidates the warm tier into the archival tier on a schedule.

The interesting design decision in the hand-roll is the automatic demotion on hot-tier overflow: when coreAppend would push the block past its limit, the harness automatically writes the existing content to archival before overwriting. This is the write-back-on-eviction pattern from cache and OS design: evicted data is flushed to the slower tier before its slot is reused, so demotion never loses data. Skipping this step is the single most common bug I’ve seen in hand-rolled hierarchical-memory implementations; the agent prunes its core block, the demoted content is gone forever, and a future query fails because the supposedly-archived fact never made it to archival.

Trade-offs, failure modes, and gotchas

The over-pinned-core failure mode. A core memory block with no demotion policy grows until it hits the agent’s context limit, then crashes the next call with context_length_exceeded. Or it stays under the limit but balloons to the point where the lost-in-the-middle effect degrades attention on the parts of the core block that actually matter. The fix is to keep each core block small (a few hundred tokens) and to bias toward demotion in the agent’s instructions (“pin only what you’ll read every turn; archive everything else”). Letta’s per-block limit enforces this structurally; hand-rolled implementations should mirror it.

The under-pinned-core failure mode. The inverse. The agent demotes too aggressively; the core block is empty; every turn pages in the same handful of facts from the warm tier via recall_search. Same problem as TLB thrashing — the working set doesn’t fit in the hot tier and the system spends most of its budget on page faults. Diagnostic signal: the agent calls the same recall_search query three turns in a row; that’s a hot-set candidate that should have been pinned. Mitigation: explicit instruction to pin (in the system prompt) when the agent notices repeated reads, plus the harness-driven hot-set-learning pattern that observes recall_search hit patterns and auto-promotes.
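The diagnostic signal is easy to automate. The sketch below flags a query repeated on consecutive turns; `ThrashDetector` is a hypothetical helper, and the exact-match comparison after whitespace and case normalization is an assumption (real agents rephrase queries, so a production version would more likely compare embeddings).

```python
from collections import deque


class ThrashDetector:
    """Flags a recall query the agent repeats on consecutive turns."""

    def __init__(self, window: int = 3):
        self.window = window
        self.recent = deque(maxlen=window)  # one normalized query per turn

    def record(self, query: str) -> bool:
        """Record this turn's recall query; True means 'pin this instead'."""
        self.recent.append(" ".join(query.lower().split()))
        return len(self.recent) == self.window and len(set(self.recent)) == 1


detector = ThrashDetector()
flags = [detector.record(q)
         for q in ["user name", "User  name", "user name", "hotel address"]]
```

The third identical query trips the detector; the fourth, a different query, resets the signal because the window no longer holds a single repeated value.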

The promotion-without-demotion silent leak. An agent that promotes facts to core but never demotes is a memory leak. By session 50 the core block holds 47 facts and the model is reading 4K tokens of stale preferences on every turn. The fix is invariant-driven demotion: every promotion must check the block’s budget and demote the oldest entry if the new entry would breach it. Skipping this check is among the most common bugs in hand-rolled hierarchical-memory implementations.
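A minimal sketch of that invariant, assuming a block made of timestamped entries. `CoreEntry`, `CoreBlock`, and `promoteWithBudget` are hypothetical names, and character counts stand in for a real tokenizer.

```typescript
// Hypothetical shapes; character counts stand in for real token counts.
interface CoreEntry { id: string; text: string; pinnedAt: number; }
interface CoreBlock { limitChars: number; entries: CoreEntry[]; }

function blockSize(block: CoreBlock): number {
  let total = 0;
  for (const e of block.entries) total += e.text.length;
  return total;
}

// Promote `entry`, then demote the oldest other entries until the budget holds.
// Every demoted entry is appended to `archive` before it leaves the block.
function promoteWithBudget(
  block: CoreBlock,
  entry: CoreEntry,
  archive: CoreEntry[],
): CoreEntry[] {
  if (entry.text.length > block.limitChars) {
    throw new Error("entry alone exceeds the block limit; archive it instead");
  }
  block.entries.push(entry);
  const demoted: CoreEntry[] = [];
  while (blockSize(block) > block.limitChars) {
    // Pick the oldest entry, never the one just promoted.
    let victim: CoreEntry | null = null;
    for (const e of block.entries) {
      if (e !== entry && (victim === null || e.pinnedAt < victim.pinnedAt)) victim = e;
    }
    if (victim === null) break; // only the new entry is left, and it fits
    const gone = victim;
    archive.push(gone); // write back first, so nothing is lost
    block.entries = block.entries.filter((e) => e !== gone);
    demoted.push(gone);
  }
  return demoted;
}
```

The point of returning the demoted entries is observability: a harness can log them, and a block that demotes on nearly every promotion is a sign the limit is too tight for the working set.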

The promotion-as-deletion bug. The agent promotes a fact to core and the harness deletes it from the warm tier (figuring “it’s hot now, we don’t need it twice”). On the next demotion, the fact disappears from the system entirely. Always keep the warm-tier copy as a fallback; promotion is projection, not move. The cost is a small storage overhead; the benefit is the system stays recoverable when demotion happens.
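In code, "projection, not move" comes down to one missing `delete`. The sketch below assumes a two-tier store keyed by fact ID; `TwoTiers`, `promoteFact`, and `demoteFact` are illustrative names, not part of any framework.

```typescript
// Hypothetical two-tier store keyed by fact ID.
interface Fact { id: string; text: string; }
interface TwoTiers {
  warm: Record<string, Fact>;
  core: Record<string, Fact>;
}

function lookup(tier: Record<string, Fact>, id: string): Fact | undefined {
  return Object.prototype.hasOwnProperty.call(tier, id) ? tier[id] : undefined;
}

// Promotion copies the fact into core. The warm-tier record is left in place.
function promoteFact(tiers: TwoTiers, id: string): boolean {
  const source = lookup(tiers.warm, id);
  if (!source) return false;
  tiers.core[id] = { id: source.id, text: source.text };
  return true;
}

// Demotion writes the core copy back to warm (it may have been edited while
// hot) and only then unpins it, so the fact never exists in zero tiers.
function demoteFact(tiers: TwoTiers, id: string): boolean {
  const hot = lookup(tiers.core, id);
  if (!hot) return false;
  tiers.warm[id] = { id: hot.id, text: hot.text };
  delete tiers.core[id];
  return true;
}
```

The write-back in `demoteFact` is a design choice layered on top of the paragraph's rule: if the core copy is never edited in place, the write is redundant but harmless.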

The agent-can’t-find-its-own-memory failure mode. Letta and similar systems give the agent explicit tools (recall_search, archival_search) and explicit instructions on when to use them. Without those instructions, the agent doesn’t know it has a warm or cold tier and behaves as if its context window is all there is. Symptom: the agent confidently makes up an answer when the real one sits unqueried in the archival store. The fix is in the prompt: explicitly enumerate the tools, describe what each tier holds, and (critically) include a sentence like “If you’re uncertain about a fact, check archival_search before answering.” Without that line, the agent will hallucinate over its own memory.

The cross-tier coherence problem. The same fact lives in multiple tiers (the user’s name is in the core block and in the recall log and in archival). When the user updates the fact (“actually, call me Jay”), all three copies need to be updated, or the next archival_search will return the stale version and the model gets confused. The defensible pattern is write-through with versioning — every write updates all copies with the same version stamp, and reads prefer the highest version. The alternative (write-back with eventual consistency) leads to the same kinds of bugs distributed systems have spent decades fighting.
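Write-through with versioning is small enough to sketch. The shapes here (`TieredStore`, `writeThrough`, `readLatest`) are hypothetical, and the sketch adds one assumption beyond the prose: the coldest tier is treated as the system of record and always receives the write, even if it held no copy before.

```typescript
// Hypothetical store: each tier maps a fact key to a versioned value.
interface VersionedFact { value: string; version: number; }
type Tier = Record<string, VersionedFact>;
interface TieredStore {
  tiers: Tier[];       // hottest first, e.g. [core, recall, archival]
  nextVersion: number; // monotonically increasing version stamp
}

function holds(tier: Tier, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(tier, key);
}

// Update every tier that already holds the key, plus the coldest tier
// (assumed system of record), all with the same version stamp.
function writeThrough(store: TieredStore, key: string, value: string): number {
  const version = store.nextVersion;
  store.nextVersion += 1;
  store.tiers.forEach((tier, index) => {
    const isColdest = index === store.tiers.length - 1;
    if (isColdest || holds(tier, key)) tier[key] = { value, version };
  });
  return version;
}

// Read across tiers and prefer the highest version, so a stale copy left
// behind by a failed or partial write cannot win.
function readLatest(store: TieredStore, key: string): VersionedFact | null {
  let best: VersionedFact | null = null;
  for (const tier of store.tiers) {
    const copy = holds(tier, key) ? tier[key] : undefined;
    if (copy && (best === null || copy.version > best.version)) best = copy;
  }
  return best;
}
```

The version stamp is what makes the read side forgiving: even when one tier misses an update, `readLatest` still returns "Jay" rather than whichever copy it happened to hit first.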

The lost-in-the-middle leakage at the tier boundary. When the harness pre-pages warm-tier entries into the core block, it has to decide where in the prompt they go. The hot blocks are typically at the top of the system prompt (primacy bias); the pre-paged entries can either join them (privileging the new content) or sit at the bottom of the user prompt (privileging the user’s turn). The middle of the system prompt is the worst place — that’s exactly where the lost-in-the-middle effect bites. Pre-paged content goes at the bottom of the system prompt, just above the user turn, or right inside the user turn as a “## Relevant context” block.
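The two acceptable placements can be expressed as a small pure function. This is a sketch under assumptions: `placePrePaged` and `Placement` are made-up names, the system prompt is treated as an already-rendered string, and in the user-turn variant the context block is put before the user's message so the user's own words stay last.

```typescript
// Hypothetical helper: decides where pre-paged warm-tier entries land.
type Placement = "system-bottom" | "user-turn";
interface PromptParts { system: string; user: string; }

function renderContext(entries: string[]): string {
  return "## Relevant context\n" + entries.map((e) => "- " + e).join("\n");
}

function placePrePaged(
  systemPrompt: string,
  userTurn: string,
  entries: string[],
  placement: Placement,
): PromptParts {
  if (entries.length === 0) return { system: systemPrompt, user: userTurn };
  const context = renderContext(entries);
  if (placement === "system-bottom") {
    // Bottom of the system prompt, directly above the user turn.
    return { system: systemPrompt + "\n\n" + context, user: userTurn };
  }
  // Inside the user turn; the user's message stays at the very end.
  return { system: systemPrompt, user: context + "\n\n" + userTurn };
}
```

Neither branch ever inserts into the middle of `systemPrompt`, which is the whole point: the function has no code path that can put pre-paged content where attention is weakest.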

The cache-invalidation problem at the boundary. Hot-tier content changes (the agent updates a fact); the rendered system prompt changes; the prompt cache invalidates. If you’ve cached a long system prompt and you’re updating the human block on every turn, you’re paying the cache-miss cost on every turn. The mitigation is to put dynamic core blocks after the static portions of the system prompt and after any cache breakpoints — the hot tier is naturally cache-cold, but the persona and instructions can be cache-warm if they precede the dynamic block. The working-memory-scratchpads piece flagged the same issue; the cache-aware-pattern is the same here.
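A sketch of the ordering rule, assuming the system prompt is assembled from static sections and labeled core blocks. `renderSystemPrompt` and its types are hypothetical, and how you mark the cache breakpoint at the end of `cacheablePrefix` depends on your provider, so it is not shown.

```typescript
// Hypothetical renderer that keeps static text ahead of dynamic core blocks.
interface CoreBlockText { label: string; value: string; }
interface RenderedSystemPrompt {
  cacheablePrefix: string; // persona + instructions; identical across turns
  dynamicSuffix: string;   // core blocks; expected to change between turns
}

function renderSystemPrompt(
  staticSections: string[],
  coreBlocks: CoreBlockText[],
): RenderedSystemPrompt {
  const cacheablePrefix = staticSections.join("\n\n");
  const dynamicSuffix = coreBlocks
    .map((b) => "<" + b.label + ">\n" + b.value + "\n</" + b.label + ">")
    .join("\n\n");
  return { cacheablePrefix, dynamicSuffix };
}

// The full prompt is always prefix first, suffix second.
function joinPrompt(rendered: RenderedSystemPrompt): string {
  return rendered.cacheablePrefix + "\n\n" + rendered.dynamicSuffix;
}
```

Because the two halves are returned separately, a test can assert that updating the human block leaves `cacheablePrefix` byte-for-byte identical, which is the property a prefix cache needs.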

The sleep-time-compute is not free. Letta’s sleep-time agents and similar background-consolidation patterns run model calls between user turns to deduplicate, consolidate, and reorganize the archival tier. This works beautifully when the agent has idle time between turns and breaks when it doesn’t — a high-throughput multi-tenant system that never goes idle will queue sleep-time tasks indefinitely. Plan the sleep-time-compute budget the same way you’d plan a background-GC budget for a JVM: it has to fit in the available headroom, or it competes with user-facing latency for the same compute.

The framework-vs-roll-your-own decision. Letta gives you the three tiers, the tool surface, the sleep-time agents, and the multi-tenant scoping. The cost is opinionatedness: the framework’s defaults (block names, demotion thresholds, sleep-time triggers) may not fit your workload. Mem0 with the graph extension is simpler to bolt onto an existing vector store but doesn’t ship the hot-tier-as-first-class-object idea. Hand-rolling on LangGraph stores gives you maximum control and the maximum surface area for bugs. The forthcoming production memory frameworks article will work the comparison matrix; today’s heuristic: use Letta when the hierarchy is the defining feature of your agent, use Mem0 when the vector store is the defining feature and the hierarchy is secondary, and hand-roll when neither framework’s defaults match your workload.

Further reading

  • MemGPT: Towards LLMs as Operating Systems (Packer, Wooders, Lin, Fang, Patil, Stoica, Gonzalez, 2023) — the paper that made the OS-paging analogy concrete and built the working implementation. §3 (main vs external context) and §4 (function-call-based memory management) are the load-bearing sections; §5 (multi-session chat evaluation) is the empirical case for the design.
  • Letta — Understanding memory management — the production reference for the three-tier (core/recall/archival) architecture. The cleanest docs for how a real agent runtime exposes hierarchical memory as a first-class API surface, including the tool-call protocol the agent uses to self-manage tiers.
  • Letta — Sleep-time agents — the background-consolidation pattern that turns the three-tier model from a per-turn read-write hierarchy into a continuously-improving memory system. The closest published reference to “what happens at the boundary between the warm and cold tiers when the agent is idle.”
  • Cognitive Architectures for Language Agents (Sumers, Yao, Narasimhan, Griffiths, 2023) — the CoALA paper. §3 (memory) is the conceptual ground for the working/episodic/semantic/procedural split that the hierarchy organizes; §4 (decision procedures) is where the read-time selection across tiers gets framed in the broader agent-architecture context.
  • Long-Term Memory: Vector-Backed Episodic Storage — the substrate the warm tier sits on top of. The episodic store is the underlying physical layer; hierarchical memory is the access policy that decides when to read from it and what to promote into context.
  • Knowledge Graphs as Structured Memory — the structural-index counterpart to the warm tier. In a graph-augmented hierarchical-memory system, the graph typically sits at the cold tier as a queryable knowledge base; the agent traverses it via tool calls the same way it reads archival_search.
  • The Cognitive Taxonomy: Semantic, Episodic, Procedural — the upstream framing. The four cognitive types are about what kind of information; hierarchical memory is about where in the cost hierarchy the information lives. The two framings are orthogonal and both are load-bearing.
  • Memory Write Policies: What’s Worth Remembering — the upstream write-axis piece. Hierarchical memory decides where an admitted memory lands; the write policy decides whether and in what shape the memory enters the system at all. Stage 4 (persist) of the write pipeline is exactly where the hot/warm/cold tier choice gets made.