jatin.blog ~ $
$ cat ai-engineering/memory-retrieval-policies.md

Memory Retrieval Policies: Recency, Relevance, Importance

Memory retrieval policies: the recency-relevance-importance rerank, exponential decay, read-time boosts, and the LRU/LFU/ARC cache parallel for agents.

Jatin Bansal@blog:~/ai-engineering$ open memory-retrieval-policies

A customer-support agent has a year of history on every user. The write policy is doing its job; the episode boundaries are crisp; reflections are firing on accumulated importance; the sleep-time consolidator keeps the store tidy. The store is healthy. The retrieval still feels off. The user asks “what did I tell you about my refund policy preferences?” and the top-K hits are a textually similar discussion of someone else’s refund policy from six months ago — a thread that has zero relevance to the current question but happens to share half a dozen keywords. The right episode is in the store; it’s just sitting at rank 12, buried under cosine-similar noise. The substrate is fine. The retrieval policy is broken. This is the layer where every memory subsystem either earns its keep or quietly turns into expensive cosine-only RAG against a temporally-confused corpus. Recency, relevance, and importance — and the read-time mechanics that combine them — are the topic. This article is the deep dive on the read path.

Opening bridge

Yesterday’s piece closed the maintenance axis of the memory subsystem: the sleep-time tier that runs reflection, compression, and embedding-regeneration in the background so the hot path doesn’t pay the latency. Everything before it — the write policy, segmentation, salience scoring, reflection, compression — has worked on the write side of the memory subsystem: what enters the store, in what shape, how it’s enriched after it lands. Today’s article finally turns to the read side. Every previous piece in the subtree has named the Generative Agents α·recency + β·importance + γ·similarity formula in passing and deferred the deep dive. This is the deep dive. The frame the rest of the article works from: the read path is a cache-replacement policy at heart — every retrieval is choosing which episodes to surface and which to leave behind, and the algorithm families that govern CPU cache replacement (LRU, LFU, ARC) have honest analogues in agent memory, with the same trade-offs and the same failure modes.

Definition

A memory retrieval policy is the algorithm that, given a query and an episodic store, produces a ranked list of episodes ordered by their expected usefulness to the current call. Five operations every policy has to ground out, even if it ducks any of them by default. First, recall — fetch a coarse candidate set from the store, typically by vector similarity over the query embedding. Second, score — combine multiple signals (recency, relevance, importance, conditional task fit) into a single scalar per candidate. Third, rerank — sort by the combined score and trim to a small final budget that fits the prompt’s context budget. Fourth, update — write back any read-driven state changes (last-read timestamps, retrieval counts, importance boosts). Fifth, budget — bound the work so a single retrieval doesn’t spend more cycles than the model call it’s serving.

Three properties separate a policy from “we cosine-similarity-top-K-ed it.” First, it is multi-signal — at least recency, importance, and similarity, often more (recurrence count, source-tier weight, task tag); a pure cosine score is not a policy, it is a baseline. Second, it is normalized — each signal is rescaled to a comparable range (typically [0, 1]) before combination, so a tight similarity-score distribution can’t drown out a wide importance-score distribution. Third, it is budget-aware — every component (the recall K, the rerank cost, any cross-encoder pass) is bounded, so the retrieval can serve a sub-second hot-path call without growing unboundedly with the store size.

What a retrieval policy is not. It is not chunking (chunking is upstream, the unit-of-storage decision). It is not the admission gate (admission is also upstream, the unit-of-write decision). It is not reranking in the RAG sense (those rerankers are cross-encoders for static-corpus search; memory rerank is multi-signal blending for a temporally-active store). It is not hierarchical paging (paging decides which tier to read; the retrieval policy operates within a tier once the read lands there). All four interact, and a defensible memory subsystem has explicit answers to all of them, but the retrieval policy is the layer this article scopes to: the read-time scoring function that turns a healthy store into a useful one.

Intuition

The mental model that pays off is cache replacement applied to an agent’s working set. The episodic store is the L3 (cold storage); the top-K returned from a retrieval is the L1 (the bytes that actually land in the prompt). Every retrieval is choosing which subset of L3 to promote into L1 for the next call. The question “which K should we promote?” is the same shape as “which K cache lines do we keep on a cache miss?” — and the answer is, of course, the K most likely to be useful. The cache literature has spent fifty years working out what “most likely to be useful” means in the presence of three competing signals: how recently was this last accessed (recency, the LRU answer), how often is it accessed (frequency, the LFU answer), and how important is it to the current workload (priority, the priority-cache answer). All three signals are present in agent memory and all three have well-understood failure modes.

Concretely, the Generative Agents formulation is the canonical port:

text
1
score(episode | query) = α · recency(episode) + β · importance(episode) + γ · similarity(query, episode)

with each term normalized to [0, 1] and the weights typically equal (α = β = γ = 1) as the published baseline. The three terms map cleanly onto the three cache signals: recency is LRU, importance is the priority field, and similarity (cosine distance to the query embedding) is the workload-fit term that a static cache doesn’t have but a query-driven memory does. The shape is also the shape of hybrid search — multiple signals combined with linear weights — and the read-time reconciliation problem is recognizably the same: how do you keep one signal from drowning out the others?

Two design questions force themselves on every implementation. The first is how is each signal normalized? Cosine similarity for a strong embedding model lives in a narrow band (most hits cluster between 0.5 and 0.8); raw importance scores from a salience prompt are integer-valued in [1, 10]; recency, even after exponential decay, can collapse to near-zero for any episode more than a few weeks old. Combining the three signals without normalization lets whichever has the widest dynamic range dominate the others, and the score becomes effectively single-signal. The second is how does the policy update read-driven state? Every successful retrieval changes the store — the episode’s last-read timestamp advances, its retrieval count increments, sometimes its importance is boosted. Without these updates the recency term decays monotonically and a useful episode falls out of the top-K after a few weeks; with them, the rerank becomes adaptive to actual usage patterns.

The distributed-systems parallel — LRU, LFU, and the ARC compromise

Three honest parallels, each load-bearing.

LRU is the recency term. The least-recently-used cache replacement policy evicts the cache line that hasn’t been touched the longest. It’s the dominant default in OS page caches, database buffer pools, and CDN edges for a reason — temporal locality is the strongest single signal in most workloads. The agent-memory port is direct: the recency term in the retrieval score is the LRU signal applied to episode replacement, scoring episodes by how recently they were last accessed rather than when they were originally written. The same caveat from OS theory applies: pure LRU is fragile to one-shot scans — a sequential read of cold data evicts the entire warm working set, and the cache thrashes until the workload returns to normal. The agent equivalent: a user opens a long-old session, the recency term refreshes every episode in that session, the top-K is now dominated by week-old episodes, and unrelated current queries retrieve poorly. The mitigation in OS caches is LRU/K or LRU-2 — only count an access if it’s the second access within a recent window, so a one-shot scan doesn’t refresh the cache. The agent equivalent is the recurrence-count boost in MemoryBank and A-MEM — refresh the recency only after the second retrieval, not the first.

LFU is the importance and recurrence terms combined. Least-frequently-used replacement evicts the cache line accessed the fewest times. It is the right answer for workloads with strong frequency signals (a few hot pages used constantly, many cold pages used rarely) and the wrong answer for workloads with strong recency signals (last week’s hot pages displace this week’s even though the workload has shifted). The agent-memory port: the importance term and the retrieval-count boost together produce an LFU-like ranking — episodes that have proved useful many times accumulate weight; episodes used once decay. Pure LFU has the same fragility: a fact that was very important a year ago but is now stale (the user moved cities, changed jobs, switched stacks) keeps its weight because nothing in the policy demotes it. The mitigation in cache theory is aging — divide accumulated frequency by elapsed time so old hot entries decay gracefully. The agent equivalent is the recency × importance product the Generative Agents formula computes — importance times recency, not importance plus recency, in the variant some production systems use.

ARC is the compromise that 2026 production frameworks have converged on. Megiddo and Modha’s Adaptive Replacement Cache (USENIX FAST 2003) is the algorithm that ports the LRU/LFU trade-off into a single self-tuning policy. ARC keeps four lists — recently accessed (T1), frequently accessed (T2), and ghost lists of recently evicted recent (B1) and frequently (B2) entries. Hits on the ghost lists signal whether the workload’s recency-vs-frequency balance is shifting, and ARC re-weights the LRU and LFU components accordingly. The agent-memory port shows up in the weights of the Generative Agents formula: a static α = β = γ = 1 is a fixed LRU/LFU balance; a workload-aware policy tunes the weights based on observed retrieval quality, exactly the way ARC tunes its split. Production frameworks rarely ship a full ARC implementation, but the idea — weights are workload-specific and a single static setting is wrong for diverse users — is what motivates the per-user weight tuning patterns that Mem0 and Letta both expose.

The deeper observation: every retrieval-policy choice you make is a position on the LRU-vs-LFU spectrum, with similarity as a third axis the classical cache literature doesn’t have. Knowing where on the spectrum you sit makes the failure modes legible. A pure-cosine retrieval ignores both LRU and LFU; a recency-only policy is pure LRU and fragile to scans; an importance-only policy is pure LFU and fragile to staleness; the equal-weights baseline is a uniform LRU/LFU compromise that beats every single-signal policy on every published memory benchmark — and a thoughtfully-tuned policy beats the equal-weights baseline.

The Generative Agents formula in detail

The formulation from Park et al. (2023), §A.1, is the canonical reference, and knowing each piece in detail is what makes the implementations downstream legible.

Recency. Exponential decay over the time since the episode was last retrieved (not just last written). The paper uses a decay factor of 0.995 per sandbox hour, which sounds modest until you compound it: after 24 sandbox hours an episode’s recency term is 0.995^24 ≈ 0.886; after a week (168 hours) it is 0.995^168 ≈ 0.43; after a month it is 0.995^720 ≈ 0.027. The choice of decay constant is the load-bearing knob. Too aggressive (0.99 per hour) and any episode older than a few days drops out of the top-K regardless of importance; too gentle (0.999 per hour) and recency becomes effectively constant and the policy collapses to importance-plus-similarity. A defensible default for real-world wall-clock time is a half-life in the 7-to-30-day range, set explicitly — recency = exp(-Δt / τ) with τ chosen so that recency(τ) = 0.5.

Importance. The anchored 1-10 salience score, normalized to [0, 1] by dividing by 10. The dependency on the salience scorer producing a distributed score (and not the all-7s collapse) is what makes this term informative; the salience-collapse failure mode named in the segmentation piece is also the most common way the retrieval-side rerank silently regresses to similarity-plus-recency.

Similarity. Cosine similarity between the query embedding and the episode embedding, no normalization needed since cosine is already in [-1, 1] and effectively in [0, 1] for reasonable embedding models. The choice of embedding model matters here in the same way it matters everywhere — the text-embeddings article named the trade-offs; the relevant gotcha for retrieval policies specifically is that some embedding models have very tight cosine distributions (everything between 0.7 and 0.9 for any reasonable query/document pair), which silently squashes the similarity term’s dynamic range. Mitigation: rescale the observed cosine band to [0, 1] by min-max normalization over the recalled candidates, not over the global range.

Combination. Linear weighted sum: α·recency + β·importance + γ·similarity. The paper’s default is equal weights and the empirical result is that equal weights outperform any single-signal baseline by a wide margin. Workload-specific tuning matters at the margins — a customer-support agent where importance is the strongest signal benefits from β > α, γ; a coding agent where similarity-to-current-code matters most benefits from γ > α, β; a personal-assistant agent in long conversation where the most-recent preferences supersede older ones benefits from α > β, γ. But the variance from tuning is dwarfed by the variance between equal-weights and any single-signal baseline — get to equal-weights first, tune second.

Top-K selection. After computing the combined score for every candidate from the coarse recall (typically K_recall = 20-50), sort and take the top N (typically N_final = 3-10). The 4-to-1 oversampling ratio between recall and final is the standard pattern from the reranking article — recall has to be wide enough that the right episode is in the candidate set, even if it’s not the highest cosine hit; the rerank uses the additional signals to find it.

Read-driven state updates — the LRU/LFU hybrid in action

Every successful retrieval should write back to the store. Three updates worth doing on every retrieval:

Update last_read_at. Each top-N episode gets its last_read_at advanced to the current time, which refreshes its recency term for the next retrieval. This is the LRU update, and it’s what keeps useful episodes from decaying out of the top-K just because they were written long ago. An episode written six months ago but read fifty times in the last week should have a recency term of nearly 1, not nearly 0 — the last_read_at update is what makes that the case.

Increment retrieval_count. A monotonic counter on each episode, incremented on every successful retrieval. Optionally folded into the importance term as importance' = importance + δ · log(1 + retrieval_count) (logarithmic to avoid letting a few highly-retrieved episodes dominate the entire policy). This is the LFU contribution — frequency-of-use is a meaningful signal that the static write-time importance doesn’t capture.

Conditional importance boost. When an episode is retrieved and the agent’s subsequent answer cites it (or simply succeeds against a downstream evaluation), boost the importance. When an episode is retrieved and the answer goes badly, optionally demote. This is the read-driven salience update MemoryBank introduced — the Ebbinghaus-curve-inspired strengthening of memories that prove useful. It is the closest agent analogue to the brain’s reactivation-strengthens-memory consolidation pattern, and the closest cache analogue to the priority boost in priority-aware cache replacement.

The discipline that separates a useful read-driven update from a counterproductive one is bounded magnitude. A 5% boost per retrieval keeps the policy responsive to actual usage; a 50% boost per retrieval makes the importance term swing wildly on every read and the rerank becomes unstable. The Generative Agents paper uses a 0.995 per-hour decay on the recency term and effectively a constant 1.0 per-retrieval refresh on last_read_at; MemoryBank uses an Ebbinghaus-shaped S = S_0 · exp(-t/η) where η grows with retrieval count, so each retrieval extends the half-life rather than ratcheting up a separate score. Both are defensible; the wrong answer is to compound multiple unbounded boosts and watch the importance term diverge.

Hybrid retrieval — what Mem0 and Letta actually ship

The 2026 production retrieval stacks have converged on a hybrid model that goes beyond the three-signal Generative Agents formula in two ways.

Multi-signal recall, not just multi-signal rerank. The recall stage itself runs multiple retrieval modes in parallel and merges the results. Mem0’s three-channel recall — semantic similarity over embeddings, BM25 keyword match, entity-graph traversal — pulls candidates from each channel, merges them via reciprocal rank fusion, and then runs the multi-signal rerank over the merged set. The win is reach — semantic similarity alone misses episodes that share concepts but not vocabulary; keyword match alone misses episodes that use synonyms; graph traversal alone misses anything not explicitly entity-tagged. The combined channel covers all three failure modes, and the win is measurable: Mem0’s reported numbers attribute a non-trivial share of their LOCOMO benchmark lift to the channel fusion specifically, on top of the rerank.

Tier-aware scoring. A retrieval against a hierarchical memory store doesn’t just retrieve from the warm tier — it considers what’s already in the hot tier (the always-resident core block), avoids returning duplicates of pinned content, and assigns different score weights to candidates from different tiers. A reflection-tier hit (a higher-order belief derived from many episodes) typically gets a higher importance weight than a raw-episode hit; a structured-fact-tier hit gets a higher similarity weight (because the structured fact’s text is denser per-token). The shape is the same scoring formula with tier-conditional weights, and it’s the layer that makes “we retrieve from all tiers in one call” cohere.

The pattern that wins in production is channel fusion at recall, tier-aware multi-signal rerank at scoring, read-driven state updates on write-back. Each piece is decoupled enough to instrument and tune independently; the combined system gracefully degrades when any one component fails.

Code: Python — a normalize-and-blend retrieval pass

The smallest interesting build: a retrieval policy that runs cosine recall, normalizes the three signals, blends them with tunable weights, updates the read-driven state, and returns the top-N. Uses Chroma as the substrate. Install: pip install chromadb numpy.

python
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
# pip install chromadb numpy
import math
import time
import uuid
from dataclasses import dataclass

import chromadb
import numpy as np

chroma = chromadb.PersistentClient(path="./memory_store")
episodes = chroma.get_or_create_collection("episodes")


@dataclass
class RetrievalPolicy:
    # Half-life of the recency term in seconds (default ~14 days).
    recency_half_life_s: float = 14 * 86_400
    # Linear weights on the three signals. Equal-weights is the Park et al. default.
    alpha_recency: float = 1.0
    beta_importance: float = 1.0
    gamma_similarity: float = 1.0
    # Logarithmic retrieval-count boost on importance. 0 disables LFU.
    delta_retrieval_count: float = 0.05
    # Recall/final fan-out.
    recall_k: int = 20
    final_n: int = 5
    # Minimum gap between top-K refreshes of `last_read_at`. Without this every
    # retrieval flattens the recency term across the entire candidate set.
    last_read_refresh_floor_s: float = 60.0


def _decay(age_s: float, half_life_s: float) -> float:
    """Exponential decay normalized so that decay(half_life) == 0.5."""
    return math.pow(0.5, age_s / half_life_s)


def _minmax(xs: list[float]) -> list[float]:
    """Min-max normalize a list of scores to [0, 1]. Robust to constant vectors."""
    if not xs:
        return []
    lo, hi = min(xs), max(xs)
    if hi - lo < 1e-9:
        return [0.5 for _ in xs]
    return [(x - lo) / (hi - lo) for x in xs]


def retrieve(
    user: str,
    query: str,
    policy: RetrievalPolicy | None = None,
) -> list[dict]:
    """Two-stage retrieval: cosine recall, then multi-signal rerank with read-driven update."""
    policy = policy or RetrievalPolicy()
    now = time.time()

    # ----- 1. Coarse recall (cosine similarity, tenant-scoped) -----
    hits = episodes.query(
        query_texts=[query],
        n_results=policy.recall_k,
        where={"user": user},
    )
    if not hits["ids"][0]:
        return []

    ids = hits["ids"][0]
    docs = hits["documents"][0]
    metas = hits["metadatas"][0]
    distances = hits["distances"][0]

    # ----- 2. Compute each signal -----
    # Recency: exponential decay since last read.
    age_seconds = [now - (m.get("last_read", m.get("ts", now))) for m in metas]
    recency_raw = [
        _decay(age, policy.recency_half_life_s) for age in age_seconds
    ]

    # Importance: normalized salience (already in [0, 1] from write path).
    # LFU boost via log(1 + retrieval_count).
    importance_raw = [
        m.get("importance", 0.5)
        + policy.delta_retrieval_count
        * math.log1p(m.get("retrieval_count", 0))
        for m in metas
    ]

    # Similarity: convert Chroma L2 distance to a [0, 1] proxy, then min-max
    # across the candidate set so a tight cosine band doesn't squash the term.
    similarity_raw = [1.0 / (1.0 + d) for d in distances]

    # ----- 3. Per-candidate normalization (critical for blending) -----
    recency = _minmax(recency_raw)
    importance = _minmax(importance_raw)
    similarity = _minmax(similarity_raw)

    # ----- 4. Linear blend -----
    scored: list[tuple[float, str, str, dict]] = []
    for i, (eid, doc, meta) in enumerate(zip(ids, docs, metas)):
        score = (
            policy.alpha_recency * recency[i]
            + policy.beta_importance * importance[i]
            + policy.gamma_similarity * similarity[i]
        )
        scored.append((score, eid, doc, meta))
    scored.sort(reverse=True, key=lambda t: t[0])

    top = scored[: policy.final_n]

    # ----- 5. Read-driven state update (LRU/LFU/Ebbinghaus contributions) -----
    updates_ids: list[str] = []
    updates_metas: list[dict] = []
    for _, eid, _, meta in top:
        last_read = meta.get("last_read", 0.0)
        # Refresh-floor: don't bump last_read more often than the floor allows.
        # Prevents a hot query loop from flattening the recency curve.
        if (now - last_read) < policy.last_read_refresh_floor_s:
            continue
        meta = dict(meta)
        meta["last_read"] = now
        meta["retrieval_count"] = int(meta.get("retrieval_count", 0)) + 1
        updates_ids.append(eid)
        updates_metas.append(meta)
    if updates_ids:
        episodes.update(ids=updates_ids, metadatas=updates_metas)

    return [
        {
            "id": eid,
            "text": doc,
            "score": score,
            "meta": meta,
        }
        for score, eid, doc, meta in top
    ]


# ----- write-side helper (just for the demo) -----
def write_episode(
    user: str, text: str, importance: float = 0.5, type_: str = "event"
) -> str:
    eid = str(uuid.uuid4())
    now = time.time()
    episodes.add(
        ids=[eid],
        documents=[text],
        metadatas=[
            {
                "user": user,
                "type": type_,
                "importance": importance,
                "ts": now,
                "last_read": now,
                "retrieval_count": 0,
            }
        ],
    )
    return eid


if __name__ == "__main__":
    # Seed three episodes with different importance + age profiles.
    write_episode(
        "alice", "User strongly prefers Postgres for any relational workload.", 0.9
    )
    # Older but mundane:
    eid_old = write_episode(
        "alice", "User asked what time it was in UTC.", 0.2
    )
    # Force the "old" episode to look stale:
    episodes.update(
        ids=[eid_old],
        metadatas=[
            {
                "user": "alice",
                "type": "event",
                "importance": 0.2,
                "ts": time.time() - 30 * 86_400,
                "last_read": time.time() - 30 * 86_400,
                "retrieval_count": 0,
            }
        ],
    )
    write_episode(
        "alice", "User mentioned they are deploying to Cloudflare Pages.", 0.7
    )

    # Equal-weights retrieve against a Postgres-flavored query.
    results = retrieve("alice", "What database does the user like?")
    for r in results:
        print(f"score={r['score']:.3f}  imp={r['meta']['importance']:.2f}  text={r['text']}")

Four things to notice. First, per-candidate min-max normalization is what makes the three signals comparable — without it, a tight cosine band (everything in [0.6, 0.8]) would have far less dynamic range than the recency term (which can span the full [0, 1]), and the policy would collapse to recency-plus-importance. Second, the last_read_refresh_floor_s parameter prevents the hot-query thrash — without it, a loop that re-runs the same retrieval every second would refresh every top-N episode’s recency on every iteration, and the recency term would become constant across the entire candidate set after a few iterations. Third, the delta_retrieval_count knob is the LFU contribution and defaults small (0.05) — large values turn the policy into pure LFU and lock in old hot episodes; the logarithmic shape ensures even highly-retrieved episodes don’t dominate. Fourth, the policy is a dataclass because every parameter wants to be tunable per user or per workload; hardcoding the weights or the half-life into the function body is the most common reason production retrieval systems can’t be calibrated against a benchmark.

Code: TypeScript — same shape against LangGraph stores

The TypeScript port uses LangGraph’s store API and runs the same blend-and-normalize logic. Install: pnpm add @langchain/langgraph @langchain/openai zod.

typescript
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
// pnpm add @langchain/langgraph @langchain/openai zod
import { InMemoryStore } from "@langchain/langgraph";
import { OpenAIEmbeddings } from "@langchain/openai";
import { randomUUID } from "node:crypto";

const store = new InMemoryStore({
  index: {
    dims: 1536,
    embed: new OpenAIEmbeddings({ model: "text-embedding-3-small" }),
  },
});

type Episode = {
  text: string;
  type: "preference" | "fact" | "event";
  importance: number; // [0, 1]
  ts: number;
  lastRead: number;
  retrievalCount: number;
};

type Policy = {
  recencyHalfLifeMs: number;
  alpha: number;
  beta: number;
  gamma: number;
  delta: number;
  recallK: number;
  finalN: number;
  refreshFloorMs: number;
};

const defaultPolicy: Policy = {
  recencyHalfLifeMs: 14 * 86_400 * 1000,
  alpha: 1,
  beta: 1,
  gamma: 1,
  delta: 0.05,
  recallK: 20,
  finalN: 5,
  refreshFloorMs: 60_000,
};

const decay = (ageMs: number, halfLifeMs: number) =>
  Math.pow(0.5, ageMs / halfLifeMs);

const minmax = (xs: number[]): number[] => {
  if (xs.length === 0) return [];
  const lo = Math.min(...xs);
  const hi = Math.max(...xs);
  if (hi - lo < 1e-9) return xs.map(() => 0.5);
  return xs.map((x) => (x - lo) / (hi - lo));
};

export async function writeEpisode(
  user: string,
  text: string,
  importance = 0.5,
  type: Episode["type"] = "event",
): Promise<string> {
  const id = randomUUID();
  const now = Date.now();
  const ep: Episode = {
    text,
    type,
    importance,
    ts: now,
    lastRead: now,
    retrievalCount: 0,
  };
  await store.put(["episodes", user], id, ep);
  return id;
}

export async function retrieve(
  user: string,
  query: string,
  policy: Policy = defaultPolicy,
): Promise<Array<{ id: string; text: string; score: number; episode: Episode }>> {
  const now = Date.now();
  const hits = await store.search(["episodes", user], {
    query,
    limit: policy.recallK,
  });
  if (hits.length === 0) return [];

  // Signal 1 — recency (exponential decay since last read).
  const recencyRaw = hits.map((h) => {
    const ep = h.value as Episode;
    return decay(now - ep.lastRead, policy.recencyHalfLifeMs);
  });

  // Signal 2 — importance plus log-LFU boost on retrieval count.
  const importanceRaw = hits.map((h) => {
    const ep = h.value as Episode;
    return ep.importance + policy.delta * Math.log1p(ep.retrievalCount);
  });

  // Signal 3 — similarity from the store's score field, already roughly [0,1].
  const similarityRaw = hits.map((h) => h.score ?? 0);

  // Per-candidate normalization so the three signals are comparable.
  const recency = minmax(recencyRaw);
  const importance = minmax(importanceRaw);
  const similarity = minmax(similarityRaw);

  const scored = hits.map((h, i) => {
    const ep = h.value as Episode;
    const score =
      policy.alpha * recency[i] +
      policy.beta * importance[i] +
      policy.gamma * similarity[i];
    return { id: h.key, text: ep.text, score, episode: ep };
  });
  scored.sort((a, b) => b.score - a.score);
  const top = scored.slice(0, policy.finalN);

  // Read-driven update with refresh floor to avoid hot-query thrash.
  await Promise.all(
    top.map(async ({ id, episode }) => {
      if (now - episode.lastRead < policy.refreshFloorMs) return;
      await store.put(["episodes", user], id, {
        ...episode,
        lastRead: now,
        retrievalCount: episode.retrievalCount + 1,
      });
    }),
  );

  return top;
}

The architectural shape is identical: coarse recall via the store’s semantic search, per-candidate min-max normalization across the three signals, linear blend with tunable weights, refresh-floor-gated read-driven update. The TypeScript version uses Math.log1p for the LFU boost (numerically stable for retrieval counts of zero, which would otherwise inflate to log(0)); the Python version’s math.log1p does the same. Both implementations are <150 LoC because the retrieval policy is fundamentally a small, dense piece of logic — the complexity lives in tuning the parameters, not in implementing the formula.

Calibrating the weights — what changes when you tune

The default α = β = γ = 1 is the published Generative Agents baseline and a defensible starting point. The cases where tuning pays off are predictable and worth knowing:

Personal-assistant agents with strong preference drift. A user who changes preferences over time (moved cities, changed jobs, switched tech stacks) benefits from α > β, γ — the recency term needs more weight so superseded preferences fall out of the top-K. The Generative Agents simulated-day setup has this property and the paper’s measured retrievals reflect it. A defensible setting is α = 1.5, β = 1.0, γ = 1.0 with a 7-day half-life.

Customer-support agents with stable customer history. A user whose preferences are mostly fixed (the support contract, the product configuration, the prior ticket history) benefits from β > α, γ — the importance-weighted facts (the SLA tier, the technical-contact identity) need to surface even when the current query doesn’t textually match them. A defensible setting is α = 1.0, β = 1.5, γ = 1.0 with a longer (30-day) half-life.

Coding agents with codebase context. A coding agent answering questions against the current codebase benefits from γ > α, β — the similarity to the current code matters more than how recently the agent last looked at it; a similar-pattern episode from six months ago is often more useful than a textually distant episode from yesterday. A defensible setting is α = 0.5, β = 1.0, γ = 1.5 with a long half-life.

Long-running planning agents. An agent decomposing a multi-step plan benefits from a higher anti-recency boost on episodes from prior planning sessions on the same task — the relevant content is what the agent decided in the prior session, even if that was a week ago and other content has been written since. The mechanism is to filter the recall by task-tag before the rerank, then run equal-weights on the filtered candidates.

The general discipline: the workload’s temporal structure (drift over time vs stability) determines the recency weight; the workload’s importance distribution (a few critical facts vs uniformly mundane history) determines the importance weight; the workload’s retrieval-query distribution (lexically rich vs semantically tight) determines the similarity weight. Tuning blind without measuring against a benchmark almost never improves the policy; tuning against a held-out retrieval set from the actual workload almost always does. Treat the weights the same way you’d treat any hyperparameter — set them empirically.

Anti-recency, anti-importance, and the staleness gates

Two non-obvious refinements show up in production retrieval policies and are worth knowing by name.

The anti-recency boost. Sometimes you want to surface old content — for example, when answering “what did we decide three months ago about X?” the answer is by construction old, and the recency term will push it down. The fix is to recognize the temporal intent of the query and invert the recency term for those queries — multiply by (1 - recency) rather than by recency — or, more often, drop the recency term entirely (set α = 0) when the query carries explicit temporal markers (“yesterday,” “last week,” “three months ago,” “originally”). Mem0 and Letta both expose temporal-intent flags on retrieval for this reason.

The staleness gate. Some content has explicit validity intervals — a knowledge-graph edge with valid_until set, a reflection whose source episodes have been superseded, a fact tagged with a confidence that has been demoted by a sleep-time consolidator. The retrieval policy should filter out (not just down-weight) any candidate that fails an explicit staleness check before scoring. A down-weighted stale candidate can still surface above a fresh one if the down-weight is small relative to the other signals; a hard filter is what enforces the staleness boundary.

The diversity penalty. A top-K dominated by five near-duplicates of the same episode wastes 4/5ths of the prompt budget on redundant content. The mitigation is a maximum-marginal-relevance (MMR) pass after the score-and-rank — for each candidate, deduct a fraction of its score for each higher-ranked candidate it’s textually similar to, so near-duplicates are spread across the top-K rather than concentrated at the top. The same MMR trick the reranking article named for static-corpus search applies here without modification.

The cost-of-rerank trade-off

The full multi-signal rerank costs more than cosine-only retrieval — not in dollars (the rerank is pure CPU math after the embeddings are computed) but in latency (a few extra milliseconds per retrieval) and in implementation complexity (the read-driven state updates have to be transactional with the read). Knowing when the cost pays off is part of the policy choice.

When the rerank pays off. A store with > 1000 episodes per user. A workload with non-trivial temporal structure (multi-month conversations, changing preferences). An agent where retrieval quality is the user-facing failure mode. All three together — long-running personal-assistant or customer-support agents are the canonical fit, and the published memory benchmarks (LongMemEval, LoCoMo) measure exactly this kind of workload — the memory evaluation article covers the per-category breakdown and the precision/recall harness for tuning rerank weights against an actual workload.

When cosine-only is fine. Stores with < 200 episodes per user (the noise floor is low; the right episode is almost always the highest cosine hit anyway). Short-session workloads where the in-session context buffer covers most retrievals. Stateless or near-stateless agents that don’t accumulate per-user history. Adding the multi-signal rerank to these workloads costs latency for marginal benefit; the equal-weights baseline applied to 50 candidates retrieves only marginally better than cosine top-10.

When even the rerank isn’t enough. Very large stores (> 50k episodes per user) where even the coarse recall is noisy. The fix is not a more sophisticated rerank — it’s hierarchical memory paging and sleep-time consolidation to reduce the effective working-set size before retrieval lands. A retrieval policy can only do so much against an over-grown store; the upstream maintenance is what makes the policy’s job tractable.

Trade-offs, failure modes, and gotchas

The static-weights-against-a-shifting-workload bug. Weights set once at deploy time work for the workload that existed at deploy time. As the user’s history grows, their preference distribution shifts, and the optimal weights shift with it. The mitigation is per-user weight learning — run a small evaluation harness offline against each user’s actual query log, find the weights that maximize retrieval-quality metrics, and persist per user. Most teams don’t bother and ship the default; the teams that bother (and instrument) report measurable wins.

The decay-constant cliff. A too-aggressive recency half-life (1-3 days) makes the recency term dominate, and any episode older than a week falls out of the top-K regardless of importance. A too-gentle half-life (90+ days) makes the recency term constant and the policy collapses to importance-plus-similarity. The 7-30 day band is defensible across most workloads; tuning outside that band needs a measured reason.

The normalization-across-batches inconsistency. Min-max normalization across the recalled candidates is what makes the three signals comparable within a query, but it makes scores across queries not comparable — the same episode might score 0.9 in one retrieval and 0.6 in another, depending only on the other candidates. The mitigation: don’t compare scores across queries; the score is an ordering signal within a single retrieval, not an absolute measure of episode quality.

The retrieval-count runaway. A retrieval-count counter that’s never reset can let a handful of frequently-retrieved episodes lock in the top-K permanently. The fix is decay on the retrieval count itself — multiply it by 0.99 per day, or reset it on the sleep-time consolidation pass. Without the decay, the LFU contribution becomes monotonically increasing and the policy ossifies.

The last-read-update-on-every-call bug. Updating last_read_at on every retrieval (without a refresh floor) means a hot query loop flattens the recency curve across the entire top-K — every episode gets bumped, the relative ordering becomes meaningless, and the next retrieval can’t distinguish a genuinely-relevant episode from one that just happened to be in the previous top-K. The refresh-floor parameter is the mitigation; defensible values are 30-120 seconds.

The diversity-collapse failure. Without an MMR pass, a top-K can fill with near-duplicates of the same episode (a long discussion split across many segments, all retrieving together). The model spends the prompt budget on redundant content; the useful coverage is one or two unique episodes plus three duplicates. The MMR deduction (or a simple semantic-deduplication pass) is what keeps the top-K diverse.

The cold-start blindness. A user with no episodic history retrieves nothing on every query; the policy degenerates to “the system prompt is all you have.” This is correct behavior — but it’s also the layer where falling back to a global episodic store (other users’ anonymized episodes in the same domain) can help, with the obvious privacy caveats. The cleaner answer is to populate the new user’s store from a reflection pass over their first few sessions and let the policy work normally from session 3 onward. The cross-session identity article works the full cold-start staircase — anonymous, claimed, bootstrapped, matured, calibrated — and the confidence-marker discipline that keeps the early steps from over-personalizing.

The catastrophic-rerank trap. Adding a cross-encoder rerank stage on top of the multi-signal blend works well until the cross-encoder model goes down (rate-limit, network blip, deployment hiccup). A naive implementation fails the whole retrieval; the user sees an agent that suddenly has amnesia. The mitigation is to make the cross-encoder rerank a post-process on the multi-signal blend — if it fails, return the blend’s top-N directly. The user gets slightly worse retrieval, not no retrieval.

The “we’ll tune the weights later” trap. Shipping with α = β = γ = 1 and never measuring is the default in most projects, and for many workloads it’s fine. The trap is when retrieval quality is the user-facing failure mode but nobody instruments it — the equal-weights default is correct for the published benchmarks, and your workload is not a published benchmark. Build a small evaluation harness from the start (a hand-curated set of (query, expected_episode_id) pairs from real sessions); use it to verify that any retrieval-policy change doesn’t regress; tune the weights against it once the corpus crosses 5k episodes per user. The memory evaluation article is the deep dive on the harness shape, the public benchmarks it should mirror, and the per-category metrics that prevent silent regressions.

The temporal-query blindness. The static weights treat every query the same, even though some queries have explicit temporal markers (“yesterday,” “three months ago,” “originally”) that should invert the recency term, and others have explicit recency markers (“currently,” “now,” “lately”) that should amplify it. The fix is a lightweight intent classifier on the query that adjusts the weights before the rerank; without it, “what did we decide last month?” retrieves like “what should we decide now?” and the user gets confused answers. The temporal-reasoning and provenance article is the deep dive on this and on the related staleness gating that operates one layer above the rerank.

The cross-tenant leak via score. A retrieval policy that doesn’t enforce the where={"user": user} filter on every read is the single most dangerous bug in the memory subsystem — a careless filter on a debug endpoint or a forgotten branch in the read path can surface user A’s episodes to user B with a high score, and the agent will happily quote them. The mitigation is to make the tenant scope structural (the namespace tuple in LangGraph, the required user_id parameter in Mem0) rather than an optional argument; audit every read site for the filter before shipping.

Further reading

  • Generative Agents: Interactive Simulacra of Human Behavior — Park et al., 2023 — the canonical reference for the α·recency + β·importance + γ·similarity formula. The §A.1 retrieval-function appendix has the exact formulation, the 0.995 per-hour decay constant, and the equal-weights baseline. Every production retrieval policy in 2026 is either a port of this formulation or a knowing divergence from it; read it as the empirical baseline the rest of the field calibrates against.
  • MemoryBank: Enhancing Large Language Models with Long-Term Memory — Zhong, Guo et al., AAAI 2024 — introduces the Ebbinghaus-curve-shaped strengthening of memories under repeated retrieval. The mechanism (S = S_0 · exp(-t/η) with η extended by each retrieval) is the cleanest formalization of read-driven salience updates, and the production memory frameworks that ship retrieval-count boosts are mostly variations on this.
  • Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory — Chhikara et al., ECAI 2025 — the three-channel hybrid recall (semantic + BM25 + entity graph) and the channel-fusion pattern that underpins the 2026 production retrieval stack. The LOCOMO-benchmark numbers (91% lower p95 latency, ~90% token-cost reduction vs full-context baselines, 26% LLM-as-judge improvement over OpenAI’s memory baseline) are the strongest published evidence that channel fusion beats single-channel recall by a meaningful margin.
  • ARC: A Self-Tuning, Low Overhead Replacement Cache — Megiddo & Modha, USENIX FAST 2003 — the foundational cache-replacement paper that frames the LRU-vs-LFU trade-off as a workload-tunable parameter rather than a fixed design choice. The agent-memory analogue is the per-user weight tuning that Mem0 and Letta both expose; reading the ARC paper makes the trade-off’s shape much more legible than any agent-memory writeup alone.
  • Temporal Reasoning and Memory Provenance — the read-side companion to this article. Adds the bi-temporal as-of filter, the staleness gate, and per-fact provenance chains, turning the three-signal blend into a four-signal blend that respects dated facts and surfaces source episodes alongside every claim.
  • Episode Segmentation and Salience Scoring — the upstream layer that produces the importance term every retrieval depends on. Without anchored salience scoring (the all-7s collapse failure mode), the importance term in the rerank is constant and the policy degrades to recency-plus-similarity.
  • Memory Evaluation: Benchmarks and Custom Evals — the measurement layer for the policy. The precision/recall harness this article gestures at is worked there in full, with the public-benchmark mapping (LongMemEval’s five abilities, LoCoMo’s multi-hop and temporal categories) that calibrates the rerank weights against what the field actually measures.
  • Memory Conflict, Forgetting, and Embedding Drift — the storage-side mirror of this article’s read-side rerank. The eviction queue runs the same Ebbinghaus decay against the same importance and verification signals, applied to deciding what stays in the store rather than what surfaces in the candidate set. Also covers the contradiction resolver and the embedding-drift migration patterns.