
Sleep-Time Compute and Memory Consolidation

Sleep-time compute for AI agents: background consolidation, the VACUUM parallel, Letta's sleep-time agents, Claude Code's auto-dream, and the cost math.

Jatin Bansal

A long-running coding agent has accumulated 14,000 episodes for one user over six months. The long-term store is healthy, the write policy is doing its job, the reflection pass fires on every importance-threshold breach. And every hot-path call now waits 2.8 seconds for the maintenance overhead to finish — the reflection pass takes 800ms, the compression pass takes 1.2 seconds, the embedding-drift staleness check over a sampled batch takes another 800ms. The user-facing p95 latency, once 600ms, is now 3.4 seconds. The agent feels slow even though the model is fast. None of the maintenance work is wrong, none of it is wasted; the bug is where it runs. Sleep-time compute is the architectural answer to that bug: move the maintenance to a background tier that runs when the user is not waiting, pre-compute the higher-order representations the next hot-path call will need, and let the foreground agent answer in a single fast model call against the materialized state. This article is the deep dive on that tier.

Opening bridge

Yesterday’s piece closed the maintenance-axis triple — compression alongside reflection and the write-policy distillation step. It named the deferred compression strategy (run at session-end or in a background pass) as one of the four trigger modes, but treated the background mechanics as out of scope. Today’s article is the missing piece: not just compression, but every maintenance operation in the memory subsystem — reflection regeneration, write-policy retraining, embedding-drift re-indexing, reflection-staleness re-verification, conflict resolution, dead-episode garbage collection — moved to a background tier that runs while the agent is idle. The frame the rest of the article works from is the one Lin, Snell, Wang, Packer, Wooders, Stoica, and Gonzalez introduced in “Sleep-time Compute: Beyond Inference Scaling at Test-time” (April 2025) and Letta has built out in the months since: there are two compute regimes in an agent system — test-time (the hot path, user is waiting) and sleep-time (the background, user is not waiting) — and the design problem is which work goes in which regime.

Definition

Sleep-time compute is the inference work an agent performs during idle periods between user interactions, against the same context and memory store the test-time path will eventually query, to pre-compute higher-order representations and reduce the test-time work the hot path has to do. Three properties distinguish sleep-time compute from the maintenance operations it shares mechanics with. First, it is temporally offset — sleep-time work runs before the test-time query that consumes it, not in response to it. Second, it is speculative — the sleep-time pass doesn’t know the exact next query, only the space of likely queries, and pre-computes representations that cover that space. Third, it is amortized — the cost of a sleep-time pass is paid once and consumed by many subsequent test-time calls, where the cost of an equivalent test-time pass is paid on every call.

What sleep-time compute is not. It is not reflection — reflection is one operation that might run at sleep time; sleep-time compute is the regime, not the operation. It is not compression — compression is another operation that might run at sleep time; the same regime can hold both. It is not batch inference — batch inference is a throughput-optimized call to the same model that test-time would use; sleep-time compute is the broader category of “any inference work moved off the hot path.” It is not training — sleep-time compute does not update model weights; it updates the context (memory blocks, summaries, indexes, beliefs) the test-time path reads. The Letta team’s framing is the cleanest: sleep-time compute “transforms raw context into learned context” — the same model, the same prompts, but the work happens when nobody is waiting.

Intuition

The mental model that pays off is batch jobs in a transactional database, applied to the agent’s context window. Every production database has two compute regimes: online queries (hot path, user is waiting, latency budget in milliseconds) and offline jobs (cron jobs, VACUUM passes, materialized-view refreshes, statistics updates, log-segment compaction, all running with seconds-to-hours budgets). The design discipline is the same: figure out which work is latency-critical (must run on the hot path), which work is throughput-critical-but-not-latency-critical (can run in the background), and which work is amortizable (can run once and benefit many subsequent online queries). A naive system runs everything on the hot path and pays the full latency cost; a mature system pushes 80% of the work to the background and the hot path serves cached, pre-computed answers.

The agent analogue is exact. The hot-path call is the user’s turn — latency-critical, milliseconds budget, single model call. The sleep-time pass is the offline job — throughput-critical, seconds-to-minutes budget, any number of model calls. Three sleep-time operations cover most of what production agents need: consolidate (reflection over recent episodes, summarization of long spans, embedding regeneration on drift), pre-compute (anticipated-query precomputation as in the Letta paper, materialized views over the belief store, indexed shortcuts for common retrieval patterns), and garbage-collect (dead-episode removal, stale-reflection invalidation, low-quality-memory pruning). The discipline is to keep the hot-path call to a single inference against the materialized state; everything else is sleep-time work that should have already run.

Two design questions force themselves on every sleep-time implementation. The first is when the sleep-time tier runs. There are three families of triggers: a cadence trigger (every N minutes, every K user turns, daily), which is simple but possibly wasteful; an event trigger (after every session, after an importance-threshold breach, on context-window overflow), which is more reactive but makes the budget harder to reason about; and an idle-detection trigger (run when no user turn has arrived for T seconds), which is closest to the OS-scheduler analogue but requires reliable idle signals. Letta’s sleep-time agent fires every N steps of the primary agent by default (sleeptime_agent_frequency=5), which is a cadence trigger counted in agent steps rather than wall-clock time; Claude Code’s auto-dream combines a time cadence with an activity gate (24 hours of activity and at least 5 sessions). The second is what the sleep-time tier writes back to. Three patterns are common: the memory store itself (Letta’s shared memory blocks, the episodic store’s reflection rows), a separate pre-computation cache (anticipated answers indexed by likely query), or the system prompt directly (a learned-context block injected on every test-time call). The choice determines who pays the cache-coherence cost: the sleep-time agent on write, or the test-time agent on every read.
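The three trigger families reduce to three small predicates over the same clock state. The sketch below is illustrative only: TriggerClock and its field names are hypothetical, and the defaults (5 turns, 60 seconds) are placeholders rather than recommendations.

```python
from dataclasses import dataclass


@dataclass
class TriggerClock:
    """Hypothetical snapshot of what a scheduler knows when it polls."""
    now: float                 # current time, in seconds
    last_user_turn_ts: float   # when the last hot-path turn arrived
    turns_since_pass: int      # user turns since the last sleep-time pass
    session_ended: bool        # event flag set by the harness


def cadence_trigger(c: TriggerClock, every_turns: int = 5) -> bool:
    """Fire on a fixed schedule, here counted in user turns."""
    return c.turns_since_pass >= every_turns


def event_trigger(c: TriggerClock) -> bool:
    """Fire in reaction to something that happened, here a session end."""
    return c.session_ended


def idle_trigger(c: TriggerClock, idle_seconds: float = 60.0) -> bool:
    """Fire only when no user turn has arrived for idle_seconds."""
    return (c.now - c.last_user_turn_ts) >= idle_seconds


# 80 seconds of quiet, 3 turns since the last pass, no session end.
clock = TriggerClock(now=1000.0, last_user_turn_ts=920.0,
                     turns_since_pass=3, session_ended=False)
fired = {
    "cadence": cadence_trigger(clock),
    "event": event_trigger(clock),
    "idle": idle_trigger(clock),
}
print(fired)  # {'cadence': False, 'event': False, 'idle': True}
```

A real scheduler would usually OR two of these together (for example idle plus a turn-count ceiling) so that users who never go quiet still get consolidated.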

The distributed-systems parallel — VACUUM, materialized views, and OLAP

Three honest parallels, each load-bearing.

The PostgreSQL VACUUM operation is the closest single analogue. Postgres uses MVCC: every UPDATE creates a new row version, and dead row versions accumulate over time. Without periodic cleanup, the table grows with write volume even when the logical row count is stable; query performance degrades as scans have to skip over more dead rows; index bloat compounds the problem. The fix is VACUUM, a background pass that reclaims dead row space for reuse and updates the visibility map, usually paired with ANALYZE (as VACUUM ANALYZE, or through autovacuum) to refresh the planner’s statistics. Crucially, plain VACUUM runs without blocking online reads and writes: it takes a SHARE UPDATE EXCLUSIVE lock, which conflicts only with schema changes and other maintenance commands. VACUUM FULL rewrites the table under an ACCESS EXCLUSIVE lock and does block, which is why it’s reserved for emergency reorganizations. The agent-memory analogue is precise: every memory write adds an episode, every reflection emits new beliefs, every compression replaces a span. Dead and stale entries accumulate; retrieval quality degrades as the right episode is buried in noise; the fix is a background pass that prunes dead episodes, recomputes statistics (importance distributions, recency profiles), and re-indexes if needed. The sleep-time tier is the agent’s VACUUM.
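As a rough sketch of what a VACUUM-style pass over agent memory could look like, the function below prunes episodes that are both stale and low-importance, keeps anything a consolidation still cites, and recomputes a simple statistic afterwards. The record shape (id, importance, stale, and source_episode_ids as a list) and the 0.1 floor are assumptions made for illustration.

```python
from statistics import mean


def vacuum_episodes(episodes: list[dict], importance_floor: float = 0.1):
    """Prune dead episodes and recompute store statistics.

    An episode counts as dead when it is flagged stale AND its importance
    has decayed below importance_floor. Episodes still cited as evidence
    by a consolidation are never pruned.
    """
    # Anything cited by a consolidation is still live as evidence.
    referenced: set[str] = set()
    for ep in episodes:
        referenced.update(ep.get("source_episode_ids", []))

    kept, pruned_ids = [], []
    for ep in episodes:
        dead = ep.get("stale", False) and ep["importance"] < importance_floor
        if dead and ep["id"] not in referenced:
            pruned_ids.append(ep["id"])
        else:
            kept.append(ep)

    # The "re-statistics" step: refresh aggregates over what survived.
    stats = {
        "count": len(kept),
        "mean_importance": mean(e["importance"] for e in kept) if kept else 0.0,
    }
    return kept, pruned_ids, stats


store = [
    {"id": "e1", "importance": 0.8},
    {"id": "e2", "importance": 0.05, "stale": True},
    {"id": "e3", "importance": 0.05, "stale": True},
    {"id": "c1", "importance": 0.9, "source_episode_ids": ["e1", "e3"]},
]
kept, pruned_ids, stats = vacuum_episodes(store)
print(pruned_ids)      # ['e2']  (e3 survives because c1 cites it)
print(stats["count"])  # 3
```

Keeping cited episodes is the memory-store equivalent of not reclaiming a row version that a live transaction can still see.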

Materialized-view refresh is the second parallel, and the one Letta’s paper formalizes. Postgres ships REFRESH MATERIALIZED VIEW as a manual operation; production systems schedule it through cron or trigger it on data-change events. The trade-off is canonical: stale-but-fast reads (the view) versus fresh-but-expensive reads (the underlying aggregate). The agent-memory analogue, introduced in the reflection article, is: reflections are materialized views over the episodic store, and sleep-time is when the refresh runs. The Letta paper extends this to predicted queries — the sleep-time agent doesn’t just refresh existing views, it materializes views for queries it anticipates but hasn’t yet received. The mechanism: prompt the sleep-time model with the current context and ask “what are the most likely next questions?”; for each predicted question, pre-compute the answer’s reasoning chain; cache the chain; on the test-time call, serve the cached chain instead of recomputing it.
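The predict, pre-compute, cache, serve loop can be sketched without any model at all by injecting the two model calls as plain functions. Everything here is hypothetical scaffolding: AnticipatedAnswerCache, normalize, and the exact-match lookup are illustrative choices, not the paper’s or Letta’s implementation, and a real system would likely match predicted questions by similarity rather than by normalized string equality.

```python
from typing import Callable


def normalize(question: str) -> str:
    """Cheap cache key: lowercase, drop trailing '?', collapse whitespace."""
    return " ".join(question.lower().strip().rstrip("?").split())


class AnticipatedAnswerCache:
    """Materialize reasoning for predicted questions at sleep time."""

    def __init__(self,
                 predict: Callable[[str], list[str]],
                 reason: Callable[[str, str], str]):
        self._predict = predict   # context -> likely next questions
        self._reason = reason     # (context, question) -> reasoning chain
        self._cache: dict[str, str] = {}

    def sleep_time_pass(self, context: str) -> int:
        """Pre-compute and cache a chain for every predicted question."""
        for question in self._predict(context):
            self._cache[normalize(question)] = self._reason(context, question)
        return len(self._cache)

    def answer(self, context: str, question: str) -> tuple[str, bool]:
        """Serve from cache when possible; return (chain, cache_hit)."""
        key = normalize(question)
        if key in self._cache:
            return self._cache[key], True
        return self._reason(context, question), False


# Stand-ins for the model calls, so the sketch runs offline.
reason_calls: list[str] = []

def fake_predict(context: str) -> list[str]:
    return ["What is the deploy target?", "Which test runner is used?"]

def fake_reason(context: str, question: str) -> str:
    reason_calls.append(question)
    return f"chain for: {normalize(question)}"

cache = AnticipatedAnswerCache(fake_predict, fake_reason)
cache.sleep_time_pass("repo context")                  # two reasoning calls, off the hot path
hit_chain, hit = cache.answer("repo context", "what is the deploy target")
miss_chain, miss_hit = cache.answer("repo context", "Who owns the repo?")
print(hit, miss_hit, len(reason_calls))                # True False 3
```

The cache miss falls back to the normal test-time path, which is the property that makes the pre-computation speculative rather than required.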

OLAP-vs-OLTP is the broader frame. Production data systems separate transactional workloads (OLTP — fast small queries against the live state) from analytical workloads (OLAP — slow large queries over historical data), and the architectures diverge precisely because the two have incompatible cost profiles on the same engine. The agent analogue: the test-time path is OLTP (one user turn, sub-second budget, single call against the live context), and the sleep-time path is OLAP (analyze a session-worth of episodes, compute aggregate beliefs, restructure the memory store, multi-minute budget acceptable). Sharing one model and one runtime across both is the same mistake as running OLAP queries on your OLTP replica — it works at small scale and breaks at production scale.

The Letta sleep-time compute paper — the core formulation

The paper that named the regime is Lin et al., “Sleep-time Compute: Beyond Inference Scaling at Test-time” (arXiv:2504.13171, April 17, 2025). Three claims are load-bearing.

Claim 1 — Test-time compute and sleep-time compute trade off. Test-time scaling techniques like chain-of-thought, tree-of-thoughts, and best-of-N sampling improve accuracy by spending more inference at the moment of the query, but the cost is paid synchronously — the user waits and the per-call dollar cost is high. Sleep-time compute moves equivalent reasoning forward in time: the model processes the context before the query arrives, emits intermediate inferences (potential answers to questions the user hasn’t asked yet, refactored reasoning chains, structured indexes over the context), and caches them. When the test-time query arrives, the model has less work to do because the upstream reasoning is already done.

Claim 2 — The benchmarks make the trade-off measurable. The paper introduces Stateful GSM-Symbolic and Stateful AIME — modifications of standard reasoning benchmarks that share a fixed context across many queries, so the sleep-time pre-computation has something to pre-compute against. The Multi-Query GSM-Symbolic extension specifically measures the amortization win: as the number of queries per context increases, the per-query cost of sleep-time compute drops because the upfront cost is amortized over more lookups.
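The amortization argument is simple enough to write down. In the sketch below the cost units and the three numbers (20, 2, 10) are hypothetical and chosen only to make the arithmetic easy to follow; they are not figures from the paper.

```python
import math


def per_query_cost(sleep_cost: float, test_cost: float, n_queries: int) -> float:
    """Average cost per query when one sleep-time pass serves n_queries lookups."""
    if n_queries < 1:
        raise ValueError("n_queries must be at least 1")
    return sleep_cost / n_queries + test_cost


def break_even_queries(sleep_cost: float,
                       reduced_test_cost: float,
                       baseline_test_cost: float):
    """Smallest query count at which sleep-time is no more expensive than
    paying baseline_test_cost on every call. Returns None if the sleep-time
    pass does not reduce test-time cost at all."""
    saving = baseline_test_cost - reduced_test_cost
    if saving <= 0:
        return None
    return math.ceil(sleep_cost / saving)


# Hypothetical units: one sleep-time pass costs 20, it cuts each test-time
# call from 10 down to 2.
SLEEP, REDUCED, BASELINE = 20.0, 2.0, 10.0
for n in (1, 2, 3, 10):
    print(n, round(per_query_cost(SLEEP, REDUCED, n), 2))
# 1 22.0
# 2 12.0
# 3 8.67
# 10 4.0
print(break_even_queries(SLEEP, REDUCED, BASELINE))  # 3
```

Below the break-even count the sleep-time pass is a net loss, which is why context stability (many queries against one context) matters as much as the quality of the pre-computation.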

Claim 3 — The numbers are large enough to design around. The headline results: up to 5x lower test-time compute at equal accuracy, 2.5x lower average cost per query through amortization, and up to 18% higher accuracy on Stateful AIME when sleep-time compute is scaled aggressively. The Pareto-improvement framing is the right one to internalize — for the same total compute budget, splitting the budget between sleep-time and test-time gives strictly better quality than spending all of it at test-time.

Two caveats from the paper worth quoting. The win scales with query predictability — workloads where the test-time queries are predictable from the context (the agent’s user has stable preferences, the codebase is fixed, the document set is small) benefit dramatically; workloads where queries are unpredictable (open-domain Q&A against a fresh context every call) benefit modestly or not at all. The win also requires the sleep-time work to be useful — a sleep-time pass that pre-computes the wrong inferences is pure cost. The two together set the boundary: sleep-time compute pays off when the workload has both context stability and query predictability.

The Letta sleep-time agent — the production harness

Letta operationalized the paper as a multi-agent architecture, shipped in Letta 0.7+ and documented in the sleep-time agents guide. The shape is:

Two agents, one memory. The primary agent handles user interaction, calls user-facing tools, searches recall and archival memory — but it cannot edit the core in-context memory blocks. The sleep-time agent runs in the background and is the only one with write access to the shared memory blocks. The two agents communicate exclusively through the shared memory state — there is no direct message passing; the sleep-time agent’s effect on the primary agent’s behavior is mediated entirely by what it writes into the shared blocks. This is the single-writer specialization of the broader multi-agent shared memory story — the cleanest pattern in that taxonomy, because it sidesteps the concurrent-write problem entirely.

Asynchronous triggering. The sleep-time agent runs every N steps of the primary agent (default sleeptime_agent_frequency=5). It reads the recent conversation, runs whatever consolidation logic the operator has configured (reflection, compression, anticipated-query pre-computation), and writes the results back to the shared blocks. The primary agent’s next call sees the updated context without having to wait for the consolidation — the sleep-time pass is on the critical path for future test-time calls, but not for the current one.

Different models on each tier. The cost discipline the Letta team explicitly recommends is to run the sleep-time agent on a cheap small model (Claude Haiku 4.5 at $1/$5 per million tokens) and the primary agent on whatever model the user-facing quality requires (Claude Sonnet 4.5, GPT-5 family). The cost math falls out naturally — the sleep-time agent runs many calls but each call is cheap; the primary agent runs few calls but each call is high-quality. Mixing the two is the most common cost bug in production sleep-time implementations.
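A back-of-the-envelope budget for the background tier follows directly from the per-token prices. The $1/$5 per million tokens figure is the Haiku 4.5 price quoted above; the token counts per pass and the 12 passes per hour are hypothetical workload assumptions.

```python
def call_cost_usd(input_tokens: int,
                  output_tokens: int,
                  in_price_per_mtok: float,
                  out_price_per_mtok: float) -> float:
    """Dollar cost of one call given per-million-token prices."""
    return (input_tokens * in_price_per_mtok
            + output_tokens * out_price_per_mtok) / 1_000_000


# Prices from the text: $1 input / $5 output per million tokens.
SLEEP_IN_PRICE, SLEEP_OUT_PRICE = 1.0, 5.0

# Hypothetical pass shape: 8,000 tokens of episodes in, 1,500 tokens of JSON out.
pass_cost = call_cost_usd(8_000, 1_500, SLEEP_IN_PRICE, SLEEP_OUT_PRICE)

# Hypothetical ceiling: 12 passes per hour for one continuously idle user.
hourly_ceiling = pass_cost * 12

print(f"per pass: ${pass_cost:.4f}")
print(f"hourly ceiling: ${hourly_ceiling:.3f}")
```

Running the same function with the foreground model’s prices gives the cost of the mixing bug: the token counts stay the same and only the two price arguments change.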

Tool-level separation. Letta exposes memory_rethink as a sleep-time-only tool that performs large-scale block rewrites — operations too expensive to do during a user turn. The primary agent only sees memory_insert and memory_replace, which are cheap surgical edits. The tool boundary enforces the discipline: the primary agent cannot invoke the expensive consolidation operation on the hot path even if it wanted to.
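The tool boundary amounts to a per-role allowlist checked before dispatch. The sketch below borrows the three tool names from the text, but the harness around them (TOOLS_BY_ROLE, dispatch, and the toy handlers) is hypothetical, and giving the sleep-time role the two surgical tools as well is an assumption.

```python
# Which tools each role may call. memory_rethink is sleep-time only.
TOOLS_BY_ROLE = {
    "primary": frozenset({"memory_insert", "memory_replace"}),
    "sleep_time": frozenset({"memory_insert", "memory_replace", "memory_rethink"}),
}

blocks = {"project": "uses npm"}


def memory_insert(label: str, text: str) -> None:
    """Cheap surgical edit: append to a block."""
    blocks[label] = (blocks.get(label, "") + " " + text).strip()


def memory_replace(label: str, old: str, new: str) -> None:
    """Cheap surgical edit: swap one substring."""
    blocks[label] = blocks[label].replace(old, new)


def memory_rethink(label: str, new_value: str) -> None:
    """Expensive operation: rewrite the whole block."""
    blocks[label] = new_value


HANDLERS = {
    "memory_insert": memory_insert,
    "memory_replace": memory_replace,
    "memory_rethink": memory_rethink,
}


def dispatch(role: str, tool: str, **kwargs) -> None:
    """Refuse any tool that is not on the caller's allowlist."""
    if tool not in TOOLS_BY_ROLE.get(role, frozenset()):
        raise PermissionError(f"{role!r} may not call {tool!r}")
    HANDLERS[tool](**kwargs)


dispatch("primary", "memory_replace", label="project", old="npm", new="pnpm")
try:
    dispatch("primary", "memory_rethink", label="project", new_value="rewritten")
except PermissionError as exc:
    print(exc)  # 'primary' may not call 'memory_rethink'
dispatch("sleep_time", "memory_rethink", label="project",
         new_value="uses pnpm; monorepo")
print(blocks["project"])  # uses pnpm; monorepo
```

The check lives in the harness rather than in the prompt, so the hot path cannot reach the expensive operation even if the model asks for it.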

The Letta forum’s best-practices guide names the operational gotchas — frequency tuning matters more than the consolidation algorithm, expiry policies are reference-count-based rather than timestamp-based, full reorganization should fire weekly not per-session. The shape is the same operational pattern as a production database’s autovacuum tuning: the algorithm is generic, the tuning is workload-specific, and getting the tuning wrong costs more than getting the algorithm wrong.

Claude Code’s auto-dream — the developer-facing version

The pattern has propagated. Claude Code shipped auto-dream in early 2026 — a periodic memory-consolidation routine that fires after 24 hours of activity and at least 5 new sessions. The four-phase cycle is the cleanest published shape of a sleep-time consolidation pass. Phase one: scan the memory directory, read MEMORY.md, skim existing topic files to understand the current belief state. Phase two: search recent session transcripts for high-value patterns — user corrections, explicit save requests, recurring themes, key decisions. Phase three: merge new facts into the durable memory files; delete contradicted notes; convert relative dates to absolute ones. Phase four: trim the index back under a length budget (200 lines in the public spec). The cycle typically takes 8-10 minutes and is sandboxed — during a dream cycle, Claude can only write to memory files, never to source code or configuration.
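Two of the mechanical steps in that cycle, converting relative dates to absolute ones and trimming the index to a length budget, are easy to sketch. This is not Claude Code’s implementation: the helper names are hypothetical, only three relative words are handled, and dropping the oldest lines first is an assumed trimming policy.

```python
import re
from datetime import date, timedelta

# Relative words handled by this sketch, as day offsets from the note's date.
RELATIVE_OFFSETS = {"today": 0, "yesterday": -1, "tomorrow": 1}
RELATIVE_PATTERN = re.compile(r"\b(today|yesterday|tomorrow)\b", re.IGNORECASE)


def absolutize_dates(note: str, written_on: date) -> str:
    """Replace relative date words with ISO dates anchored at written_on."""
    def to_iso(match: re.Match) -> str:
        offset = RELATIVE_OFFSETS[match.group(0).lower()]
        return (written_on + timedelta(days=offset)).isoformat()

    return RELATIVE_PATTERN.sub(to_iso, note)


def trim_index(lines: list[str], max_lines: int = 200) -> list[str]:
    """Keep the index under budget by dropping the oldest lines first
    (assumes lines are appended in chronological order)."""
    return lines if len(lines) <= max_lines else lines[-max_lines:]


note = absolutize_dates("Decided yesterday to switch to pnpm", date(2026, 1, 15))
print(note)  # Decided 2026-01-14 to switch to pnpm

index = trim_index([f"entry {i}" for i in range(250)])
print(len(index), index[0])  # 200 entry 50
```

The date conversion is what keeps a merged note meaningful later: "yesterday" is only correct on the day it was written.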

The pattern matches the Letta architecture in shape but trades the dual-agent model for a single-agent-with-restricted-mode pattern: there is one Claude Code agent, but in dream mode it runs against a different prompt with different tool permissions. The two designs converge on the same property — the sleep-time path has different writability than the hot path, and the operational discipline is enforced by the harness rather than by convention.

The auto-dream rollout is also the cleanest public example of user-visible sleep-time compute. Most production sleep-time systems are invisible to the end user — the consolidation runs, the test-time call serves the materialized state, the user just experiences faster and more coherent responses. Claude Code surfaces the dream cycle explicitly via /dream and the auto-dream cadence, which gives the user a mental model for why the agent’s memory feels coherent across long sessions. The transparency is a usability win; the underlying mechanism is the same sleep-time-compute pattern.

The runnable implementations

Both implementations realize a minimal sleep-time consolidation worker: a queue of pending episodes, a background loop that triggers at idle, a consolidation pass that runs reflection + compression + dead-episode garbage collection against a shared store. The patterns scale to the full Letta-style two-agent architecture; a single-process background loop is a defensible starting point.

Python — a sleep-time worker against a shared episodic store

Uses the Anthropic SDK for the consolidation calls and Chroma for the episodic substrate. The worker runs in a separate thread with an idle detector that fires the consolidation pass when no test-time turn has arrived for IDLE_THRESHOLD_SECONDS. Install: pip install anthropic chromadb.

python
# pip install "anthropic>=0.40.0" chromadb
import json
import threading
import time
import uuid
from dataclasses import dataclass, field
from queue import Queue, Empty
from anthropic import Anthropic
import chromadb

client = Anthropic()
chroma = chromadb.PersistentClient(path="./memory_store")
episodes = chroma.get_or_create_collection("episodes")

# ----- tunables -----
SLEEP_TIME_MODEL = "claude-haiku-4-5"   # cheap small model for background work
IDLE_THRESHOLD_SECONDS = 60.0           # fire consolidation after 60s of quiet
MAX_PASSES_PER_HOUR = 12                # rate limit the background loop
RECENT_WINDOW = 50                      # episodes pulled into each pass


@dataclass
class WorkerState:
    """Shared state between the hot-path agent and the sleep-time worker."""
    last_user_turn_ts: float = field(default_factory=time.time)
    pending_user: str | None = None
    passes_run: int = 0
    last_pass_ts: float = 0.0
    stop: threading.Event = field(default_factory=threading.Event)


def note_user_turn(state: WorkerState, user: str) -> None:
    """Call this on every hot-path turn. Updates the idle clock."""
    state.last_user_turn_ts = time.time()
    state.pending_user = user


# ----- the consolidation pass itself -----
CONSOLIDATION_PROMPT = """You are a sleep-time consolidator for an agent's memory.

Below are recent raw episodes for user {user}. Your job is to:
1. Identify groups of episodes that are about the same topic / person / project.
2. For each group, emit a single consolidated note that captures the durable facts.
3. Flag any episodes that contradict each other so the foreground agent can resolve them.
4. Identify episodes that are stale (the user has clearly changed their mind since).

Return JSON with exactly these keys:
{{
  "consolidations": [
    {{"note": str, "source_episode_ids": [str], "topic": str}}
  ],
  "contradictions": [
    {{"episodes": [str], "issue": str}}
  ],
  "stale_episode_ids": [str]
}}

Episodes:
{episodes}
"""


def render_episodes(records: list[dict]) -> str:
    return "\n".join(
        f"[{r['id']}] ({r['metadata'].get('type', 'episode')}, "
        f"importance={r['metadata'].get('importance', 0):.2f}) "
        f"{r['document']}"
        for r in records
    )


def run_consolidation_pass(user: str) -> dict:
    """The sleep-time work itself. Reads episodes, emits consolidations + flags."""
    # Pull a recent window for this user.
    raw = episodes.get(where={"user": user})
    if not raw["ids"]:
        return {"consolidations": [], "contradictions": [], "stale_episode_ids": []}

    triples = list(zip(raw["ids"], raw["documents"], raw["metadatas"]))
    triples.sort(key=lambda t: t[2].get("ts", 0), reverse=True)
    window = [
        {"id": eid, "document": doc, "metadata": meta}
        for eid, doc, meta in triples[:RECENT_WINDOW]
    ]

    prompt = CONSOLIDATION_PROMPT.format(
        user=user, episodes=render_episodes(window)
    )
    resp = client.messages.create(
        model=SLEEP_TIME_MODEL,
        max_tokens=1500,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    text = resp.content[0].text.strip()
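    if text.startswith("```"):
        # Take the body between the first pair of fences, then drop an optional
        # "json" language tag. (str.lstrip("json\n") would strip a character set,
        # not a prefix.)
        text = text.split("```")[1].removeprefix("json").strip()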
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return {"consolidations": [], "contradictions": [], "stale_episode_ids": []}


def apply_consolidation(user: str, result: dict) -> int:
    """Write consolidations as new memory entries, mark stale episodes."""
    written = 0
    for con in result.get("consolidations", []):
        if not con.get("note") or len(con.get("source_episode_ids", [])) < 2:
            continue  # require at least 2 grounding episodes per consolidation
        cid = str(uuid.uuid4())
        episodes.add(
            ids=[cid],
            documents=[con["note"]],
            metadatas=[{
                "user": user,
                "type": "consolidation",
                "topic": con.get("topic", ""),
                "importance": 0.9,  # consolidations inherit high importance
                "ts": time.time(),
                "last_read": time.time(),
                "source_episode_ids": ",".join(con["source_episode_ids"]),
            }],
        )
        written += 1

    # Mark stale episodes by lowering their importance — the recency-weighted
    # retrieval naturally demotes them without losing the audit trail.
    for sid in result.get("stale_episode_ids", []):
        try:
            existing = episodes.get(ids=[sid])
            if existing["ids"]:
                meta = existing["metadatas"][0]
                meta["importance"] = meta.get("importance", 0.5) * 0.3
                meta["stale"] = True
                episodes.update(ids=[sid], metadatas=[meta])
        except Exception:
            pass

    return written


# ----- the background loop -----
def sleep_time_worker(state: WorkerState) -> None:
    """Runs in a background thread. Fires consolidation when the agent is idle."""
    while not state.stop.is_set():
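        # Poll cadence, not the trigger threshold. Event.wait returns True as
        # soon as stop is set, so shutdown does not have to wait out the sleep.
        if state.stop.wait(5.0):
            break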

        now = time.time()
        idle_for = now - state.last_user_turn_ts
        if idle_for < IDLE_THRESHOLD_SECONDS:
            continue  # user just turned; don't compete for resources

        # Rate-limit: don't run more often than budget allows
        if state.last_pass_ts and (now - state.last_pass_ts) < (3600 / MAX_PASSES_PER_HOUR):
            continue

        user = state.pending_user
        if not user:
            continue

        # Do the work
        try:
            result = run_consolidation_pass(user)
            written = apply_consolidation(user, result)
            state.passes_run += 1
            state.last_pass_ts = now
            if written or result.get("contradictions") or result.get("stale_episode_ids"):
                print(
                    f"[sleep-time] pass #{state.passes_run} for {user}: "
                    f"wrote {written} consolidations, "
                    f"flagged {len(result.get('contradictions', []))} contradictions, "
                    f"marked {len(result.get('stale_episode_ids', []))} stale"
                )
        except Exception as e:
            print(f"[sleep-time] pass failed: {e}")


# ----- usage sketch -----
if __name__ == "__main__":
    state = WorkerState(pending_user="alice")
    worker = threading.Thread(target=sleep_time_worker, args=(state,), daemon=True)
    worker.start()

    # Hot-path simulation: every user turn calls note_user_turn.
    # When the user goes quiet for IDLE_THRESHOLD_SECONDS, consolidation fires.
    note_user_turn(state, "alice")
    time.sleep(90)  # simulate a 90-second idle gap
    note_user_turn(state, "alice")  # resume; the background pass has already run

    state.stop.set()
    worker.join(timeout=5)

Four things to notice. First, the idle threshold is the load-bearing parameter: too low and the worker competes with the hot path; too high and the worker never fires for users who don’t have long idle gaps. Sixty seconds is defensible for chat-shaped workloads; coding-agent sessions with multi-minute tool-call cycles can tolerate a much higher threshold (10 minutes plus). Second, the rate limiter is not optional: without MAX_PASSES_PER_HOUR, a buggy idle detector can fire the consolidation pass dozens of times per hour and burn the API budget on redundant work. Third, the cheap-model discipline is structural: SLEEP_TIME_MODEL = "claude-haiku-4-5" is a several-fold cost reduction vs. running the consolidation on the foreground model (the exact multiple depends on the foreground model’s price), and the consolidation work doesn’t typically benefit from the larger model. Fourth, the staleness marker is a demotion, not a deletion: setting importance *= 0.3 and stale = True lets the recency-weighted retrieval demote the episode naturally while preserving the audit trail. Hard deletion would break the reflection’s evidence chain, because every reflection that cited the stale episode would have a dangling pointer.

TypeScript — the same shape with the Vercel AI SDK

The TypeScript port uses the Vercel AI SDK for the consolidation calls and runs the worker via setInterval. Install: pnpm add ai @ai-sdk/anthropic zod chromadb.

typescript
// pnpm add ai @ai-sdk/anthropic zod chromadb
import { generateObject } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { ChromaClient } from "chromadb";
import { z } from "zod";
import { randomUUID } from "node:crypto";

const chroma = new ChromaClient({ path: "http://localhost:8000" });
const SLEEP_TIME_MODEL = anthropic("claude-haiku-4-5");

const IDLE_THRESHOLD_MS = 60_000;
const MAX_PASSES_PER_HOUR = 12;
const RECENT_WINDOW = 50;

const ConsolidationSchema = z.object({
  consolidations: z.array(
    z.object({
      note: z.string(),
      source_episode_ids: z.array(z.string()),
      topic: z.string().optional().default(""),
    }),
  ),
  contradictions: z.array(
    z.object({
      episodes: z.array(z.string()),
      issue: z.string(),
    }),
  ),
  stale_episode_ids: z.array(z.string()),
});
type ConsolidationResult = z.infer<typeof ConsolidationSchema>;

type WorkerState = {
  user: string | null;
  lastUserTurnTs: number;
  passesRun: number;
  lastPassTs: number;
  stop: boolean;
};

export const newWorkerState = (): WorkerState => ({
  user: null,
  lastUserTurnTs: Date.now(),
  passesRun: 0,
  lastPassTs: 0,
  stop: false,
});

export const noteUserTurn = (state: WorkerState, user: string): void => {
  state.lastUserTurnTs = Date.now();
  state.user = user;
};

async function runConsolidationPass(user: string): Promise<ConsolidationResult> {
  const collection = await chroma.getOrCreateCollection({ name: "episodes" });
  const all = await collection.get({ where: { user } });
  if (!all.ids.length) {
    return { consolidations: [], contradictions: [], stale_episode_ids: [] };
  }
  type Rec = { id: string; document: string; metadata: Record<string, unknown> };
  const triples: Rec[] = all.ids.map((id, i) => ({
    id,
    document: all.documents?.[i] ?? "",
    metadata: all.metadatas?.[i] ?? {},
  }));
  triples.sort(
    (a, b) =>
      ((b.metadata as { ts?: number }).ts ?? 0) -
      ((a.metadata as { ts?: number }).ts ?? 0),
  );
  const window = triples.slice(0, RECENT_WINDOW);
  const renderEpisodes = window
    .map(
      (r) =>
        `[${r.id}] (${(r.metadata as { type?: string }).type ?? "episode"}, ` +
        `importance=${((r.metadata as { importance?: number }).importance ?? 0).toFixed(2)}) ` +
        r.document,
    )
    .join("\n");

  const prompt =
    `You are a sleep-time consolidator for an agent's memory. Below are recent episodes for user ${user}. ` +
    `Group related episodes, emit consolidated notes (require >=2 source episodes), flag contradictions, ` +
    `and list episodes the user has clearly superseded.\n\nEpisodes:\n${renderEpisodes}`;

  const { object } = await generateObject({
    model: SLEEP_TIME_MODEL,
    schema: ConsolidationSchema,
    prompt,
  });
  return object;
}

async function applyConsolidation(
  user: string,
  result: ConsolidationResult,
): Promise<number> {
  const collection = await chroma.getOrCreateCollection({ name: "episodes" });
  let written = 0;

  for (const con of result.consolidations) {
    if (!con.note || con.source_episode_ids.length < 2) continue;
    await collection.add({
      ids: [randomUUID()],
      documents: [con.note],
      metadatas: [
        {
          user,
          type: "consolidation",
          topic: con.topic ?? "",
          importance: 0.9,
          ts: Date.now() / 1000,
          last_read: Date.now() / 1000,
          source_episode_ids: con.source_episode_ids.join(","),
        },
      ],
    });
    written += 1;
  }

  for (const sid of result.stale_episode_ids) {
    try {
      const existing = await collection.get({ ids: [sid] });
      if (!existing.ids.length) continue;
      const meta = existing.metadatas?.[0] ?? {};
      const imp = (meta as { importance?: number }).importance ?? 0.5;
      await collection.update({
        ids: [sid],
        metadatas: [{ ...meta, importance: imp * 0.3, stale: true }],
      });
    } catch {
      /* ignore missing rows */
    }
  }

  return written;
}

export function startSleepTimeWorker(state: WorkerState): () => void {
  // A pass can outlast the 5s tick; this flag keeps ticks from overlapping.
  let running = false;
  const interval = setInterval(async () => {
    if (state.stop || running) return;
    const now = Date.now();
    const idleFor = now - state.lastUserTurnTs;
    if (idleFor < IDLE_THRESHOLD_MS) return;
    if (
      state.lastPassTs &&
      now - state.lastPassTs < (3_600_000 / MAX_PASSES_PER_HOUR)
    ) {
      return;
    }
    if (!state.user) return;

    running = true;
    try {
      const result = await runConsolidationPass(state.user);
      const written = await applyConsolidation(state.user, result);
      state.passesRun += 1;
      if (
        written ||
        result.contradictions.length ||
        result.stale_episode_ids.length
      ) {
        console.log(
          `[sleep-time] pass #${state.passesRun} for ${state.user}: ` +
          `wrote ${written} consolidations, ` +
          `flagged ${result.contradictions.length} contradictions, ` +
          `marked ${result.stale_episode_ids.length} stale`,
        );
      }
    } catch (e) {
      console.error("[sleep-time] pass failed", e);
    } finally {
      // Stamp failed passes too, so a persistent error is rate-limited
      // instead of retrying (and billing) on every tick.
      state.lastPassTs = now;
      running = false;
    }
  }, 5_000);

  return () => {
    state.stop = true;
    clearInterval(interval);
  };
}

The architectural shape is identical to the Python version: idle detection, rate limiting, cheap-model consolidation, staleness demotion. The TypeScript version uses generateObject from the Vercel AI SDK to bind the structured-output schema directly to the consolidation call; the Python version achieves the same constraint via prompt-anchored JSON. Both rely on the same insight: the sleep-time worker’s correctness depends on the structured schema, because a free-form output is hard to apply mechanically and easy to misinterpret.

Anticipated-query pre-computation — the second sleep-time operation

Consolidation is the cleanup side of sleep-time work; pre-computation is the speculative side, and it’s what the Letta paper specifically validates. The mechanism: at sleep time, prompt the model with the current context and ask “what are the K most likely next user queries?”; for each predicted query, run the full test-time reasoning chain now, and cache the result under the predicted query, tagged with a hash of the context it was computed against. When the actual test-time query arrives, the harness checks the cache: a hit serves the pre-computed answer in one cheap lookup; a miss falls through to the regular test-time path.

The implementation is straightforward: a sleep-time worker generates predicted queries via a small model, runs the foreground model against each, and stores the results in a cache keyed by embedding similarity rather than exact match; the test-time hot path checks the cache before invoking the full reasoning. The win is the Letta paper’s headline claim: roughly 5x less test-time compute to reach the same accuracy, and about 2.5x lower average per-query cost when the sleep-time work is amortized across several related queries on the same context. The cost is the sleep-time pass that emits the predictions and runs the speculative reasoning, which is non-trivial and entirely wasted if the predicted queries don’t match the actual ones.
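To make the “entirely wasted” risk concrete, here is a back-of-envelope sketch of the hit rate a pre-computation pass needs before it pays for itself. The cost model and every field name are assumptions made for illustration, not something the Letta paper defines: one prediction call, K speculative runs, and a fixed saving per cache hit.

```typescript
// Illustrative cost model (an assumption, not from the paper). Units are
// arbitrary but must be consistent: dollars, tokens, or GPU-seconds.
interface PrecomputeCosts {
  predictionCost: number; // one small-model call that emits the K predicted queries
  perQueryCost: number;   // running the full reasoning chain for one predicted query
  k: number;              // predicted queries per sleep-time pass
  testTimeCost: number;   // cost of answering one query on the hot path (saved on a hit)
  queriesServed: number;  // real queries that arrive before the cache is invalidated
}

// Minimum fraction of real queries that must hit the cache for one pass to break even.
function breakEvenHitRate(c: PrecomputeCosts): number {
  const passCost = c.predictionCost + c.k * c.perQueryCost;
  const maxSaving = c.queriesServed * c.testTimeCost;
  return maxSaving === 0 ? Infinity : passCost / maxSaving;
}

const hitRate = breakEvenHitRate({
  predictionCost: 1,
  perQueryCost: 10,
  k: 5,
  testTimeCost: 10,
  queriesServed: 20,
});
console.log(hitRate.toFixed(3)); // 0.255
```

With these made-up numbers the pass breaks even once roughly a quarter of the real queries hit the cache. A result above 1 means the pass cannot pay for itself on cost alone, although it may still be worth running for the latency it removes from the hot path.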

The design knob is prediction quality. Two patterns help. First, bias the predictions toward the recent context — a user who has just asked about deployment is more likely to ask about deployment again than about an unrelated topic. Second, score the cached predictions and serve only high-confidence hits — a low-confidence match falls through to the test-time path rather than serving a stale answer. The pattern is the same as a database cache with conservative TTL and explicit invalidation: cache aggressively, serve cautiously.
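The “cache aggressively, serve cautiously” rule can be sketched as a similarity-keyed cache with a serving threshold. This is a minimal in-memory illustration under stated assumptions: the class name, the 0.9 threshold, and the linear scan are all hypothetical, and embeddings are passed in as plain number arrays so the sketch does not depend on any particular embedding API.

```typescript
type Embedding = number[];

interface PredictedAnswer {
  query: string;        // the predicted query text
  embedding: Embedding; // embedding of the predicted query
  answer: string;       // answer computed at sleep time
}

// Cosine similarity; returns 0 for zero-length or all-zero vectors.
function cosine(a: Embedding, b: Embedding): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < Math.max(a.length, b.length); i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    na += x * x;
    nb += y * y;
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

class PredictionCache {
  private entries: PredictedAnswer[] = [];
  private minSimilarity: number;

  // minSimilarity is the serving threshold; 0.9 is an arbitrary starting point.
  constructor(minSimilarity = 0.9) {
    this.minSimilarity = minSimilarity;
  }

  // Called by the sleep-time worker after it has answered a predicted query.
  put(entry: PredictedAnswer): void {
    this.entries.push(entry);
  }

  // Called on the hot path. Returns null on a miss or a low-confidence match,
  // in which case the caller falls through to the regular test-time path.
  lookup(queryEmbedding: Embedding): PredictedAnswer | null {
    let best: PredictedAnswer | null = null;
    let bestScore = -Infinity;
    for (const entry of this.entries) {
      const score = cosine(queryEmbedding, entry.embedding);
      if (score > bestScore) {
        best = entry;
        bestScore = score;
      }
    }
    return best !== null && bestScore >= this.minSimilarity ? best : null;
  }
}
```

The threshold is the knob worth tuning against real traffic: set it too low and the cache serves answers to questions the user did not ask, set it too high and the sleep-time spend rarely pays off.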

When sleep-time compute beats running-on-the-hot-path — and when it doesn’t

Sleep-time compute wins when the workload has both context stability and query predictability. A coding agent working on a fixed codebase with a fixed user has both: the context (the codebase, the user’s preferences) doesn’t change call-to-call, and the query distribution is concentrated on a small space (debug this, refactor that, explain this function). A long-running customer-support agent with a stable customer history has both. A personal-assistant agent in a long conversation has both. These are the workloads to which the Letta paper’s headline numbers (5x less test-time compute, 2.5x lower per-query cost, up to 18% higher accuracy) apply most directly.

Sleep-time compute wins when the foreground call’s latency budget is tight. Interactive agents with sub-second latency targets cannot afford to run reflection, compression, and embedding-drift checks on the hot path; the latency cost is the user-facing failure mode. Sleep-time compute is the only architectural answer that preserves the hot-path budget while still doing the maintenance work.

Sleep-time compute wins when the maintenance operations are amortizable. A reflection pass over 50 episodes is one model call that can serve dozens of subsequent retrievals; running the reflection on each retrieval would be N model calls for N retrievals. The wider the amortization ratio, the more the sleep-time path dominates. Compression, materialized-view refresh, embedding regeneration — all share this shape.

Hot-path compute wins when the workload is unpredictable and short-lived. A chatbot answering open-domain questions with no user history has neither context stability nor query predictability — sleep-time work is pure overhead because there’s nothing to amortize. Same for batch-processing agents that handle one query at a time and never revisit the same context. Don’t ship a sleep-time tier for these workloads; the cost is real and the benefit is zero.

Hot-path compute wins when the sleep-time pass would compete with the hot path for the same resources. A single-machine deployment with no idle time has nowhere to put the sleep-time work without slowing down the hot path. The Letta paper assumes a separate compute budget for the sleep-time tier; if that assumption doesn’t hold, the architecture collapses. The fix is either dedicated background workers (the Letta architecture) or a deployment with genuine idle windows (most user-facing agents qualify because users sleep, take breaks, and context-switch).

Hot-path compute wins for low-importance state. Not every memory operation deserves a sleep-time pass. Trivial maintenance (clearing a transient working-memory scratchpad, updating a single counter) is cheaper on the hot path than the orchestration overhead of a sleep-time queue. The discipline: sleep-time the operations whose cost dominates the orchestration, not the ones whose cost is dominated by the orchestration.

Trade-offs, failure modes, and gotchas

The over-consolidation cliff. The most common sleep-time bug: the consolidator runs too often, consolidates too aggressively, and loses the granularity the test-time path needs. The user asks “when did I tell you I was vegetarian?” and the agent answers “you’re vegetarian”, which is technically correct but does not answer the question that was asked. The fix is the same as the reflection-vs-summarization distinction: consolidations are additions to the store, not replacements; the raw episodes are preserved alongside the consolidation; the read path can ask for either tier. The Letta forum’s best-practices guide names “don’t over-consolidate” as one of the top three operational gotchas.

The stale-precomputation bug. Anticipated-query pre-computation caches an answer based on the context at sleep time; if the context changes between the cache write and the test-time read (new episode admitted, contradicting fact written, embedding model updated), the cached answer is now stale. The mitigation is a cache invalidation signal — every write to the underlying store bumps a version counter; the cache check verifies the version matches before serving. Without this, the sleep-time tier silently degrades quality on every store update.
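A minimal sketch of the version-counter check, assuming a single process and an in-memory cache. The class and method names here are hypothetical; in a real harness the counter would live next to the store itself (a row, a Redis key) so that every writer bumps the same value.

```typescript
// Monotonic counter that every write to the memory store must bump.
class StoreVersion {
  private version = 0;

  bump(): number {
    this.version += 1;
    return this.version;
  }

  current(): number {
    return this.version;
  }
}

interface CachedAnswer {
  answer: string;
  storeVersion: number; // store version the answer was computed against
}

class VersionCheckedCache {
  private entries = new Map<string, CachedAnswer>();
  private store: StoreVersion;

  constructor(store: StoreVersion) {
    this.store = store;
  }

  // versionAtStart must be read BEFORE the sleep-time reasoning begins, so a
  // write that lands while the answer is being computed still invalidates it.
  put(key: string, answer: string, versionAtStart: number): void {
    this.entries.set(key, { answer, storeVersion: versionAtStart });
  }

  // Serve only if the store has not changed since the answer was computed.
  get(key: string): string | null {
    const hit = this.entries.get(key);
    if (!hit) return null;
    if (hit.storeVersion !== this.store.current()) {
      this.entries.delete(key); // stale: drop it and fall through to test time
      return null;
    }
    return hit.answer;
  }
}
```

A single global counter is deliberately coarse: any write invalidates every cached answer. That is the safe default; a per-user or per-topic counter trades some safety for a higher hit rate.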

The cost-doubling failure. A sleep-time tier that runs too often, doesn’t rate-limit, and uses the same expensive model as the foreground agent is the worst of both worlds — the full hot-path latency plus an equivalent background bill. The mitigation is the discipline named above: cheap-model on the sleep-time tier, hard rate limits, idle-only triggering. The Letta team’s recommended sleep-time frequency (sleeptime_agent_frequency=5 to 10, run on Claude Haiku 4.5 not Sonnet) is a defensible default; deviating from it without a measurement reason is the canonical cost bug.

The race-condition-on-shared-state bug. The sleep-time worker and the hot-path agent share the memory store; concurrent writes can produce inconsistent state. Two episodes get marked stale by the sleep-time worker while the hot-path agent is reading them; a consolidation gets written while a reflection is being computed; the test-time agent sees half-applied state. The mitigation is the same as in any concurrent system: optimistic locking (versions on each row, retry on conflict), or single-writer discipline (only the sleep-time worker writes; the hot-path agent reads only). Letta enforces the latter: the primary agent cannot write to core memory blocks at all. That is the cleaner design, and it is the canonical reference point in the multi-agent shared memory taxonomy for “when in doubt, restrict writes to one agent.”
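For the optimistic-locking option, the shape is a compare-and-swap on a per-row version. The sketch below uses a hypothetical in-memory `RowStore`; with a real database the same check would be a conditional update (for example `WHERE version = ?`), and the 0.3 demotion factor simply mirrors the staleness demotion in the worker above.

```typescript
interface Row {
  id: string;
  importance: number;
  stale: boolean;
  version: number; // incremented on every successful write
}

type RowPatch = Partial<Pick<Row, "importance" | "stale">>;

class RowStore {
  private rows = new Map<string, Row>();

  insert(id: string, importance: number): void {
    this.rows.set(id, { id, importance, stale: false, version: 1 });
  }

  // Returns a copy so callers cannot mutate stored state directly.
  read(id: string): Row | null {
    const row = this.rows.get(id);
    return row ? { ...row } : null;
  }

  // Compare-and-swap: applies the patch only if nobody has written the row
  // since the caller read it. Returns null on a version conflict.
  update(id: string, expectedVersion: number, patch: RowPatch): Row | null {
    const current = this.rows.get(id);
    if (!current || current.version !== expectedVersion) return null;
    const next: Row = { ...current, ...patch, version: current.version + 1 };
    this.rows.set(id, next);
    return { ...next };
  }
}

// Read, modify, write; retry from a fresh read when another writer got in first.
function markStale(store: RowStore, id: string, maxRetries = 3): Row | null {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const row = store.read(id);
    if (!row) return null; // row no longer exists
    const updated = store.update(id, row.version, {
      importance: row.importance * 0.3,
      stale: true,
    });
    if (updated) return updated;
  }
  return null; // gave up after repeated conflicts
}
```

In this synchronous sketch the retry loop never actually loops; it earns its keep once `read` and `update` are network calls and another writer can land between them.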

The prediction-quality-degenerates-to-noise failure. A sleep-time pass that emits low-quality predicted queries fills the cache with junk; the test-time path’s cache hits are wrong; the agent quality drops. The mitigation is to measure the cache hit quality (probe the agent’s answer with and without the cache; compare) and prune low-quality cache entries. This is the same Goodhart’s-law trap as compression: optimizing for cache hit rate without measuring quality degrades the system.

The sleep-time-as-procrastination anti-pattern. A team adopts sleep-time compute and starts deferring everything to the sleep-time tier — including operations that are genuinely latency-critical (the user’s immediate question depends on this consolidation; deferring means a worse answer). The discipline is to ask: does the test-time path need the result of this operation, or does it benefit from it? Latency-critical operations stay on the hot path; benefit-only operations move to sleep time. Confusing the two produces an agent that responds fast but with worse-quality answers — the wrong trade-off in most workloads.

The idle-detection-too-strict failure. A user who sends a message every five seconds for an hour will never trigger a 60-second idle threshold; the sleep-time pass never runs; the consolidation work accumulates indefinitely. The mitigation is a fallback cadence: even if the idle threshold isn’t met, fire a consolidation pass after K turns (Letta’s sleeptime_agent_frequency=5 is this fallback), or after T seconds of wall-clock time. The combination (idle-or-cadence, whichever fires first) handles both spiky and steady workloads.
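The idle-or-cadence rule reduces to one pure decision function. This is a sketch under assumptions: `lastUserTurnTs` and `lastPassTs` follow the worker state used earlier, while `turnsSinceLastPass` and the config fields are hypothetical names introduced here.

```typescript
interface TriggerState {
  lastUserTurnTs: number;     // ms epoch of the most recent user turn
  lastPassTs: number;         // ms epoch of the most recent consolidation pass
  turnsSinceLastPass: number; // hypothetical counter, reset after each pass
}

interface TriggerConfig {
  idleThresholdMs: number; // e.g. 60_000
  maxTurns: number;        // K: fire after this many turns even if never idle
  maxWallClockMs: number;  // T: fire after this long even if never idle
}

type TriggerReason = "idle" | "turn-cadence" | "wall-clock" | null;

// Returns why a pass should run now, or null if it should not.
function shouldRunPass(s: TriggerState, c: TriggerConfig, now: number): TriggerReason {
  if (s.turnsSinceLastPass === 0) return null; // nothing new to consolidate
  if (now - s.lastUserTurnTs >= c.idleThresholdMs) return "idle";
  if (s.turnsSinceLastPass >= c.maxTurns) return "turn-cadence";
  if (now - s.lastPassTs >= c.maxWallClockMs) return "wall-clock";
  return null;
}
```

Returning the reason rather than a boolean is a small design choice that pays off in the logs: a tier that only ever fires on `"turn-cadence"` is telling you the idle threshold is too strict for this user. Note that the two fallback triggers fire while the user is active, so the pass they start should still run off the request path.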

The cache-bloat-from-stale-predictions trap. The pre-computation cache grows with every sleep-time pass; without a TTL or LRU eviction, it bloats indefinitely and the lookup cost itself becomes a hot-path drag. The fix is standard cache hygiene: TTL on entries, LRU eviction, and a size cap. Defensible defaults are a 24-hour TTL, a 10k-entry cap, and LRU eviction.
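All three hygiene rules fit in one small class. The sketch below is an assumption-laden illustration rather than a production cache: `BoundedCache` is a hypothetical name, keys are exact-match strings, and LRU order relies on the fact that a JavaScript `Map` iterates in insertion order.

```typescript
class BoundedCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private maxEntries: number;
  private ttlMs: number;

  // Defaults follow the article: 10k entries, 24-hour TTL.
  constructor(maxEntries = 10_000, ttlMs = 24 * 60 * 60 * 1000) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
  }

  get size(): number {
    return this.entries.size;
  }

  get(key: string, now: number = Date.now()): V | undefined {
    const hit = this.entries.get(key);
    if (!hit) return undefined;
    if (hit.expiresAt <= now) {
      this.entries.delete(key); // expired: evict lazily on read
      return undefined;
    }
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, hit);
    return hit.value;
  }

  set(key: string, value: V, now: number = Date.now()): void {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: now + this.ttlMs });
    // Evict from the front of the Map, which holds the least recently used keys.
    while (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next().value;
      if (oldest === undefined) break;
      this.entries.delete(oldest);
    }
  }
}
```

Expired entries are only removed when they are read or pushed out by the size cap, so the cap is what actually bounds memory; the TTL bounds how stale a served answer can be.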

The “we’ll add sleep-time compute in v2” trap. Sleep-time compute is one of the operations that’s hard to bolt on after the fact — the data structures, the cache surfaces, the model-routing logic, and the idle-detection hooks all touch the harness. Teams that defer it usually find that the v1 hot-path harness has implicit assumptions (everything runs in one process, every operation is synchronous, all writes happen on the test-time path) that don’t survive contact with a sleep-time tier. The architectural cost of adding sleep-time compute later is often higher than the cost of building it in from the start with a no-op worker.

The maintenance-blind-spot bug. Sleep-time compute moves the maintenance work out of sight of the hot path’s logs and tracing. A test-time path that used to fail loudly because the embedding model went stale now silently serves slightly-worse retrievals because the sleep-time embedding-regeneration job has been failing for a week. Instrument the sleep-time tier as carefully as the hot-path tier: every pass logged, every write counted, every failure alerted. The hot-path tracing is what alerts you to test-time bugs; the sleep-time tracing is what alerts you to the bugs the hot path is now hiding.
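One lightweight way to get “every pass logged, every failure alerted” is a monitor that records each pass and raises two kinds of alert: too many consecutive failures, and too long since the last success. Everything here is a hypothetical sketch; the names and thresholds are illustrative, and a real deployment would forward the alerts to whatever paging or metrics system the hot path already uses.

```typescript
interface PassRecord {
  ts: number;      // ms epoch when the pass finished
  ok: boolean;     // whether the pass completed without error
  written: number; // consolidations written by this pass
  error?: string;
}

class SleepTimeMonitor {
  private records: PassRecord[] = [];

  record(r: PassRecord): void {
    this.records.push(r);
  }

  // Number of failed passes since the most recent success.
  consecutiveFailures(): number {
    let n = 0;
    for (const r of [...this.records].reverse()) {
      if (r.ok) break;
      n += 1;
    }
    return n;
  }

  lastSuccessTs(): number | null {
    for (const r of [...this.records].reverse()) {
      if (r.ok) return r.ts;
    }
    return null;
  }

  // Returns human-readable alerts; an empty array means the tier looks healthy.
  // A monitor with no successful pass on record alerts immediately.
  alerts(now: number, maxSilenceMs: number, maxFailures: number): string[] {
    const out: string[] = [];
    const failures = this.consecutiveFailures();
    if (failures >= maxFailures) {
      out.push(`${failures} consecutive failed passes`);
    }
    const last = this.lastSuccessTs();
    if (last === null || now - last > maxSilenceMs) {
      out.push("no successful pass within the silence window");
    }
    return out;
  }
}
```

The silence check is the one that catches the week-long quiet failure described above: a worker that has stopped running records no failures at all, so only the absence of a recent success gives it away.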

Further reading

  • Sleep-time Compute: Beyond Inference Scaling at Test-time — Lin, Snell, Wang, Packer, Wooders, Stoica, Gonzalez (April 2025) — the foundational paper. The test-time-vs-sleep-time formulation, the Stateful GSM-Symbolic and Stateful AIME benchmarks, and the Pareto-frontier framing are all introduced here. The 5x compute reduction and 18% accuracy improvement numbers are the headline results worth internalizing.
  • Sleep-time Compute — Letta (2025) — the productization writeup that pairs the paper with the Letta 0.7+ multi-agent architecture. The two-agent harness, the shared memory-block model, the sleeptime_agent_frequency parameter, and the cost discipline (cheap model on the sleep-time tier) are all named here. The most practical reading on how to actually deploy the pattern.
  • Continual Learning in Token Space — Letta (2025) — the broader framing of why sleep-time compute matters: “updates to learned context, not weights, should be the primary mechanism for LLM agents to learn from experience.” Names the three advantages (interpretability, portability, control) and positions sleep-time compute as one of the load-bearing mechanisms of token-space learning. Worth reading alongside the original paper.
  • LightMem: Lightweight and Efficient Memory-Augmented Generation — Fang et al. (October 2025) — an offline-consolidation pipeline that explicitly decouples consolidation from online inference, with reported token-usage reductions of up to 117× and runtime reductions of over 12×. The Atkinson-Shiffrin-inspired three-stage architecture (sensory → topic-aware short-term → offline long-term) is a useful contrast to the Letta two-agent shape.
  • Claude Code Auto Dream Explained: Memory Like REM Sleep — the practitioner-facing writeup of Claude Code’s auto-dream feature. The four-phase consolidation cycle (scan → search transcripts → merge → trim) and the 24-hour-plus-5-sessions trigger cadence are the cleanest public examples of a sleep-time consolidation pass shipped in a production developer tool. Pairs naturally with the Letta architecture as the single-agent-with-restricted-mode counterpart.
  • Summarization and Context Compression — the most-common operation that runs on the sleep-time tier. Compression is one of the heaviest maintenance passes in the memory subsystem; running it at sleep time rather than on the hot path is the canonical use case for the architecture this article covers.
  • Reflection: From Experiences to Beliefs — the second canonical sleep-time operation. The Generative Agents reflection loop is expensive enough that running it on the hot path is rarely defensible; sleep-time scheduling is the production answer.
  • Long-Term Memory: Vector-Backed Episodic Storage — the substrate sleep-time work operates on. The episodic store is the table; reflections and consolidations are the materialized views; sleep-time is when the views get refreshed.
  • Memory Write Policies: What’s Worth Remembering — the upstream layer. The write policy’s hot-path-vs-deferred-vs-background trade-off is the same shape as the test-time-vs-sleep-time trade-off, applied to the write path rather than the maintenance path.