
Procedural Memory and Skill Caching

Procedural memory for AI agents: caching successful action sequences as a JIT-compiled-routine store. Voyager, AWM, LangMem, Agent Skills.

Jatin Bansal@blog:~/ai-engineering$ open procedural-memory

A coding agent has been running for the same team for three months. The episodic store holds 5,800 segmented episodes; the semantic store has 312 distilled facts about the team’s stack, the team’s preferences, and the team’s repos. Recall is high on both tiers. The agent can answer “when did we last deploy the billing service?” (episodic) and “what test command does this team run before merging?” (semantic) without breaking stride. And it still re-derives, every Monday morning, the five-step deployment ritual it has executed forty-seven times this quarter. Every single Monday, the agent reads the deploy README from scratch, asks the user what flags to pass to the migration tool, and discovers (again) that the staging healthcheck takes 45 seconds to settle. The user has stopped being polite about it. The fix is not more episodic memory, not more semantic memory, not a bigger context window, and not a smarter model. The fix is the tier the agent doesn’t have: a place to cache the procedure itself — the sequence of tool calls that succeeded last time — keyed by the shape of the task, retrieved at the recognition moment, injected as a candidate plan before the model starts thinking. That tier is procedural memory, and this article is the deep dive on it.

Opening bridge

The cognitive taxonomy named procedural memory as the fourth of the four CoALA types — alongside working, episodic, and semantic — and gave it the JIT-compiled-routine-cache parallel, but the deep dive went to the other three tiers. The memory write policies article mentioned that procedural writes are success-gated and slow, but treated the gate as an aside on the general write pipeline. The hierarchical memory article noted that “the agent’s procedural-memory equivalent (cached recipes, successful action sequences) usually lives” in the cold archival tier, without spelling out the mechanics. Today’s piece pulls those threads together. Procedural memory is the agent’s learned-skills layer, structurally separate from the conversation history and the world facts because its lifecycle is different — written rarely (after a success has been observed), read frequently (every time a task with the right shape appears), keyed by task shape rather than by content. The mental model the rest of the article works from: procedural memory is to the agent loop what a JIT compiler’s compiled-code cache is to a virtual machine. Compile the hot path once; retrieve and inline it next time; skip the cost of re-deriving it from source.

Definition

Procedural memory is the storage tier that holds cached, replayable action sequences indexed by the shape of the task that succeeded with them. Three properties distinguish a procedure from any other memory type. First, it is executable — the value isn’t a fact or an observation but a sequence of tool calls, code, or steps that the agent can replay. Second, it is success-anchored — every stored procedure points back to a verified successful execution, with enough context (inputs, environment, outcome metric) to know what counts as “the same task” on re-retrieval. Third, it is task-shape-keyed, not content-keyed — the retrieval key is the embedding of the natural-language description of what kind of task this procedure solves, not the embedding of the procedure’s text itself. Mix up any of the three properties and you’ve built something else (a code search index, an episode log, a knowledge base).

What procedural memory is not. It is not tool selection at scale — that’s about picking a single tool from a large catalog at the moment of the next call; procedural memory is about retrieving a multi-step recipe of tool calls that succeeded last time. It is not reflection — reflection produces beliefs from episodes; procedural memory produces cached programs from episodes, and the read path consumes them differently (a belief substitutes for reasoning; a procedure substitutes for re-derivation). It is not the agent harness’s prompt template — the prompt template is the agent’s fixed instructions; procedural memory is the dynamic, learned, success-conditioned overlay on top of those instructions. And it is not just a special-case knowledge base — a knowledge base answers “what’s true?” and a procedural store answers “what’s the next move that worked here last time?” The verbs are different; the indexing is different; the read pattern is different.

Intuition

The mental model that pays off is the JIT compiler’s compiled-code cache. A virtual machine (JVM HotSpot, V8, .NET CLR) does not just interpret bytecode every time it sees a method. It tracks how often each method runs; once the count crosses a threshold (HotSpot’s classic non-tiered default for the server compiler, C2, is 10,000 invocations; tiered compilation uses lower per-tier thresholds), the JIT compiles the bytecode into native machine code and stores it in a code cache keyed by method signature. Subsequent calls to that signature retrieve the compiled native code directly and skip the interpreter entirely. The compile is expensive; the cache hit is essentially free. The trade-off only pays off because (a) the same method is called many times and (b) the compiled output is correct enough to substitute for the original. If either fails, the JIT is wasted work.

Agent procedural memory recapitulates this almost exactly, with one substitution: the “method” is a task description and the “compiled code” is the successful action sequence. The agent observes that the deployment task gets requested every Monday; the harness records the successful action sequence after the first success; the cache stores the procedure indexed by an embedding of “deploy the billing service to production.” Next Monday, the harness retrieves the procedure on a similarity match and injects it into the prompt before the model starts reasoning. The model doesn’t have to re-derive the steps. The latency of the embedding lookup is amortized against the much larger latency of the model walking the deploy README and re-deriving the procedure from first principles. The cost ratio is similar to JIT — interpretation/derivation is ~100× slower than cache-hit execution/injection — and the cache-hit threshold (“after how many successes is it worth caching?”) is the design knob that determines whether the tier pays off in your workload.
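The threshold question can be framed as a break-even calculation. The sketch below is only an illustration: the function names and the cost figures (60 seconds to re-derive, 0.6 seconds to inject, 0.3 seconds per lookup) are assumptions chosen to mirror the ~100× ratio, not measurements from any system.

```python
def expected_saving_per_task(hit_rate: float, derive_cost_s: float,
                             inject_cost_s: float, lookup_cost_s: float) -> float:
    """Expected seconds saved per incoming task when the store is consulted.

    Every task pays the lookup; only cache hits swap derivation for injection.
    """
    return hit_rate * (derive_cost_s - inject_cost_s) - lookup_cost_s


def break_even_hit_rate(derive_cost_s: float, inject_cost_s: float,
                        lookup_cost_s: float) -> float:
    """Hit rate above which consulting the store is a net win."""
    return lookup_cost_s / (derive_cost_s - inject_cost_s)


# Assumed, illustrative costs: derivation is ~100x slower than injection.
DERIVE_S, INJECT_S, LOOKUP_S = 60.0, 0.6, 0.3

saving_at_half = expected_saving_per_task(0.5, DERIVE_S, INJECT_S, LOOKUP_S)
min_hit_rate = break_even_hit_rate(DERIVE_S, INJECT_S, LOOKUP_S)
```

With numbers in this range the break-even hit rate is well under 1%, which is why the interesting risk is rarely the lookup cost and almost always the correctness of what gets retrieved.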

Three intuitions worth carrying through the rest of the article.

Procedures are programs, not data. A stored procedure is closer to a function than to a fact. The read path retrieves it and the agent executes it (or composes it with other procedures, or adapts it). Treating the procedural store as “just another vector index over text” misses the structural property that makes the tier useful — the stored object is operational. The right substrate often involves typed schemas (description + steps + preconditions + postconditions), not just a free-text blob.

Success-gating is the bouncer. Every other memory tier admits writes liberally; procedural memory is the strictest of the four. A procedure that gets cached after a single noisy success becomes a footgun the next time the harness retrieves it. The cleanest empirical evidence is from Voyager (Wang et al., 2023), which gets away with single-success writes because Minecraft is deterministic enough that one success is strong signal; the same policy in a fuzzier domain (customer-support workflows, code review, multi-tenant deployments) produces a corrupted store. The bouncer at the door is the difference between a skill library and a junk drawer.

Task-shape keying is the design surface. The hardest design decision in the tier is the embedding of what. If you embed the procedure’s text, you retrieve on textual similarity to what you executed last time — useless when the new task is described differently but is structurally the same. If you embed the original user request verbatim, you retrieve on surface words — “deploy the user service” and “ship the user service” miss each other. The trick is to embed a normalized task signature — a model-generated abstract description of “what kind of task is this” — and that normalization itself is where most production procedural stores succeed or fail.

The cognitive grounding — declarative vs. non-declarative memory

The cognitive science here is precise and well-validated. Larry Squire’s declarative/non-declarative split (1992) is the canonical distinction. Declarative memory (Tulving’s episodic + semantic) is “knowing that” — propositional knowledge you can verbalize. Non-declarative memory — which includes procedural memory, motor skills, priming, and habit — is “knowing how,” and the defining property is that it doesn’t require conscious retrieval. You don’t recite the steps of riding a bicycle before executing them; the motor program activates from the situational cue. The brain-systems evidence is overwhelming. Amnesic patients with hippocampal lesions can lose declarative memory entirely (they can’t remember anything new from yesterday) while retaining procedural memory — they can learn new motor skills, even though they can’t remember the practice sessions. The two systems are physically separate (medial temporal lobe vs. striatum/cerebellum) and the dissociation is bidirectional.

Two implications port directly to agent architectures.

Procedural memory is recognition-triggered, not recall-triggered. Declarative memory is queried — “what was the deploy command?” Procedural memory activates — the situation cues the program, and execution begins. The agent-architecture parallel: the procedural store’s retrieval should fire automatically when a new task arrives, before the model decides whether to reason or to act. The harness inserts the top-K matching procedures into the prompt as candidate plans, the model picks one (or composes from them), and execution proceeds. Putting procedural retrieval behind an explicit “should I look this up?” decision is closer to declarative-memory access and forfeits the latency win. The pattern is the same as a CPU’s branch predictor: speculate that you’ll need the procedure, prefetch it, pay the small cost on misses to save the big cost on hits.

Skill compilation is gradual, not atomic. Motor-learning research (Fitts’s stages, Anderson’s ACT-R) is consistent: a skill goes from declarative (explicit step-by-step) to associative (chunked but still attended) to autonomous (automatic, low-attention) over many practice instances. The early stage is where errors are caught; the late stage is where the program runs without interruption. Agent procedural memory has the same lifecycle. A first successful trajectory enters as a candidate procedure; after N additional retrievals-and-successes it gets promoted to the eligible-for-injection tier; after M further successes it gets the high-confidence tag that lets the agent skip the verification step on retrieval. Skipping the gradual compilation and writing every first success directly to the high-confidence tier produces a store that’s confident about brittle procedures — the agent-memory analogue of a JIT that promotes methods to the hottest tier after one call.
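The three-stage lifecycle can be sketched as a small state machine. This is a minimal sketch under stated assumptions: the tier names, the values of N and M, and the demote-to-candidate rule on failure are all illustrative choices, not taken from any framework.

```python
from dataclasses import dataclass

CANDIDATE, ELIGIBLE, HIGH = "candidate", "eligible", "high_confidence"
N_TO_ELIGIBLE = 2  # assumed: successes needed before injection is allowed
M_TO_HIGH = 5      # assumed: further successes before verification is skipped


@dataclass
class SkillRecord:
    description: str
    tier: str = CANDIDATE
    successes_in_tier: int = 0


def record_outcome(rec: SkillRecord, succeeded: bool) -> SkillRecord:
    """Update the tier after one retrieval-and-execution of the procedure."""
    if not succeeded:
        # Assumed policy: any failure sends the procedure back to candidate.
        rec.tier, rec.successes_in_tier = CANDIDATE, 0
        return rec
    rec.successes_in_tier += 1
    if rec.tier == CANDIDATE and rec.successes_in_tier >= N_TO_ELIGIBLE:
        rec.tier, rec.successes_in_tier = ELIGIBLE, 0
    elif rec.tier == ELIGIBLE and rec.successes_in_tier >= M_TO_HIGH:
        rec.tier, rec.successes_in_tier = HIGH, 0
    return rec


def injectable(rec: SkillRecord) -> bool:
    """Only promoted procedures are offered to the model as candidate plans."""
    return rec.tier in (ELIGIBLE, HIGH)


def needs_verification(rec: SkillRecord) -> bool:
    """Everything below the top tier is verified after execution."""
    return rec.tier != HIGH
```

The failure rule is the part most worth tuning per workload; a softer variant would demote one tier at a time instead of resetting.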

The SOAR ancestor — chunking as the original procedural-memory mechanism

The agent-memory community is rediscovering what the cognitive-architecture community had in the 1980s. John Laird’s SOAR architecture — and the long retrospective in Laird’s 2022 paper — has had a procedural-memory mechanism, called chunking, since Laird, Rosenbloom, and Newell (1986). The mechanism is precise: when SOAR solves a subgoal by working through it explicitly, the architecture compiles the successful problem-solving steps into a production rule (an if-then rule) and adds it to long-term procedural memory. The next time a similar subgoal appears, the rule fires directly and the explicit problem-solving is skipped. SOAR’s chunking documentation walks through the mechanism in detail.

Three properties of SOAR chunking carry over to LLM-era procedural memory.

The unit of compilation is the subgoal, not the whole task. SOAR doesn’t chunk an entire problem-solving session into one rule; it chunks the subgoal-level pieces that resolved sub-impasses. The LLM-era port: a stored procedure should be a coherent sub-task (deploy a service, onboard a user, debug a 5xx), not an entire conversation. Whole-conversation procedures don’t generalize; sub-task procedures do.

The compiled rule is conditioned on what was actually attended to. SOAR’s chunker only includes in the rule the working-memory elements that were used in solving the subgoal, not every element present. The result is general — the rule fires in any context where the same essentials hold, even if the surface context differs. The LLM-era port: when distilling a procedure from a successful episode, extract only the load-bearing inputs and conditions, not the full trajectory. A procedure that’s over-conditioned (“deploy when the user is logged in from Slack on Monday afternoon”) generalizes worse than one that’s appropriately abstracted (“deploy when an authorized user issues the deploy command”).
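One way to approximate "only what was attended to" is to keep the context keys that some step in the trace actually read. The sketch assumes the harness records a `reads` list per step; that field and the function name are hypothetical, and a real system would more likely ask a model to do this extraction.

```python
def distill_conditions(context: dict, trace: list[dict]) -> dict:
    """Keep only the context entries that at least one step in the trace read."""
    used: set[str] = set()
    for step in trace:
        used.update(step.get("reads", []))
    return {key: value for key, value in context.items() if key in used}


# Everything that happened to be true when the deploy succeeded.
context = {
    "channel": "slack",
    "weekday": "monday",
    "user_role": "authorized_deployer",
    "command": "deploy",
}

# Hypothetical trace format: each step lists the context keys it consulted.
trace = [
    {"action": "check_permissions", "reads": ["user_role"]},
    {"action": "parse_request", "reads": ["command"]},
    {"action": "run_deploy", "reads": []},
]

conditions = distill_conditions(context, trace)
```

The channel and the weekday drop out because no step depended on them, which is the over-conditioning guard in miniature.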

Over-eager chunking is a known failure mode. SOAR’s early implementations would chunk every subgoal solution, producing a rule explosion that slowed retrieval more than it sped up reasoning. The fix was chunking guards: heuristics on when chunking pays off. The LLM-era equivalent is the success-gate plus the repetition threshold: don’t inject a candidate procedure after a single success; don’t promote it to high-confidence after a single retrieval. The SOAR community spent two decades calibrating these guards; the LLM community is repeating the same calibration with the same mistakes.

The distributed-systems parallel — JIT compilation, memoization, and capability discovery

Three parallels, each load-bearing.

JIT compiler code cache. The mechanic from the Intuition section. The agent’s procedural store is a JIT compiled-code cache for the agent loop, with a longer reuse horizon and a fuzzier match function. The two design surfaces are the same: when to compile (the threshold; in HotSpot, default 10,000 method invocations for C2; in agents, often 1-3 successful executions), and what to invalidate (when does a cached procedure go stale?). The invalidation problem is the cache-coherence problem, and it’s load-bearing for both: a JIT that doesn’t invalidate when the class hierarchy changes produces wrong answers; an agent procedural store that doesn’t invalidate when the environment changes (the build system was upgraded, the API key rotated, the deploy target moved) produces confidently wrong actions. The mitigation in both cases is the same: stamp each cached entry with the version of its dependencies, and invalidate on version mismatch.

Memoization with cache invalidation. A pure memoization cache is keyed by (function, args) and is correct because pure functions are deterministic. Agent procedures are not pure — the environment changes, the user changes their mind, the API behavior drifts. The right port isn’t pure memoization but memoization with explicit invalidation triggers: the cache key includes a description of the relevant environment state, and a background pass or an on-execution check invalidates entries when the environment they depended on has moved. The pattern is identical to a React useMemo with a dependency array — the array is the contract for “what does my correctness depend on?” — and getting the dependency-list right is the most subtle part of the design. The memory conflict and forgetting article discussed the analogous problem on the semantic side; the procedural-store version is harder because the dependencies are richer (a code procedure depends on the runtime; a deploy procedure depends on the target environment).

Capability discovery and progressive disclosure. Distributed services advertise their capabilities through discovery protocols (Consul, Eureka, mDNS) — when a new client comes online, it learns what’s available without having to know every endpoint up front. Anthropic’s Agent Skills (announced October 16, 2025) ports this pattern to procedural memory through a mechanism Anthropic calls progressive disclosure: each skill is a folder with a SKILL.md file that begins with YAML frontmatter — name and description — plus an optional body and bundled files. At session start, only the name and description of every skill are loaded into the system prompt; the agent decides whether a skill is relevant on metadata alone. If relevant, the full SKILL.md body is loaded into context. If the body references additional files (a playbook.md, a scripts/ directory), those load only when needed. Three tiers, each only paying its cost when the level above it has earned the load. The distributed-systems parallel: the service registry tells you what exists cheaply; the service description tells you how to call it on demand; the implementation details are deferred until execution time. Agent Skills is the production-grade port of capability discovery for procedural memory, and its progressive-disclosure shape is the right answer to the “how do you have 100 skills without burning 100 skills’ worth of context every turn?” problem.
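The three loading levels can be illustrated with plain file reads. This is a toy sketch of the pattern, not Anthropic's loader: the frontmatter parser only handles flat `key: value` lines (a real implementation would use a YAML parser), and the `deploy-billing` skill with its file contents is a made-up example.

```python
from pathlib import Path
import tempfile


def read_frontmatter(skill_md: Path) -> dict[str, str]:
    """Level 1: parse only the flat `key: value` lines between the '---' markers."""
    meta: dict[str, str] = {}
    lines = skill_md.read_text().splitlines()
    if not lines or lines[0].strip() != "---":
        return meta
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta


def read_body(skill_md: Path) -> str:
    """Level 2: the instructions after the frontmatter, loaded once a skill is chosen."""
    parts = skill_md.read_text().split("---", 2)
    return parts[2].strip() if len(parts) == 3 else parts[0].strip()


def read_resource(skill_dir: Path, name: str) -> str:
    """Level 3: a bundled file, read only when the body's instructions call for it."""
    return (skill_dir / name).read_text()


def build_index(skills_root: Path) -> list[dict[str, str]]:
    """What the system prompt pays for every turn: name and description only."""
    return [read_frontmatter(p) for p in sorted(skills_root.glob("*/SKILL.md"))]


def make_demo_skill(root: Path) -> Path:
    """Write a hypothetical skill folder for the example."""
    skill_dir = root / "deploy-billing"
    skill_dir.mkdir(parents=True)
    (skill_dir / "SKILL.md").write_text(
        "---\n"
        "name: deploy-billing\n"
        "description: Deploy the billing service to production\n"
        "---\n"
        "Follow playbook.md step by step.\n"
    )
    (skill_dir / "playbook.md").write_text("1. Run migrations\n2. Wait for the healthcheck\n")
    return skill_dir
```

Calling `build_index` on a directory of such folders touches only the frontmatter; the body and the playbook stay on disk until something asks for them.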

The Voyager skill library — the canonical reference implementation

Voyager (Wang, Xie, Jiang et al., 2023) is the cleanest production-shaped example of a procedural-memory tier and is the implementation every LLM-era skill library is in conversation with. The mechanism, in five steps.

Step 1 — Skill proposal. Voyager’s automatic curriculum generates a task (“mine 5 iron ore”) and proposes JavaScript code (using the Mineflayer bot APIs) that should accomplish it.

Step 2 — Iterative refinement via execution feedback. The proposed code runs in the Minecraft environment. Execution errors, environment feedback (inventory state, world state), and a self-verification module are fed back into the prompt for another round of code generation. The loop runs until the task succeeds or a step budget exhausts.

Step 3 — Skill admission on verified success. When the self-verification module confirms success, Voyager adds the successful JavaScript program to the skill library. The library is keyed by an embedding of the natural-language description of the skill (generated by GPT-3.5 as a separate description-generation step), not by the code’s text. The value is the JavaScript code itself.

Step 4 — Skill retrieval at the start of a new task. For each new task, Voyager queries the skill library with the embedding of the task’s plan-and-environment-feedback string and retrieves the top-5 most relevant skills. The retrieved skills are inserted into the GPT-4 prompt as code examples — “here are skills that succeeded for similar tasks; use them as building blocks.”

Step 5 — Skill composition. New skills built by the agent often call previously stored skills as subroutines, so the library compounds: a high-level skill (“build a stone pickaxe”) is implemented as a sequence of lower-level skill calls. The library grows from a few primitive skills to hundreds of composed routines over a Voyager run.
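The loop in steps 1 to 4 can be sketched in a few dozen lines. This is not Voyager's code: the bag-of-words `embed` stands in for a real embedding model, and `propose`, `run_and_verify`, and `describe` are hypothetical callables the caller supplies in place of the LLM and the Minecraft environment.

```python
import math
from collections import Counter
from typing import Callable, Optional


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag of lowercase words."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SkillLibrary:
    """Keyed by the skill's description; the stored value is the code."""

    def __init__(self) -> None:
        self._skills: list[tuple[str, str]] = []  # (description, code)

    def add(self, description: str, code: str) -> None:
        self._skills.append((description, code))

    def top_k(self, query: str, k: int = 5) -> list[tuple[str, str]]:
        q = embed(query)
        ranked = sorted(self._skills,
                        key=lambda s: cosine(q, embed(s[0])), reverse=True)
        return ranked[:k]


def solve_task(task: str, library: SkillLibrary,
               propose: Callable[[str, list, str], str],
               run_and_verify: Callable[[str], tuple[bool, str]],
               describe: Callable[[str, str], str],
               max_rounds: int = 4) -> Optional[str]:
    examples = library.top_k(task)                   # step 4: retrieve building blocks
    feedback = ""
    for _ in range(max_rounds):
        code = propose(task, examples, feedback)     # step 1: propose code
        ok, feedback = run_and_verify(code)          # step 2: execute, collect feedback
        if ok:
            library.add(describe(task, code), code)  # step 3: admit on verified success
            return code
    return None                                      # budget exhausted, nothing stored
```

The detail worth noticing is that `add` is reachable only from the verified-success branch: a failed or budget-exhausted attempt leaves the library untouched.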

Voyager’s reported results are the empirical case that procedural memory pays off. The paper reports that Voyager obtains 3.3× more unique items, traverses Minecraft’s tech tree milestones up to 15.3× faster, and generalizes to fresh Minecraft worlds in ways the baseline (no skill library) does not. The mechanism is the JIT-cache parallel made concrete: the first iron-mining task takes the model many code-generation iterations to solve; the second iron-mining task retrieves the cached skill and executes it in one shot.

Three Voyager design choices worth flagging for production ports.

Embedding the description, not the code. The retrieval key is the embedding of “mine iron ore in the overworld” — the task description. Voyager generates the description separately (a description-extraction LLM call after the code succeeds) precisely because the code itself is the wrong retrieval key. Procedures with semantically equivalent intent but different code shouldn’t cluster apart in embedding space; procedures with similar code but different intent shouldn’t cluster together. The description is the abstract over which similarity actually means something.

No repetition threshold. Voyager writes every first success, no waiting. This is the right call in Minecraft (a single self-verified success is reliable signal in a deterministic environment) and the wrong call in fuzzier domains. A customer-support agent shouldn’t cache a deeply-conditioned, single-success procedure as if it were a general skill. Production ports almost always add a repetition threshold of 2-3 successes before promotion.

Composition over re-derivation. Voyager’s most compounding design choice is that new skills call old skills. The library isn’t a flat set of recipes; it’s a hierarchy where “build a stone pickaxe” calls “mine wood” and “craft sticks” and “mine stone” as primitives. The LLM-era extension — function-call composition over a typed skill registry — is the property that turns a flat cache into a library, and it’s what makes the asymptotic cost of solving a complex task drop over time.
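Composition is easy to see in a toy registry where a high-level skill is just a sequence of calls into lower-level ones. The registry, the decorator, and the item quantities below are illustrative assumptions (they are not real Minecraft recipes and not Voyager's JavaScript).

```python
from typing import Callable

Inventory = dict[str, int]
SKILLS: dict[str, Callable[[Inventory], None]] = {}


def skill(name: str):
    """Register a function in the skill registry under a description-like name."""
    def register(fn: Callable[[Inventory], None]):
        SKILLS[name] = fn
        return fn
    return register


@skill("mine wood")
def mine_wood(inv: Inventory) -> None:
    inv["wood"] = inv.get("wood", 0) + 3


@skill("craft sticks")
def craft_sticks(inv: Inventory) -> None:
    inv["wood"] -= 1
    inv["sticks"] = inv.get("sticks", 0) + 2


@skill("mine stone")
def mine_stone(inv: Inventory) -> None:
    inv["stone"] = inv.get("stone", 0) + 3


@skill("build a stone pickaxe")
def build_stone_pickaxe(inv: Inventory) -> None:
    # The composed skill re-derives nothing: it calls stored skills by name.
    for dependency in ("mine wood", "craft sticks", "mine stone"):
        SKILLS[dependency](inv)
    inv["sticks"] -= 2
    inv["stone"] -= 3
    inv["stone_pickaxe"] = inv.get("stone_pickaxe", 0) + 1


inventory: Inventory = {}
SKILLS["build a stone pickaxe"](inventory)
```

Because the composed skill looks its dependencies up by name at call time, improving "mine stone" later improves every skill built on top of it.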

Agent Workflow Memory — workflows over web tasks

Agent Workflow Memory (Wang, Mao, Fried, Neubig, 2024) is the cleanest port of Voyager’s idea to the web-agent setting, and the methodology is interesting enough to be worth its own walk-through. The setting is web navigation — Mind2Web and WebArena, where an agent issues clicks, types, and form submissions to complete a goal. (The dedicated piece on computer-use and browser agents walks through the live-DOM and screenshot-loop variants that make today’s agents possible on those benchmarks.) The challenge: the agent’s actions are not strictly composable in the Voyager sense (you can’t trivially “compose two click sequences”) because each web app has its own DOM structure and side effects.

AWM’s response is to induce, after each successful task, a higher-order workflow — a generalized pattern abstracted away from the specific URLs, IDs, and DOM selectors of the trial — and to store the workflow as a natural-language procedure rather than as raw code. The workflow includes:

  • A description of the task class (the retrieval key).
  • The sequence of action types (type into search, click result link, extract field).
  • The variables that need to be filled in at execution time.
  • The conditions under which this workflow applies.

At test time, the agent retrieves the most-similar workflow for a new task and uses it as a plan template — the steps are concrete enough to follow but abstract enough to apply across different sites. AWM reports relative success-rate improvements of 24.6% on Mind2Web and 51.1% on WebArena over the no-workflow baseline.
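The four workflow components map naturally onto a small record type. The sketch below is an assumed representation for illustration, not AWM's actual storage format; the field names and the `$variable` placeholder syntax (from Python's `string.Template`) are choices made here.

```python
from dataclasses import dataclass
from string import Template


@dataclass
class Workflow:
    task_class: str          # description of the task class (the retrieval key)
    steps: list[str]         # action types, with $placeholders for variables
    variables: list[str]     # what must be filled in at execution time
    applies_when: list[str]  # conditions under which the workflow applies

    def instantiate(self, **values: str) -> list[str]:
        """Fill the placeholders to get a plan for one concrete task."""
        missing = [v for v in self.variables if v not in values]
        if missing:
            raise ValueError(f"missing variables: {missing}")
        return [Template(step).substitute(values) for step in self.steps]


lookup_field = Workflow(
    task_class="look up a field on an item's detail page",
    steps=[
        "type $query into the search box",
        "click the first result link",
        "extract the $field field",
    ],
    variables=["query", "field"],
    applies_when=["the site has a search box", "results link to detail pages"],
)

plan = lookup_field.instantiate(query="usb-c cable", field="price")
```

The plan is still natural language: the agent follows its structure and chooses each concrete click against the live page, which is the point of storing a template rather than a transcript.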

Two AWM design choices worth carrying forward. First, workflows are natural-language plan templates, not code. This is a deliberate concession to the fuzzier domain — the agent isn’t going to literally replay clicks because clicks depend on the DOM at execution time; it’s going to follow the plan structure and re-decide each click against the live page. The procedural store’s value is the plan, not the transcript. Second, abstraction is the value-add, not the storage. AWM’s contribution isn’t the storage layer (which is just a vector index over plan descriptions) but the induction step that turns a successful trial into a generalized workflow. The same shape — induce-then-store — is the right pattern for any non-deterministic domain where literal replay won’t work.

LangMem and Agent Skills — production-shaped procedural memory in 2026

The two most important production frameworks for procedural memory in the current landscape are LangMem and Anthropic’s Agent Skills, and they take notably different shapes.

LangMem’s procedural memory treats procedure as prompt rules updated over time. The framework’s prompt-optimization API explicitly frames procedural memory as “system-prompt rules that get refined based on agent performance.” Three algorithms — metaprompt (reflects on conversations and proposes prompt updates), gradient (separate critique and proposal steps), and prompt_memory (a simpler heuristic) — generate proposed edits to the agent’s system prompt, which the developer applies. The model is that the agent’s behavior procedure (how to respond, what to prioritize, what to avoid) is stored as English rules in the prompt, and the LLM gradually rewrites those rules based on success and feedback signals. This is a different angle on procedural memory than Voyager’s executable-code-cache — it’s rule procedural memory, in CoALA terms, not recipe procedural memory — and it’s the right fit when the procedure is about how to behave rather than how to act.

Anthropic’s Agent Skills is the production-grade port of Voyager-style executable procedures with three production-shaped concessions. First, skills are author-curated, not auto-induced. A human (or a Claude-generated draft) writes the SKILL.md; the framework doesn’t try to extract skills from interaction trajectories on its own (yet). This is a deliberate trade-off: the auto-induction problem is hard and noisy, and a curated skill library produces reliable retrievals. Second, progressive disclosure replaces top-K retrieval as the cost-control mechanism. Instead of “embed every skill description and retrieve the top-K by similarity,” Agent Skills loads every skill’s name and short description into the system prompt and lets the model decide which skill is relevant — leveraging the model’s reasoning rather than a separate embedding lookup. This works well up to ~100 skills and breaks beyond that, which is the upper bound for a single agent’s library before a router pattern is needed. Third, skills can carry their own resources — bundled files, scripts, lookup tables — that load only when the skill is selected. The third level of progressive disclosure means a 5,000-line playbook.md can sit alongside a 200-word SKILL.md without paying its token cost on every turn.

The two approaches are complementary, not competing. LangMem is the right tool for prompt-rule procedural memory — “always speak in this tone,” “always run X before Y,” “never recommend Z without confirming W.” Agent Skills is the right tool for recipe procedural memory — “to deploy the billing service, follow these steps with these flags.” A production agent often needs both: rule procedural memory to shape the always-on behavior, recipe procedural memory to cache the task-specific skills. The architectural split is the same as a CPU’s L1-instruction-cache (rules — always present, always considered) versus a JIT’d-code-cache (recipes — retrieved when the task matches).

Keying schemes — the make-or-break design decision

The single hardest design decision in a procedural store is the embedding key. Five common schemes, each with a distinct failure mode.

Raw user request. Embed exactly what the user typed. Failure mode: surface-word brittleness. “Deploy the user service” and “ship the user service” miss each other. “Can you deploy the user service?” and “deploy the user service please” might cluster together for the wrong reason. Don’t use this for anything past a prototype.

Procedure text. Embed the code or the step list itself. Failure mode: code clustering on irrelevant similarities. Two procedures with similar control flow but different intents will cluster together; two procedures with the same intent but different implementations will cluster apart. The retrieval ranks on what the procedures look like, not on what they do. Wrong layer of abstraction for procedural memory.

Generated task description. Voyager’s choice. After a procedure succeeds, an LLM generates a normalized natural-language description of “what kind of task is this,” and the embedding of that description is the key. Failure mode: description-generator inconsistency — the same task gets two different descriptions on two different generation passes, and the resulting embeddings don’t cluster. Mitigated by using a single description-generator model with a stable prompt, and (in heavier setups) a regularization step that forces similar tasks toward similar descriptions.

Task-signature schema. A structured, multi-field key: {intent, object, environment, constraints} rather than free-text. The embedding key is then either a concatenated string or a structured retrieval over multiple fields. Failure mode: schema rigidity — workloads with task types the schema didn’t anticipate get key-collisions or are forced into the wrong cell. Mitigated by a flexible “free-text annotations” field that captures schema-overflow without breaking the structured retrieval.
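A structured key of this shape might look like the following sketch. The rendering format is an assumption, the `annotations` field plays the role of the free-text overflow, and `obj` stands in for the schema's "object" field only to avoid shadowing the Python builtin.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSignature:
    intent: str
    obj: str                          # the schema's "object" field
    environment: str
    constraints: tuple[str, ...] = ()
    annotations: str = ""             # free-text overflow for what the schema missed

    def key_string(self) -> str:
        """Deterministic string to embed, or to split for per-field retrieval."""
        parts = [
            f"intent={self.intent}",
            f"object={self.obj}",
            f"environment={self.environment}",
        ]
        if self.constraints:
            # Sort so constraint order never changes the key.
            parts.append("constraints=" + ",".join(sorted(self.constraints)))
        if self.annotations:
            parts.append("notes=" + self.annotations)
        return " | ".join(parts)


# "deploy the user service" and "ship the user service" should both normalize
# to the same signature (the normalization step itself is not shown here).
sig_a = TaskSignature("deploy", "user-service", "production",
                      ("zero-downtime", "run-migrations"))
sig_b = TaskSignature("deploy", "user-service", "production",
                      ("run-migrations", "zero-downtime"))
```

The schema does not remove the normalization problem; it moves it into filling the fields consistently, which is a narrower task than writing a free-text description.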

Hierarchical procedure embedding. A two-level key: a coarse-grained domain embedding (e.g., “deployment-related”) plus a fine-grained sub-task embedding (“deploy the billing service”). Retrieval is a two-step rerank — filter to the coarse cluster, then rank within. Failure mode: cluster-boundary mismatch — a task that’s near the boundary of two coarse clusters gets routed to one and misses procedures in the other. Mitigated by querying both adjacent coarse clusters when the top-1 confidence is low, and by recalibrating cluster boundaries periodically against the empirical task distribution. The 2025 Procedural Memory Retrieval Benchmark (Liu et al.) reports that LLM-generated procedural abstractions outperform embedding-based methods on cross-context transfer — the cleanest evidence yet for the hierarchical-key approach.
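The two-step retrieval and the boundary mitigation can be sketched with toy vectors. Everything here is illustrative: the 2-D vectors stand in for real embeddings, and the `margin` value and the cluster layout are assumptions.

```python
import math

Vec = tuple[float, float]


def cosine(a: Vec, b: Vec) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)
    return dot / norm if norm else 0.0


def retrieve(query: Vec, clusters: dict, k: int = 3, margin: float = 0.1) -> list[str]:
    """Coarse filter by cluster centroid, then fine ranking inside the cluster.

    If the best two clusters score within `margin` of each other, search both,
    so tasks near a cluster boundary do not miss procedures on the other side.
    """
    scored = sorted(((cosine(query, c["centroid"]), name)
                     for name, c in clusters.items()), reverse=True)
    chosen = [scored[0][1]]
    if len(scored) > 1 and scored[0][0] - scored[1][0] < margin:
        chosen.append(scored[1][1])
    candidates = [p for name in chosen for p in clusters[name]["procedures"]]
    candidates.sort(key=lambda p: cosine(query, p[1]), reverse=True)
    return [description for description, _ in candidates[:k]]


clusters = {
    "deployment": {
        "centroid": (1.0, 0.0),
        "procedures": [("deploy the billing service", (0.9, 0.1))],
    },
    "monitoring": {
        "centroid": (0.0, 1.0),
        "procedures": [("check service health", (0.2, 0.9))],
    },
}

clear_hit = retrieve((1.0, 0.05), clusters)  # well inside the deployment cluster
boundary = retrieve((0.7, 0.7), clusters)    # equidistant: both clusters searched
```

The margin trades recall for cost: a wider margin searches more clusters per query and behaves more like a flat index.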

In practice the right scheme is workload-dependent. Voyager runs on the third scheme (generated task description) and gets away with it because Minecraft’s task vocabulary is small and stable. AWM also uses generated descriptions, augmented with the workflow’s task-class description. LangMem’s prompt-rule shape sidesteps the question (the “key” is implicit in which prompt section the rule lives in). Agent Skills uses the fourth (task-signature schema): the name and description are structured fields. A production system in a fuzzy domain (general-purpose assistants, enterprise knowledge work) usually ends up at the fourth or fifth (hierarchical) scheme after one rebuild.

Code: Python — a success-gated skill library with task-shape keying

The smallest interesting build: a procedural store with explicit success-gating, generated-description keying, retrieval-at-task-start, and graceful fallback when a retrieved procedure fails. The example uses Chroma for the embedding store and the Anthropic SDK for the description generation and the agent loop. Install: pip install anthropic chromadb.

python
import json
import time
import uuid
from dataclasses import dataclass, field
from anthropic import Anthropic
import chromadb

client = Anthropic()
chroma = chromadb.Client()
procedures = chroma.get_or_create_collection("procedures")

MODEL = "claude-sonnet-4-5"

@dataclass
class Procedure:
    proc_id: str
    description: str           # normalized task-shape description
    steps: list[str]           # concrete action sequence
    preconditions: list[str]   # what must be true for this to apply
    confidence: float          # promotion tier: 0.3 candidate / 0.7 eligible / 0.95 high
    success_count: int         # number of verified successes
    last_used: float           # for staleness checks
    env_version: str = ""      # dependency stamp; invalidates on env change


def generate_description(task: str, steps: list[str]) -> tuple[str, list[str]]:
    """Distill a normalized task-shape description and preconditions from a successful trace.

    This is the load-bearing step: the description becomes the retrieval key,
    and a poorly-generated one is the entire reason the store doesn't work.
    """
    resp = client.messages.create(
        model=MODEL,
        max_tokens=400,
        messages=[{
            "role": "user",
            "content": f"""I successfully completed this task:
TASK: {task}
STEPS: {json.dumps(steps)}

Generate:
1. A normalized, abstract description of WHAT KIND OF TASK this is
   (e.g., "deploy a service to production environment", not "deploy
   the user service to prod on Monday").
2. A list of preconditions that need to hold for this procedure to apply.

Return JSON: {{"description": "...", "preconditions": ["...", "..."]}}"""
        }],
    )
    text = resp.content[0].text
    parsed = json.loads(text[text.find("{"):text.rfind("}") + 1])
    return parsed["description"], parsed["preconditions"]


def write_procedure_candidate(task: str, steps: list[str], env_version: str = "") -> str:
    """Admit a successful trace as a CANDIDATE procedure.

    Success-gated: only called after the harness has verified the trace succeeded.
    Confidence starts at 0.3 (candidate); promotion to 0.7 happens after 2 more successes.
    """
    desc, preconds = generate_description(task, steps)
    proc_id = f"proc-{uuid.uuid4().hex[:8]}"
    procedures.add(
        documents=[desc],
        metadatas=[{
            "proc_id": proc_id,
            "steps": json.dumps(steps),
            "preconditions": json.dumps(preconds),
            "confidence": 0.3,
            "success_count": 1,
            "last_used": time.time(),
            "env_version": env_version,
        }],
        ids=[proc_id],
    )
    return proc_id


def promote_procedure(proc_id: str):
    """Bump confidence after additional verified successes."""
    existing = procedures.get(ids=[proc_id])
    if not existing["ids"]:
        return
    meta = existing["metadatas"][0]
    meta["success_count"] += 1
    if meta["success_count"] == 3:
        meta["confidence"] = 0.7   # eligible-for-injection
    if meta["success_count"] >= 10:
        meta["confidence"] = 0.95  # high-confidence; skip verification on retrieval
    meta["last_used"] = time.time()
    procedures.update(ids=[proc_id], metadatas=[meta])


def retrieve_procedures(task: str, env_version: str, k: int = 3,
                        min_confidence: float = 0.7) -> list[Procedure]:
    """Recognition-triggered retrieval: runs automatically at task start.

    Filters out stale procedures (env_version mismatch) and procedures below
    the eligibility threshold. Returns the top-K most relevant.
    """
    if procedures.count() == 0:
        return []
    hits = procedures.query(query_texts=[task], n_results=min(k * 3, procedures.count()))
    out: list[Procedure] = []
    for doc, meta, dist in zip(
        hits["documents"][0], hits["metadatas"][0], hits["distances"][0]
    ):
        # cache-coherence check: env mismatch invalidates the entry
        if env_version and meta.get("env_version") and meta["env_version"] != env_version:
            continue
        if meta["confidence"] < min_confidence:
            continue
        out.append(Procedure(
            proc_id=meta["proc_id"],
            description=doc,
            steps=json.loads(meta["steps"]),
            preconditions=json.loads(meta["preconditions"]),
            confidence=meta["confidence"],
            success_count=meta["success_count"],
            last_used=meta["last_used"],
            env_version=meta.get("env_version", ""),
        ))
        if len(out) >= k:
            break
    return out


def run_with_skill_library(task: str, env_version: str = "v1") -> str:
    """Agent loop that consults procedural memory before reasoning from scratch."""
    candidates = retrieve_procedures(task, env_version, k=3)

    if candidates:
        # Inject top procedures as candidate plans
        plans = "\n\n".join(
            f"[procedure {p.proc_id} | confidence {p.confidence:.2f}]\n"
            f"Description: {p.description}\n"
            f"Preconditions: {p.preconditions}\n"
            f"Steps: {p.steps}"
            for p in candidates
        )
        system = (
            "You are an agent with a procedural-memory skill library. "
            "Before reasoning from scratch, consider whether any of the "
            "retrieved procedures matches the task shape. If one does, follow "
            "its steps with appropriate adaptation. If none match, derive from "
            "first principles and the harness will record the new procedure on "
            "verified success.\n\n"
            f"Retrieved procedures:\n{plans}"
        )
    else:
        system = (
            "You are an agent. No procedural matches found for this task; "
            "derive the solution from first principles."
        )

    resp = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": task}],
    )
    return resp.content[0].text


# Example usage
if __name__ == "__main__":
    # Day 1: agent derives the deploy procedure from scratch and succeeds
    deploy_steps_v1 = [
        "checkout main",
        "run pytest -q",
        "build docker image",
        "push to staging registry",
        "wait 45s for staging healthcheck",
        "promote to prod",
        "tag git release",
    ]
    proc_id = write_procedure_candidate(
        task="Deploy the billing service to production",
        steps=deploy_steps_v1,
        env_version="v1",
    )

    # Days 2-4: agent retrieves and re-executes; promote on each success
    for _ in range(2):
        promote_procedure(proc_id)

    # Day 5: a new deploy task arrives; the harness retrieves the cached procedure
    out = run_with_skill_library(
        task="Deploy the analytics service to production",
        env_version="v1",
    )
    print(out)

The shape that matters. First, the write is gated: write_procedure_candidate is only ever called from a code path that has verified the task succeeded — the gate isn’t shown in the snippet because it’s the harness’s responsibility, but the contract is “no success, no write.” Second, the retrieval is at task start, not behind a model-mediated decision. The run_with_skill_library function consults the store before it constructs the prompt; the model gets the candidate plans injected as part of the system message rather than having to call a “look up a procedure” tool. Third, the env-version stamp is the cache-coherence mechanism. When the build system upgrades from v1 to v2, the harness re-stamps the env version, and old procedures fail the version check at retrieval time. Fourth, promotion is gradual. The same procedure goes from candidate (0.3) to eligible (0.7) to high-confidence (0.95) over many successes, mirroring SOAR chunking guards and the JIT compiler’s tiered compilation thresholds.
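The "no success, no write" contract can be sketched as a thin wrapper around the writer. This is a minimal illustration, not the harness itself: `Trace`, `gated_write`, and the two boolean success signals are hypothetical names, and a stand-in writer replaces `write_procedure_candidate` so the sketch runs without a model or a vector store.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Trace:
    """Hypothetical record of one finished task attempt."""
    task: str
    steps: list[str]
    exit_ok: bool          # the final tool call reported success
    checks_passed: bool    # the harness's own post-checks passed


def gated_write(
    trace: Trace,
    write: Callable[[str, list[str]], str],
) -> Optional[str]:
    """Call `write` only for verified successes; return the new id, else None."""
    if not (trace.exit_ok and trace.checks_passed):
        return None                      # no success, no write
    if not trace.steps:
        return None                      # an empty trace is not worth caching
    return write(trace.task, trace.steps)


# Stand-in writer (assumption) so the example has no external dependencies.
written: list[tuple[str, list[str]]] = []


def fake_write(task: str, steps: list[str]) -> str:
    written.append((task, steps))
    return f"proc-{len(written)}"


ok = Trace("deploy billing", ["build", "push"], exit_ok=True, checks_passed=True)
bad = Trace("deploy billing", ["build", "push"], exit_ok=True, checks_passed=False)
first_id = gated_write(ok, fake_write)    # admitted as a candidate
second_id = gated_write(bad, fake_write)  # rejected: post-checks failed
```

Passing the writer in as a parameter keeps the gate testable on its own; which signals count as "verified" is workload-specific and is the part a real harness has to define.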

Code: TypeScript — the same shape with the Vercel AI SDK

The TypeScript port uses the Vercel AI SDK for the model calls and an in-memory store for brevity. The architectural shape is identical to the Python version. Install: npm install ai @ai-sdk/anthropic.

typescript
import { generateText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";

interface Procedure {
  procId: string;
  description: string;
  steps: string[];
  preconditions: string[];
  confidence: number;        // 0.3 candidate / 0.7 eligible / 0.95 high
  successCount: number;
  lastUsed: number;
  envVersion: string;
  // embedding stored separately in a real system; here we use a coarse proxy
  descriptionEmbeddingProxy: string;
}

const store: Procedure[] = [];

// In production use a real embedding model (text-embedding-3-small, voyage-3, etc.)
// and a vector index. Here we use a coarse string proxy for demonstration.
function embedProxy(text: string): string {
  return text.toLowerCase().replace(/[^a-z0-9 ]/g, "").trim();
}

function cosineProxy(a: string, b: string): number {
  // Cheap Jaccard over tokens — replace with cosine over real embeddings
  const sa = new Set(a.split(/\s+/));
  const sb = new Set(b.split(/\s+/));
  const intersection = [...sa].filter(x => sb.has(x)).length;
  const union = new Set([...sa, ...sb]).size;
  return union === 0 ? 0 : intersection / union;
}

async function generateDescription(
  task: string,
  steps: string[],
): Promise<{ description: string; preconditions: string[] }> {
  const { text } = await generateText({
    model: anthropic("claude-sonnet-4-5"),
    prompt: `I successfully completed this task:
TASK: ${task}
STEPS: ${JSON.stringify(steps)}

Generate:
1. A normalized, abstract description of WHAT KIND OF TASK this is.
2. Preconditions for this procedure to apply.

Return JSON: {"description": "...", "preconditions": ["...", "..."]}`,
  });
  const jsonStart = text.indexOf("{");
  const jsonEnd = text.lastIndexOf("}") + 1;
  return JSON.parse(text.slice(jsonStart, jsonEnd));
}

async function writeProcedureCandidate(
  task: string,
  steps: string[],
  envVersion: string,
): Promise<string> {
  const { description, preconditions } = await generateDescription(task, steps);
  const procId = `proc-${crypto.randomUUID().slice(0, 8)}`;
  store.push({
    procId,
    description,
    steps,
    preconditions,
    confidence: 0.3,
    successCount: 1,
    lastUsed: Date.now(),
    envVersion,
    descriptionEmbeddingProxy: embedProxy(description),
  });
  return procId;
}

function promoteProcedure(procId: string): void {
  const proc = store.find(p => p.procId === procId);
  if (!proc) return;
  proc.successCount += 1;
  if (proc.successCount === 3) proc.confidence = 0.7;
  if (proc.successCount >= 10) proc.confidence = 0.95;
  proc.lastUsed = Date.now();
}

function retrieveProcedures(
  task: string,
  envVersion: string,
  k = 3,
  minConfidence = 0.7,
): Procedure[] {
  const queryEmbed = embedProxy(task);
  const scored = store
    .filter(p => !envVersion || !p.envVersion || p.envVersion === envVersion)
    .filter(p => p.confidence >= minConfidence)
    .map(p => ({ p, score: cosineProxy(queryEmbed, p.descriptionEmbeddingProxy) }))
    .sort((a, b) => b.score - a.score);
  return scored.slice(0, k).map(s => s.p);
}

async function runWithSkillLibrary(task: string, envVersion = "v1"): Promise<string> {
  const candidates = retrieveProcedures(task, envVersion, 3);
  let system: string;

  if (candidates.length) {
    const plans = candidates
      .map(p =>
        `[procedure ${p.procId} | confidence ${p.confidence.toFixed(2)}]\n` +
        `Description: ${p.description}\n` +
        `Preconditions: ${JSON.stringify(p.preconditions)}\n` +
        `Steps: ${JSON.stringify(p.steps)}`,
      )
      .join("\n\n");
    system =
      "You are an agent with a procedural-memory skill library. Before " +
      "reasoning from scratch, consider whether any retrieved procedure " +
      "matches the task shape. If one does, follow its steps with appropriate " +
      "adaptation. If none match, derive from first principles.\n\n" +
      `Retrieved procedures:\n${plans}`;
  } else {
    system =
      "You are an agent. No procedural matches found for this task; " +
      "derive the solution from first principles.";
  }

  const { text } = await generateText({
    model: anthropic("claude-sonnet-4-5"),
    system,
    prompt: task,
  });
  return text;
}

// Example
async function main() {
  const procId = await writeProcedureCandidate(
    "Deploy the billing service to production",
    [
      "checkout main",
      "run pytest -q",
      "build docker image",
      "push to staging registry",
      "wait 45s for staging healthcheck",
      "promote to prod",
      "tag git release",
    ],
    "v1",
  );

  promoteProcedure(procId);
  promoteProcedure(procId);

  const out = await runWithSkillLibrary(
    "Deploy the analytics service to production",
    "v1",
  );
  console.log(out);
}

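// Surface API or JSON-parsing failures instead of leaving the promise rejection unhandled
main().catch(console.error);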

The TypeScript version stays deliberately schema-equivalent to the Python version — same five operations, same confidence-tiering, same env-version-stamped cache-coherence check. The only meaningful difference is the embedding proxy: a real system replaces cosineProxy over Jaccard tokens with cosine similarity over a real embedding model (Voyage, OpenAI, or Cohere), and replaces the in-memory store array with a vector index like Chroma, Qdrant, or pgvector. The architectural shape is what carries.

Trade-offs, failure modes, and gotchas

The single-success cache poisoning trap. Write every first success straight to the high-confidence tier — as Voyager does in Minecraft — and the store accumulates over-conditioned, brittle procedures. The retrieval surfaces them confidently; the agent follows them; failures look like “the agent confidently does the wrong thing.” The mitigation is the gradual-promotion ladder from the snippet above, calibrated to the workload’s stochasticity. Deterministic domains (Minecraft, deterministic API workflows) can use a single-success threshold; fuzzy domains (customer support, code review with human reviewers) need at least 3 successes before promotion and a verification pass on retrieval.

The over-specific procedure problem. A successful procedure gets stored with all the surface-level conditions of the success trace — the user’s name, the specific URL, the exact wording of the request — and the retrieval embedding ends up too narrow to match anything else. The mitigation is the generated-description step from the snippet: an LLM call after success generates a normalized description that drops the surface context. The description-generator’s prompt is load-bearing — “describe what kind of task this is” produces general descriptions; “describe what happened in this conversation” produces narrow ones. Most procedural-memory implementations that don’t pay off in production have a description-generator that’s anchored on the trial, not on the task class.

The stale-procedure execution path. A cached procedure was correct when stored; the environment has since changed; the agent retrieves and executes the procedure; the execution fails. The cleanest mitigation is the env-version stamp from the snippet, but the harder version of the problem is: not every environment change is captured by an explicit version stamp. The build system was upgraded; the API key rotated; the deploy target moved to a different region. The pattern is the same as cache invalidation in any distributed system — explicit versioning catches the changes you anticipate; runtime verification catches the ones you don’t. A defensible production pattern is cheap pre-execution verification: before the agent runs a retrieved procedure, run a low-cost check (does the deploy command still exist? is the staging environment reachable?) and bail to first-principles reasoning if the check fails. The procedure that fails verification gets demoted (confidence reduced, success count flagged) rather than executed.
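A minimal sketch of the verify-then-demote pattern described above, assuming the checks are zero-argument callables that return a boolean. `CachedProcedure`, `verify_or_demote`, and the `penalty` value are illustrative names and numbers, not part of the earlier snippet.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CachedProcedure:
    proc_id: str
    steps: list[str]
    confidence: float
    flagged: bool = False    # set when a pre-execution check has failed


def verify_or_demote(
    proc: CachedProcedure,
    checks: list[Callable[[], bool]],
    penalty: float = 0.2,
) -> bool:
    """Run cheap probes in order; on the first failure, demote and return False."""
    for check in checks:
        try:
            ok = check()
        except Exception:
            ok = False       # a crashing probe is treated as a failed check
        if not ok:
            proc.confidence = max(0.0, round(proc.confidence - penalty, 2))
            proc.flagged = True
            return False     # caller should fall back to first-principles reasoning
    return True


proc = CachedProcedure("proc-deploy", ["build docker image", "promote to prod"], 0.7)
passed = verify_or_demote(proc, [lambda: True])
failed = verify_or_demote(proc, [lambda: True, lambda: False])
```

In a real harness the lambdas would be probes such as `shutil.which("docker") is not None` or a reachability check against staging; the important property is that a failed probe changes the stored confidence rather than only skipping one execution.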

The procedure-conflict problem. Two procedures match the same task with similar confidence and similar embedding distance but produce different actions. The first deployment-related procedure was written when the team used Helm; the second was written after they switched to Kustomize; both got cached with overlapping descriptions. The retrieval surfaces both, and the model has no principled way to choose. The mitigation has two parts. First, write richer preconditions into the stored procedure — “deployments where the target uses Helm 3.x” — so the model has discriminating information at retrieval time. Second, run a consolidation pass (the same shape as the contradiction consolidator on the semantic side) that detects pairs of procedures with overlapping descriptions but divergent steps, flags them for review, and either merges the eligible cases or marks the older one as superseded. Without the consolidation pass, the store accumulates noise faster than retrieval can discriminate.
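The detection half of that consolidation pass might look like the sketch below. It assumes token-level Jaccard overlap as a stand-in for embedding similarity, and the thresholds (`desc_min`, `steps_max`) are placeholder values; `Proc`, `find_conflicts`, and `supersede_older` are hypothetical names. It covers only the "mark the older one as superseded" branch, not merging or human review.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Proc:
    proc_id: str
    description: str
    steps: list[str]
    created_at: float
    superseded_by: str = ""


def jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_conflicts(
    procs: list[Proc], desc_min: float = 0.6, steps_max: float = 0.6
) -> list[tuple[Proc, Proc]]:
    """Return (older, newer) pairs with overlapping descriptions but divergent steps."""
    pairs = []
    for p, q in combinations(procs, 2):
        desc_sim = jaccard(set(p.description.lower().split()),
                           set(q.description.lower().split()))
        step_sim = jaccard(set(p.steps), set(q.steps))
        if desc_sim >= desc_min and step_sim <= steps_max:
            older, newer = sorted((p, q), key=lambda x: x.created_at)
            pairs.append((older, newer))
    return pairs


def supersede_older(pairs: list[tuple[Proc, Proc]]) -> None:
    """Mark the older member of each conflicting pair as superseded."""
    for older, newer in pairs:
        older.superseded_by = newer.proc_id


helm = Proc("proc-helm", "deploy a service to production",
            ["build image", "helm upgrade", "wait for healthcheck"], created_at=100.0)
kustomize = Proc("proc-kustomize", "deploy a service to the production cluster",
                 ["build image", "kubectl apply -k", "wait for healthcheck"], created_at=200.0)
rotate = Proc("proc-rotate", "rotate an api key",
              ["create key", "revoke old key"], created_at=150.0)

conflicts = find_conflicts([helm, kustomize, rotate])
supersede_older(conflicts)
```

A production pass would more plausibly queue the flagged pairs for review than supersede automatically, since "older" is only a proxy for "stale".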

The success-attribution problem. The agent succeeds at a task. Which subsequence of actions actually caused the success? Storing the entire trajectory produces over-conditioned procedures (the irrelevant actions become part of the cached recipe); storing only the obvious load-bearing steps requires deciding which were load-bearing. Voyager sidesteps the problem by storing the whole successful program (Minecraft is forgiving). AWM sidesteps it by inducing higher-level workflows (the abstraction throws out the irrelevant details by design). A third option is a minimization pass: ablate steps from the recorded trace, replay against a sandbox, and keep only the steps whose ablation broke the task. The pass is expensive but produces clean procedures; production systems run it in a sleep-time consolidation batch rather than inline.
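The minimization pass can be sketched as a greedy one-step-at-a-time ablation. This assumes a `replay` callable that runs a candidate step list in a sandbox and reports success; here a toy stand-in (`toy_replay`, with a made-up `REQUIRED` list) plays that role. Greedy ablation is one simple strategy among several and costs one replay per removal attempt.

```python
from typing import Callable


def minimize_steps(
    steps: list[str],
    replay: Callable[[list[str]], bool],
) -> list[str]:
    """Drop each step whose removal still lets the task succeed on replay."""
    if not replay(steps):
        raise ValueError("recorded trace does not replay successfully")
    kept = list(steps)
    i = 0
    while i < len(kept):
        trial = kept[:i] + kept[i + 1:]
        if replay(trial):
            kept = trial      # not load-bearing: drop it and retest the same index
        else:
            i += 1            # ablation broke the task: keep the step
    return kept


# Toy sandbox (assumption): the task succeeds iff these steps occur in this order.
REQUIRED = ["build docker image", "push to staging registry", "promote to prod"]


def toy_replay(steps: list[str]) -> bool:
    it = iter(steps)
    return all(req in it for req in REQUIRED)   # ordered-subsequence check


trace = [
    "checkout main",
    "open README",
    "build docker image",
    "ls -la",
    "push to staging registry",
    "promote to prod",
]
minimal = minimize_steps(trace, toy_replay)
```

With a real sandbox each `replay` call is a full task execution, which is why the pass belongs in a batch job rather than on the request path.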

The abstraction-level-collapse problem. A library accumulates two flavors of procedure: very specific recipes (“deploy the billing service to prod-us-east-1”) and very general workflows (“deploy any service”). Retrieval on a new task surfaces both, with the specific one ranked higher by surface similarity to the trial that wrote it. The agent picks the specific one even when the general one is more transferable. The mitigation is hierarchical procedure embedding: store both the specific procedure and the abstract workflow, link them with a parent/child relationship, and at retrieval time use the specific recipe only when its own preconditions match the task, falling back to the abstract parent otherwise. The pattern is the same as a method-resolution order in OO programming: walk from specific to general, pick the first whose preconditions match.
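The specific-to-general walk can be sketched as follows, assuming preconditions are stored as structured key/value facts rather than the free-text strings used in the earlier snippet. `ProcNode` and `resolve` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProcNode:
    proc_id: str
    preconditions: dict[str, str]          # facts that must hold for this node to apply
    parent: Optional["ProcNode"] = None    # more abstract workflow, if any


def resolve(node: Optional[ProcNode], task_facts: dict[str, str]) -> Optional[ProcNode]:
    """Walk from the retrieved node toward the root; return the first applicable node."""
    current = node
    while current is not None:
        if all(task_facts.get(k) == v for k, v in current.preconditions.items()):
            return current
        current = current.parent
    return None                            # nothing applies: derive from scratch


general = ProcNode("deploy-any-service", {"action": "deploy"})
specific = ProcNode(
    "deploy-billing-us-east-1",
    {"action": "deploy", "service": "billing", "region": "us-east-1"},
    parent=general,
)

same_task = resolve(specific, {"action": "deploy", "service": "billing", "region": "us-east-1"})
new_service = resolve(specific, {"action": "deploy", "service": "analytics", "region": "eu-west-1"})
unrelated = resolve(specific, {"action": "rollback"})
```

The specific recipe wins only when every one of its preconditions holds; a new service in a new region falls through to the general workflow instead of inheriting the billing-specific steps.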

The retrieval-injection-context-bloat trade-off. Injecting the top-5 procedures’ full text into every prompt blows up the context budget. The mitigation has three shapes. First, top-K reduction — inject only the top-1 or top-2 if confidence is high enough; fall back to top-5 only when confidence is borderline. Second, progressive disclosure in the Anthropic Agent Skills sense — inject only the description and a one-line summary; let the model decide whether to “load” the full procedure via a tool call. Third, template-only injection — inject the action types and variable names but not the example values; the model fills the variables at execution time. Each shape trades retrieval depth for token cost; the right choice depends on whether the workload’s bottleneck is token cost (use progressive disclosure) or model confidence (use full injection).
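A sketch combining the first two shapes: inject the full text of a single procedure when the top hit is high-confidence, otherwise inject one-line stubs and let the model expand one on demand. `Skill`, `render_injection`, and the `load_skill` tool name are assumptions for illustration; the 0.95 cutoff reuses the high-confidence tier from the earlier snippet.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    summary: str
    steps: list[str]
    confidence: float


def render_injection(ranked: list[Skill], high: float = 0.95, max_stubs: int = 5) -> str:
    """Full steps for a high-confidence top hit; name + summary stubs otherwise."""
    if not ranked:
        return ""
    top = ranked[0]
    if top.confidence >= high:
        steps = "\n".join(f"  {i}. {s}" for i, s in enumerate(top.steps, 1))
        return f"[{top.name}] {top.summary}\n{steps}"
    return "\n".join(
        f"[{s.name}] {s.summary} (call load_skill to expand)" for s in ranked[:max_stubs]
    )


def load_skill(library: dict[str, Skill], name: str) -> list[str]:
    """Hypothetical tool-call target: return the full steps for one skill."""
    return library[name].steps


deploy = Skill("deploy-service", "Deploy a service to production",
               ["build docker image", "push to staging registry", "promote to prod"], 0.95)
rollback = Skill("rollback-service", "Roll a service back one release",
                 ["find previous tag", "redeploy previous tag"], 0.7)
library = {s.name: s for s in (deploy, rollback)}

full = render_injection([deploy, rollback])    # top hit is high-confidence
stubs = render_injection([rollback, deploy])   # top hit is borderline
```

The stub path costs an extra tool round-trip when the model does need the steps, which is the latency side of the token saving.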

The skill library doesn’t learn from failure. Every framework discussed in this article writes on success and is silent on failure. A procedure that consistently fails to apply (or applies and produces wrong outcomes) sits in the store with its success_count from older trials, getting retrieved and consuming token budget without paying off. The mitigation is a failure-counted demotion: on each retrieval, the harness tracks whether the procedure was actually used and whether the resulting task succeeded; failed uses decrement confidence, repeated failures demote the procedure to inactive, persistent failures evict it. The pattern is the same as a memory-write/forgetting policy applied to procedures, and most production procedural stores ship without it because the engineering complexity of plumbing the failure signal back to the store is non-trivial.
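A minimal sketch of failure-counted demotion, assuming the harness can report, per retrieval, whether the procedure was used and whether the task then succeeded. `ProcStats`, `record_outcome`, the penalty, and the two thresholds are illustrative choices, not values from any of the frameworks discussed.

```python
from dataclasses import dataclass


@dataclass
class ProcStats:
    proc_id: str
    confidence: float
    consecutive_failures: int = 0
    state: str = "active"            # active -> inactive -> evicted


def record_outcome(
    p: ProcStats,
    used: bool,
    succeeded: bool,
    penalty: float = 0.15,
    inactive_after: int = 3,
    evict_after: int = 6,
) -> None:
    """Feed one retrieval's outcome back into the stored procedure."""
    if not used:
        return                       # retrieved but ignored: no signal in this sketch
    if succeeded:
        p.consecutive_failures = 0   # a success resets the failure streak
        return
    p.consecutive_failures += 1
    p.confidence = max(0.0, round(p.confidence - penalty, 2))
    if p.consecutive_failures >= evict_after:
        p.state = "evicted"
    elif p.consecutive_failures >= inactive_after:
        p.state = "inactive"


p = ProcStats("proc-deploy", confidence=0.7)
for _ in range(3):
    record_outcome(p, used=True, succeeded=False)
```

After three failed uses the procedure is inactive with its confidence reduced; counting only consecutive failures is one design choice, and a windowed failure rate would be a reasonable alternative for noisy workloads.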

The cross-user contamination problem. A procedure that worked for user A — “deploy the billing service using their Helm chart and their staging environment” — gets cached and retrieved for user B, whose deployment setup is completely different. The retrieval embedding doesn’t carry the user-scope, so the cross-user retrieval is just a similarity hit. The mitigation has two shapes. First, scope by namespace: the procedural store is partitioned per user (or per project, per team), and retrieval is scoped to the active partition. This is the safer default but loses the cross-user generalization benefit. Second, scope by precondition: store user-specific preconditions in the procedure metadata and filter on retrieval. This preserves the cross-user benefit (a general “how to deploy a service” workflow remains transferable) at the cost of more careful precondition modeling. Multi-tenant agent systems almost always need both — namespace-scoped recipes plus globally-scoped workflows. The cross-session identity article treats the user as a first-class object in the memory architecture — the durable record that makes namespace-scoping concrete, the persona-clock model that distinguishes “same human, different scope,” and the deletion path that has to handle “delete every procedure scoped to this user” cleanly.
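The namespace-plus-global arrangement can be sketched with a plain list filter. `ScopedProc`, `visible_to`, `delete_namespace`, and the `"global"` sentinel are hypothetical; in a vector store the same rule would typically be a metadata filter applied at query time.

```python
from dataclasses import dataclass

GLOBAL = "global"    # sentinel namespace for transferable workflows (assumption)


@dataclass
class ScopedProc:
    proc_id: str
    namespace: str       # GLOBAL, or a user / team / project id
    description: str


def visible_to(procs: list[ScopedProc], namespace: str) -> list[ScopedProc]:
    """Procedures a caller may retrieve: its own namespace plus global workflows."""
    return [p for p in procs if p.namespace in (namespace, GLOBAL)]


def delete_namespace(procs: list[ScopedProc], namespace: str) -> list[ScopedProc]:
    """Deletion path: drop everything scoped to one namespace, never the global tier."""
    if namespace == GLOBAL:
        raise ValueError("refusing to delete the global namespace")
    return [p for p in procs if p.namespace != namespace]


store = [
    ScopedProc("proc-a1", "user-a", "deploy billing with user A's Helm chart"),
    ScopedProc("proc-b1", "user-b", "deploy billing with user B's Kustomize overlay"),
    ScopedProc("proc-g1", GLOBAL, "deploy a service to production"),
]

for_b = [p.proc_id for p in visible_to(store, "user-b")]
after_delete = [p.proc_id for p in delete_namespace(store, "user-a")]
```

Applying the filter before similarity ranking, rather than after, keeps another tenant's recipes out of the candidate set entirely instead of relying on the ranker to push them down.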

The hot-path verification cost. Every retrieved procedure ideally goes through a pre-execution verification before the agent runs it (does the environment match? are the dependencies still present?). For high-confidence procedures used many times a day, the verification cost dominates the cache-hit benefit. The mitigation is to skip verification for procedures above the high-confidence threshold (0.95 in the snippet) on recent retrievals; only verify periodically (every N retrievals, or every 24 hours). The pattern is the same as a CPU’s branch-predictor — trust the cache when it’s been right consistently, re-verify when it’s been wrong recently. The verification budget is a separable knob that production systems should tune by workload, not leave at the default.
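One way to express that verification budget is a small decision function. This is a sketch under stated assumptions: `VerifyState`, `should_verify`, `on_retrieval`, and the defaults (every 20 retrievals or every 24 hours) are illustrative, and timestamps are passed in explicitly so the logic is deterministic.

```python
from dataclasses import dataclass


@dataclass
class VerifyState:
    confidence: float
    retrievals_since_verify: int = 0
    last_verified: float = 0.0       # epoch seconds of the last verification
    last_run_failed: bool = False


def should_verify(
    s: VerifyState,
    now: float,
    high: float = 0.95,
    every_n: int = 20,
    max_age_s: float = 24 * 3600,
) -> bool:
    """Always verify low-confidence or recently failed entries; sample the rest."""
    if s.confidence < high:
        return True
    if s.last_run_failed:
        return True                  # it was wrong recently: stop trusting the cache
    if s.retrievals_since_verify >= every_n:
        return True
    return now - s.last_verified >= max_age_s


def on_retrieval(s: VerifyState, now: float) -> bool:
    """Decide for one retrieval and update the counters accordingly."""
    verify = should_verify(s, now)
    if verify:
        s.retrievals_since_verify = 0
        s.last_verified = now
    else:
        s.retrievals_since_verify += 1
    return verify


hot = VerifyState(confidence=0.95, last_verified=1000.0)
decisions = [on_retrieval(hot, 1000.0 + i) for i in range(1, 22)]
```

In this run the first twenty retrievals skip verification and the twenty-first triggers it; both `every_n` and `max_age_s` are the knobs the paragraph above says should be tuned per workload.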

Further reading

  • Voyager: An Open-Ended Embodied Agent with Large Language Models — Wang et al., 2023 — the canonical reference for the executable-skill-library shape of procedural memory. §3 (skill library design) and §5 (results) are the must-reads. The reported 15.3× speedup on tech-tree milestones and the qualitative analysis of how skills compose are the strongest empirical evidence that the tier is worth building.
  • Agent Workflow Memory — Wang, Mao, Fried, Neubig, 2024 — the workflow-induction port of Voyager to web-agent settings. The §3 induction algorithm and the §4 experimental results (24.6% Mind2Web, 51.1% WebArena improvements) are the cleanest case for the abstract-workflow shape of procedural memory in non-deterministic domains.
  • Equipping agents for the real world with Agent Skills — Anthropic, October 2025 — the production-engineering view of procedural memory as authored, progressively-disclosed skill folders. The progressive-disclosure mechanic (name → SKILL.md → bundled files) is the answer to the context-budget problem that every procedural store eventually hits.
  • LangMem documentation — Long-term Memory in LLM Applications — the prompt-rule shape of procedural memory, with the prompt_optimization API as the production-ready substrate. The conceptual guide makes the distinction between procedural-as-rules and procedural-as-recipes explicit; the API reference walks the metaprompt, gradient, and prompt_memory algorithms.
  • A Benchmark for Procedural Memory Retrieval in Language Agents — 2025 — the first benchmark that isolates procedural-memory retrieval from task execution. The headline result — LLM-generated procedural abstractions outperform raw embedding-based methods on cross-context transfer — is the cleanest available evidence for the description-generation step that the Voyager paper introduced and that this article’s snippet operationalizes.
  • The Cognitive Taxonomy: Semantic, Episodic, Procedural — the parent article that named the four CoALA memory types and gave procedural memory its JIT-compiled-routine-cache parallel. Today’s piece is the deep dive on the fourth tier; the taxonomy article is the right place to start if you came in without the four-type vocabulary.
  • Memory Write Policies: What’s Worth Remembering — the upstream layer that decides which successful trajectories become candidate procedures. The success-gate from this article is one specific instantiation of the four-stage admission pipeline that the write-policy article walks in general; procedural admission is the strictest cell in the matrix.
  • Memory Conflict, Forgetting, and Embedding Drift — the consolidator and forgetting policies discussed here are the procedural-side mirror of the contradiction-resolution and active-forgetting machinery that piece walks in detail for episodic and semantic memory. The patterns transfer; the metadata you score against is different.
  • Cross-Session Identity and Personalization — the layer that makes the cross-user contamination problem from this article tractable. Treats the user as a first-class object in the memory architecture, with a durable typed record, a session-start materialization step, the cold-start staircase, the persona-clock model, and the deletion path. The user-scoping that the procedural store needs at retrieval time is what that article’s identity layer provides.