Knowledge Graphs as Structured Memory
When graphs beat vectors as memory: entities, relations, bi-temporal validity, Graphiti/Zep/Mem0g patterns, and hybrid graph+vector retrieval.
A customer-success agent built on vector-backed memory ships well for six months and then quietly breaks in a way nobody catches in the dashboard. The user asks: “When my manager Priya asked for the Q1 forecast last March, what version of the model were we using?” The retrieval pass returns five episodes that mention forecasts, two that mention Priya, none that connect “Priya is my manager,” “March forecast version,” and “the model we used then” — even though all three facts are individually in the store. The agent answers confidently with the current model version and the most recent forecast it can find, both wrong for the question. Vector retrieval did its job: it returned textually similar episodes. The question it couldn’t answer was a multi-hop question with a temporal constraint over a relational structure, and that is the shape of question knowledge graphs exist to answer. This article is about when graphs are the right memory substrate and when they aren’t, and how the production frameworks (Graphiti, Zep, Mem0g) actually draw the line.
Opening bridge
Yesterday’s piece on long-term memory built the vector-backed episodic store: a write-ahead log with semantic indexing, recency-weighted retrieval, the α·recency + β·importance + γ·similarity score from Generative Agents. That substrate is the workhorse, and most production agents will ride on it for a long time. But the failure modes that piece flagged — the “last write wins” silent corruption, the conflict between “user prefers Postgres” written in March and “team migrated to Cassandra” written in May, the multi-hop questions that no top-K cosine retrieval can stitch together — are the symptoms of a missing primitive: an explicit model of entities, the relations between them, and the time intervals over which each fact was true. Knowledge graphs are that primitive. Today’s piece is the deep dive on graph-shaped memory: what it adds over vectors, what it costs, the bi-temporal model that the Zep paper (Rasmussen et al., 2025) popularized, and the hybrid graph+vector retrieval pattern that has converged across every production framework.
Definition
A knowledge graph as memory is a typed, persistent store of entities and the relations between them, where each relation can carry temporal validity, provenance, and confidence metadata, and where retrieval is a graph-traversal-plus-search problem rather than a pure similarity-search problem. Four properties separate it from the vector-only episodic store. First, it is entity-resolved — “Priya” written in one episode and “my manager Priya” in another and “Priya Sharma” in a CRM-imported row all resolve to the same graph node, not to three separate vectors that happen to have high cosine similarity. Second, it is relationally indexed — the edge “user MANAGED_BY Priya” is a first-class object you can traverse, not a fact embedded inside a text blob you have to retrieve and re-parse on every read. Third, it is temporally aware — each edge carries a valid_from/valid_to interval (and ideally a separate ingested_at timestamp), so the question “what was true in March?” is a structured query, not a heuristic over recency-weighted vector scores. Fourth, it is traversable — the read path can answer multi-hop questions (“who are the direct reports of the person who approved the Q1 forecast?”) by walking edges, which a vector store fundamentally cannot.
What graph memory is not: it is not a vector store with a graph veneer (the vector store gives you fuzzy semantic lookup; the graph gives you exact relational structure — both are needed in production). It is not GraphRAG in the Microsoft sense (that’s a one-shot extraction-and-summarization pipeline over a static corpus, optimized for global queries over documents; agent memory graphs are incrementally updated on every conversation turn and optimized for entity-grounded queries over a growing user history). It is not a semantic memory store with structured tags either — that’s a flat key-value table; the graph’s value is in the edges, not in tagging individual facts.
Intuition
The mental model that pays off is the indexed table versus the full-text search index in a relational database. A Postgres table with a B-tree index on user_id answers “give me all the orders for user 42” with a direct index lookup; a GIN index over a tsvector of the order notes answers “give me all orders that mention ‘refund’ in any phrasing” with a ranked, approximate match. Both indexes describe the same data, but they answer fundamentally different question shapes, and no production system tries to answer the structured query through the fuzzy index, or vice versa, because both the cost and the answer quality come out wrong.
Knowledge-graph memory is the indexed structural query path; vector memory is the fuzzy semantic query path. “Did the user say they were vegetarian?” is a fuzzy question — vector retrieval pulls episodes mentioning food, dietary preferences, restaurants, and the model assembles an answer from the candidates. “Who reports to whom?” or “what was the user’s role at company X before they joined company Y?” or “what was the active version of the forecast model in March before the migration?” are structured questions — the right answer is a graph traversal with a temporal filter, and asking a vector store to do it produces the kind of confident-sounding wrong answer that ends careers.
The two indexes are not interchangeable, and the design question for graph memory is not “graph or vector” — it is “what subset of the agent’s memory has enough relational structure that an explicit graph pays for itself, and what subset is better left in the fuzzy vector tier?” The answer to that question is workload-dependent and the focus of the rest of this article.
The distributed-systems parallel
Three parallels, each load-bearing.
Knowledge graphs are the indexed-table query path; vector memory is the full-text-search fallback. The same dualism that Postgres ships with — CREATE INDEX for structured access patterns, tsvector for fuzzy text — repeats one layer up for agent memory. The graph index is built at write time (entity extraction, relation extraction, deduplication against existing nodes); the vector index is built at the same time (embed the raw episode). Reads go to whichever index can answer the question at lower cost and higher quality. Production frameworks that try to force every read through one index degrade in predictable ways: pure-vector systems fail multi-hop and temporal questions; pure-graph systems fail open-ended semantic questions because the graph can’t represent the long tail of fuzzy facts that don’t fit any predefined relation type. The 2026 frameworks have converged on hybrid: graph first when the query has structure, vector first when it doesn’t, with a fusion layer when neither is decisive.
The bi-temporal model is the same trick as a database with both valid_time and transaction_time columns. Temporal databases have shipped this idea for decades: track when a fact was true in the world (valid time) separately from when the system learned about it (transaction time). The reason agent memory needs both is the same reason audit-heavy financial systems do: corrections happen late (“oh, actually that started in March, not April”) and you need to be able to answer two distinct historical queries: “what did the system believe on day X?” and “what was actually true on day X according to our best current understanding?” Graphiti’s bi-temporal model ports this directly: every edge carries (t_valid, t_invalid) for world time and (created_at, expired_at) for transaction time. When a new fact contradicts an existing one, Graphiti doesn’t delete the old edge. It stamps the old edge’s t_invalid and expired_at, creates a new edge with the corrected validity, and leaves the old fact queryable for “what did we believe?” audits. The same property that makes a transactional WAL recoverable makes a bi-temporal graph defensible against the “we silently overwrote a fact” failure mode that the long-term memory article flagged as the most common silent-corruption bug in semantic stores.
Entity resolution is the same problem as deduplication in a distributed log. Every event stream eventually has to answer “are these two records the same logical thing?” — same user across two devices, same product across two catalog imports, same person mentioned in three episodes under three spellings. The classic data-engineering answer is a deterministic ID plus a fuzzy fallback (exact match on a stable identifier; embedding similarity plus rules for the rest). Graph memory inherits the same problem and the same solution shape: each new entity extracted from an episode is checked against existing nodes (by name, by embedding, by attributes), and either resolved to an existing ID or created as a new node. Getting this wrong in either direction is the source of two distinct failure modes — under-merging leaves the graph fragmented into duplicate nodes that never get connected; over-merging collapses distinct entities into one and corrupts every edge that touches them. Production frameworks tune resolution aggressively because the entire value of a graph depends on it.
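The deterministic-match-plus-fuzzy-fallback shape described above can be sketched in a few lines. This is a minimal illustration, not any framework's resolver: the Node and EntityResolver names are hypothetical, string similarity from difflib stands in for embedding similarity, and the 0.85 threshold is an arbitrary assumption chosen to lean toward under-merging.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Node:
    node_id: str
    name: str
    aliases: set = field(default_factory=set)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


class EntityResolver:
    """Hypothetical resolver: exact match first, fuzzy fallback second."""

    def __init__(self, fuzzy_threshold: float = 0.85):
        # Assumption: a high threshold prefers duplicate nodes over wrong merges.
        self.fuzzy_threshold = fuzzy_threshold
        self.nodes = {}

    def resolve(self, mention: str) -> Node:
        key = normalize(mention)
        # 1) Deterministic pass: exact match on a known name or alias.
        for node in self.nodes.values():
            if key == normalize(node.name) or key in node.aliases:
                return node
        # 2) Fuzzy fallback: best string similarity (stand-in for embeddings).
        best, best_score = None, 0.0
        for node in self.nodes.values():
            score = SequenceMatcher(None, key, normalize(node.name)).ratio()
            if score > best_score:
                best, best_score = node, score
        if best is not None and best_score >= self.fuzzy_threshold:
            best.aliases.add(key)  # remember the variant for the exact pass
            return best
        # 3) No confident match: create a new node.
        node = Node(node_id=f"n{len(self.nodes) + 1}", name=mention)
        self.nodes[node.node_id] = node
        return node


resolver = EntityResolver()
first = resolver.resolve("Priya Sharma")
same_exact = resolver.resolve("priya   sharma")   # exact after normalization
same_fuzzy = resolver.resolve("Priya Sharmaa")    # typo, caught by the fuzzy pass
other = resolver.resolve("Devansh")               # unrelated name, new node
short = resolver.resolve("Priya")                 # below threshold, stays separate
```

Note that the bare mention “Priya” is left as its own node under this threshold. That is the under-merging side of the trade-off: a later dedup pass with more context can merge it, whereas a wrong merge would have to be untangled edge by edge.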
When graphs beat vectors (and when they don’t)
The 2026 empirical picture is sharper than it was a year ago. Three question classes where graph memory clearly outperforms vector-only memory:
Multi-hop relational queries. “What was the version of the forecast model used by the team Priya manages?” requires traversing Priya → MANAGES → team → USES → model. Pure-vector retrieval has no way to compose the independent lookups behind each hop into a coherent answer; the best it can do is retrieve top-K episodes for the surface form and hope the model can stitch them together. Graph traversal walks the edges directly and returns a single, grounded answer.
Temporal point-in-time queries. “What was the user’s role at company X as of March 2026?” is unanswerable from a vector store that retrieves on recency-weighted similarity alone: the most recent role will always rank higher, regardless of the asked-about date. A bi-temporal graph filters by valid_from ≤ date ≤ valid_to (treating an unset valid_to as still valid) and returns the structurally correct answer. The Zep paper reports 94.8% against MemGPT’s 93.4% on the Deep Memory Retrieval benchmark and accuracy improvements of up to 18.5% over a full-context baseline on LongMemEval, with the largest gains concentrated in temporal-reasoning categories.
Entity-centric aggregation. “Summarize everything I know about Priya” is a query against the neighborhood of a single entity node. A graph returns the entity and its first-hop edges in one traversal. A vector store has to retrieve every episode mentioning “Priya,” paginate through them, and deduplicate. Its cost grows linearly with the user’s history, while the graph version scales with the size of the entity’s neighborhood and is independent of how much unrelated history has accumulated.
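The multi-hop case above reduces to following a fixed sequence of relation types from a start node. Here is a minimal in-memory sketch; the triples, entity names, and the follow_path helper are illustrative assumptions, and a real deployment would express the same walk as a graph-database query.

```python
from collections import defaultdict

# (source, relation, target) triples; names are made up for illustration.
TRIPLES = [
    ("Priya", "MANAGES", "forecasting-team"),
    ("forecasting-team", "USES", "forecast-model-v3"),
    ("Devansh", "MANAGES", "platform-team"),
    ("platform-team", "USES", "deploy-tool-v1"),
]


def build_index(triples):
    """Index edges by (source, relation) so each hop is a dictionary lookup."""
    index = defaultdict(list)
    for source, relation, target in triples:
        index[(source, relation)].append(target)
    return index


def follow_path(index, start, relations):
    """Return every node reachable from start by following relations in order."""
    frontier = {start}
    for relation in relations:
        # Expand the whole frontier by one hop along this relation type.
        frontier = {
            target
            for node in frontier
            for target in index.get((node, relation), [])
        }
        if not frontier:
            break  # dead end: no node has an outgoing edge of this type
    return frontier


INDEX = build_index(TRIPLES)
models = follow_path(INDEX, "Priya", ["MANAGES", "USES"])
```

Each hop costs a lookup per frontier node, so the work tracks the size of the traversed neighborhood and not the total number of stored episodes.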
Three question classes where vectors still win:
Open-ended semantic recall. “Have I ever mentioned anything about my food preferences?” is a fuzzy question whose target facts don’t fit any specific predefined relation type. The graph would need a MENTIONS_PREFERENCE edge and an exhaustive ontology of preference types; the vector store just retrieves on cosine similarity over “food preferences” and works.
Long-tail facts the extractor missed. Entity-and-relation extraction is itself an LLM call with non-zero error rate. Facts the extractor misclassified (or skipped entirely) are invisible in the graph but still searchable in the raw vector store. Pure-graph systems lose recall on exactly the long tail that pure-vector systems handle best.
Low-cardinality or short-lived agents. A graph carries an irreducible per-episode write cost: entity extraction, relation extraction, deduplication against the existing graph. For a single-session agent, or a workload where the user history never exceeds a few hundred turns, the graph’s payoff is smaller than its overhead. The Mem0 paper’s empirical comparison (Chhikara et al., 2025) shows the Mem0g variant adds about 1.5 percentage points of accuracy on LOCOMO over the pure-vector Mem0 (68.4% vs 66.9%) at roughly 53% higher p50 latency (1.09s vs 0.71s) and meaningfully higher token cost. That is worth it for relational workloads and overkill for short-conversation chatbots.
The production verdict, which every 2026 framework converges on: run both, route on query shape, fuse on disagreement.
The bi-temporal model in detail
The single most underappreciated detail in graph memory is the bi-temporal part. Most engineers reach for graphs to encode entities and relations; they often skip the temporal half because their first benchmark doesn’t stress it. Six months later, the same agent is silently returning stale answers, and the team rediscovers what temporal-database designers have known for forty years.
Two clocks, both load-bearing:
Valid time: when the fact was true in the world. “Priya managed user-42 from 2025-09-15 to 2026-04-01.” This is the clock that answers “what was true in March?”
Transaction time: when the system learned the fact. “We ingested the ‘Priya managed user-42 ending 2026-04-01’ update at 2026-04-15 10:33.” This is the clock that answers “what did the system believe on March 30?”
The two clocks separate because corrections happen out of order. The user might tell the agent on April 15 that Priya stopped being their manager on April 1; the world-time fact is “true until 2026-04-01,” but the system only knew that as of 2026-04-15. An audit query “what did the agent think on April 5?” must return the old belief (Priya still the manager), not the corrected one. Single-temporal systems can’t make that distinction. Graphiti and the broader Zep architecture ship with both clocks; most hand-rolled graph implementations don’t, and then have to bolt the second clock on later when the first audit-driven question gets asked.
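The two historical queries can be made concrete with a classic bi-temporal version table. This sketch is an assumption-laden illustration and not Graphiti's internal schema: EdgeVersion, true_at, and believed_at are hypothetical names, corrections are stored as new versions, and intervals are treated as half-open.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class EdgeVersion:
    source: str
    relation: str
    target: str
    valid_from: datetime                   # valid time: fact became true
    valid_to: Optional[datetime]           # valid time: fact ended (None = still true)
    created_at: datetime                   # transaction time: version recorded
    expired_at: Optional[datetime] = None  # transaction time: version superseded


def covers(start, end, t):
    """Half-open interval check: start <= t < end, with None meaning unbounded."""
    return start <= t and (end is None or t < end)


def true_at(edges, world_t):
    """Best current understanding of what was true in the world at world_t."""
    return [
        e for e in edges
        if e.expired_at is None and covers(e.valid_from, e.valid_to, world_t)
    ]


def believed_at(edges, world_t, system_t):
    """What the system, as of system_t, believed was true at world_t."""
    return [
        e for e in edges
        if covers(e.created_at, e.expired_at, system_t)
        and covers(e.valid_from, e.valid_to, world_t)
    ]


# The April 15 correction supersedes the open-ended version recorded earlier.
EDGES = [
    EdgeVersion("Priya", "MANAGES", "user-42",
                valid_from=datetime(2025, 9, 15), valid_to=None,
                created_at=datetime(2025, 9, 15),
                expired_at=datetime(2026, 4, 15)),
    EdgeVersion("Priya", "MANAGES", "user-42",
                valid_from=datetime(2025, 9, 15), valid_to=datetime(2026, 4, 1),
                created_at=datetime(2026, 4, 15)),
]

april_5 = datetime(2026, 4, 5)
actually_true = true_at(EDGES, april_5)                  # corrected view
believed_then = believed_at(EDGES, april_5, april_5)     # audit view
```

With only the valid-time columns, the two queries would collapse into one and the April 5 audit would return the corrected answer instead of the belief the agent actually acted on.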
Edge invalidation works the same way: when a new fact contradicts an existing one, the old edge’s t_invalid (valid time) and expired_at (transaction time) get stamped, a new edge with the corrected (t_valid, t_invalid) and a fresh created_at gets inserted, and the old edge is preserved for historical queries. The contradiction-detection step is itself an LLM call — pattern-matching for “X is now Y” or “X used to be Y but is now Z” — and it’s where most of the framework’s value lives. A graph without contradiction detection is a graph that silently accumulates contradictory facts and degrades the same way the naive last-write-wins semantic store does.
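The invalidate-and-insert write path can be sketched as follows. The contradiction test here is a deliberately simple rule (one live source per target for an exclusive relation) standing in for the LLM call described above; the Edge fields and the assert_fact helper are hypothetical names for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Edge:
    source: str
    relation: str
    target: str
    valid_at: datetime                     # valid time: start
    invalid_at: Optional[datetime] = None  # valid time: end
    created_at: Optional[datetime] = None  # transaction time: start
    expired_at: Optional[datetime] = None  # transaction time: end


def assert_fact(edges, new, now, exclusive_relations=("MANAGES",)):
    """Insert a new edge and invalidate, never delete, any edge it contradicts.

    Rule-based stand-in for contradiction detection: for an exclusive
    relation, a target may have only one live source at a time.
    """
    for old in edges:
        contradicts = (
            old.relation == new.relation
            and old.relation in exclusive_relations
            and old.target == new.target
            and old.source != new.source
            and old.expired_at is None
        )
        if contradicts:
            old.invalid_at = new.valid_at  # world time: old fact stopped holding
            old.expired_at = now           # system time: belief retired now
    new.created_at = now
    edges.append(new)
    return edges


edges = []
assert_fact(edges, Edge("Priya", "MANAGES", "user-42",
                        valid_at=datetime(2025, 9, 15)),
            now=datetime(2025, 9, 15))
assert_fact(edges, Edge("Devansh", "MANAGES", "user-42",
                        valid_at=datetime(2026, 4, 15)),
            now=datetime(2026, 4, 15))
```

After the second write both edges are still present: the old one carries its end stamps and the new one is live, so historical queries keep working.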
Hybrid retrieval: graph + vector + keyword
The 2026 production pattern is consistent across every framework that has published numbers: a hybrid read path that runs three retrievals in parallel and fuses the results.
- Graph traversal. Start from one or more entity nodes (extracted from the query), walk up to N hops, return the subgraph with edges valid at the query time.
- Vector similarity. Embed the query, retrieve top-K episodes/facts by cosine similarity from the vector index (filtered by tenant).
- Keyword/BM25. Sparse-retrieval pass over the same corpus to catch exact-term matches the embedding might miss — same idea as hybrid search in the RAG subtree.
The fusion step is some form of reciprocal-rank fusion (RRF) over the three result lists, optionally followed by a reranker. Zep’s published architecture follows this shape and reports a P95 retrieval latency of ~300ms with no LLM calls in the read path itself (the LLM work is concentrated at write time, in the entity-and-relation extraction). The trade-off is explicit and exactly what the context-engineering article framed as the JIT-vs-AOT question: pay LLM cost at write time so the read path can stay cheap, or pay LLM cost at read time and skip the heavyweight write pipeline. Graph memory makes the write-heavy choice.
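For reference, reciprocal-rank fusion itself is only a few lines. This is a generic sketch of the technique, not Zep's implementation; the result IDs are made up, and k = 60 is the commonly used default rather than a value taken from any of the frameworks above.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of IDs: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from the three retrieval passes.
graph_hits = ["edge:priya-manages", "edge:team-uses-v3", "ep:17"]
vector_hits = ["ep:17", "ep:42", "edge:priya-manages"]
bm25_hits = ["ep:42", "edge:priya-manages"]

fused = reciprocal_rank_fusion([graph_hits, vector_hits, bm25_hits])
```

Because RRF only looks at ranks, it needs no score calibration between the graph, vector, and BM25 passes, which is a large part of why it is a common default for hybrid retrieval.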
The query-routing question, “which retrieval path do I trust for this question?”, has two defensible answers. The conservative answer is to always run all three and fuse (Zep’s default); the aggressive answer is to classify the query first (with a small-model call) into “structural,” “fuzzy,” or “mixed,” and run only the relevant path. The conservative answer wastes some retrieval cost on every read; the aggressive answer adds a small-model latency tax to every read and can mis-route on ambiguous queries. Production systems lean conservative because the fusion cost is cheap relative to the model call that follows.
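To make the aggressive option concrete, here is a toy router. The keyword cue lists are hypothetical and stand in for the small-model classifier; the point of the sketch is the control flow, in particular that anything ambiguous falls back to running every path.

```python
import re

# Hypothetical cue lists; a real router would use a small classifier model.
STRUCTURAL_CUES = re.compile(
    r"\b(who|reports? to|manages?|manager|before|after|as of|when|version)\b",
    re.IGNORECASE,
)
FUZZY_CUES = re.compile(
    r"\b(ever|anything|mention(ed)?|about|like|prefer(ences?)?)\b",
    re.IGNORECASE,
)

ALL_PATHS = ["graph", "vector", "bm25"]


def route(query: str) -> list:
    """Pick retrieval paths for a query; fall back to all paths when unsure."""
    structural = bool(STRUCTURAL_CUES.search(query))
    fuzzy = bool(FUZZY_CUES.search(query))
    if structural and not fuzzy:
        return ["graph"]
    if fuzzy and not structural:
        return ["vector", "bm25"]
    # Mixed or no signal: the conservative choice is to run everything and fuse.
    return list(ALL_PATHS)
```

For example, route("Who reports to Priya?") selects only the graph path, while a question that mixes a preference cue with a temporal cue runs all three.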
Code: Python — Graphiti for entity-and-temporal memory
The smallest interesting build: ingest episodic content into a temporal knowledge graph, then run a hybrid retrieval over it. Uses Graphiti (the open-source temporal-graph engine from Zep) against a Neo4j backend. Install: pip install graphiti-core neo4j and run docker run -d -p 7687:7687 -p 7474:7474 -e NEO4J_AUTH=neo4j/password neo4j:5. You also need OPENAI_API_KEY set for the entity-extraction LLM calls (Graphiti uses OpenAI by default and supports Anthropic via configuration).
Three things to notice. First, reference_time is the valid-time anchor, not the wall-clock time of the ingest — Graphiti uses it to stamp the extracted facts. If you ingest a conversation that happened last week, pass last week’s timestamp, not now(). Second, group_id is the tenant scope — every read and write goes through it, the same multi-tenant pattern as the LangGraph namespace tuple from yesterday’s piece, and it is the structural property that prevents cross-user leakage. Third, the read path has no LLM call — search() is graph traversal + vector + BM25 fused with RRF; the LLM work happened at write time when entities and relations were extracted. This is the JIT/AOT split applied to memory: pay at write time to keep reads cheap.
When the third (April) episode is ingested, Graphiti’s contradiction-detection pass stamps the old MANAGES edge from Priya with invalid_at = 2026-04-15, creates a new MANAGES edge from Devansh with valid_from = 2026-04-15, and preserves both. The “who managed in March?” query filters on valid_from ≤ March ≤ valid_to and returns Priya correctly, the kind of answer pure-vector retrieval cannot produce regardless of how many cosine neighbors it pulls.
Code: TypeScript — Mem0g (graph + vector hybrid) against Neo4j
The TypeScript version uses Mem0’s graph configuration (Mem0g), which writes the same content to both a vector store and a graph store and queries both at read time. Mem0g is the simpler-to-adopt path when you already have a working Mem0 deployment — flip the config, point it at Neo4j, and the same add/search API gains graph-aware retrieval. Install: npm install mem0ai neo4j-driver. You also need a Neo4j instance reachable from your app.
The shape parallels the Python Graphiti example deliberately: both call out the same operations (ingest, search), the same multi-tenancy primitive (group_id / user_id), the same hybrid-retrieval contract. The substantive difference is opinionatedness: Mem0g treats the graph as a secondary index that augments the vector store; Graphiti treats the graph as the primary store with vectors and BM25 as auxiliary indexes inside it. Mem0g’s published numbers (Mem0 paper, Chhikara et al., 2025) show it lifts LOCOMO accuracy from 66.9% (Mem0 vector-only) to 68.4% (Mem0g), a real but modest gain. Zep’s published numbers (Zep paper, Rasmussen et al., 2025) show its graph-first architecture improving LongMemEval accuracy by up to 18.5% over a full-context baseline, with gains in the categories where temporal and relational reasoning dominate. The two represent the two ends of the design space: graph-augmented vector store vs. graph-first hybrid.
Production frameworks: how each draws the line
Three frameworks worth knowing by their stance on the graph/vector boundary.
Graphiti / Zep is graph-first. The temporal knowledge graph is the source of truth; the vector index and BM25 index live inside the graph as auxiliary structures over node and edge content. Entity extraction and bi-temporal stamping happen on every write; the read path is pure traversal-plus-search, no LLM in the loop. Best fit: workloads where relational structure dominates (CRM, compliance, healthcare, support escalation) and where audit trails and temporal point-in-time queries are first-class requirements. Cost: heavyweight write pipeline (entity + relation extraction + deduplication + contradiction detection on every episode), proportional to a small-model call per write.
Mem0 with graph_store enabled (Mem0g) is graph-augmented. The vector store remains the primary substrate; the graph is a secondary index that gets populated on the same add() call and consulted on the same search() call. Easier to bolt onto an existing Mem0 deployment; the API surface doesn’t change. Best fit: teams that already have working vector-backed memory and want the multi-hop and entity-resolution lift without rewriting their write pipeline. Cost: marginally higher write latency, marginally higher accuracy on relational queries, no fundamental architectural change.
Microsoft GraphRAG / LazyGraphRAG is graph-as-corpus-index, not graph-as-memory. The system runs a one-shot extraction-and-community-detection pipeline over a static document corpus, then answers global (“summarize the main themes”) and local (“what does the corpus say about X?”) questions over the resulting graph. It is the best-known published architecture for graph-based RAG at corpus scale, but it is not an agent memory framework: there’s no incremental update model, no bi-temporal validity, no per-user tenant boundary. Best fit: RAG over a large but stable corpus where global summarization queries matter (think: “synthesize what 10,000 documents say about climate policy”). Worst fit: agent memory, because the assumption of a stable corpus breaks the moment the agent ingests its first conversation turn.
The pragmatic choice for most teams in 2026 is one of two paths: start with Mem0g if you’re already on Mem0 or want graph-augmented vector memory with minimal architectural change; start with Graphiti if you’re greenfield and your workload’s defining feature is relational/temporal complexity. Both pay for themselves on the right workloads and are over-engineered on the wrong ones — the decision is downstream of how much of your memory’s value lives in entity graphs vs. open-ended episodes.
Trade-offs, failure modes, and gotchas
The entity-resolution failure mode. Under-merging (Priya, Priya Sharma, my manager Priya → three nodes) fragments the graph and breaks every traversal that should have connected them; over-merging (Devansh and Devansh from a different team → one node) corrupts every edge that touches the collapsed entity. Production frameworks tune resolution aggressively but the failure mode is workload-dependent — a workload with many entities that share names (common in enterprise CRM) will always be at the over-merging end of the trade-off. The defensible default is “lean toward under-merging with a periodic dedup pass,” not “lean toward over-merging and hope nobody notices.”
The single-temporal trap. Implementing only valid time (when the fact was true) and skipping transaction time (when the system learned it) works fine until the first audit-driven question, “what did the agent believe on day X?”, and then the absence of the second clock is unrecoverable without re-ingesting from raw episodes. The Graphiti and Zep stack ships with both clocks because its authors learned this lesson early; hand-rolled graph implementations almost always learn it the hard way. Plan for bi-temporal from day one; the storage cost is negligible and the audit cost of retrofitting is enormous.
The extraction-quality ceiling. Graph memory is only as good as the LLM that extracts entities and relations from each episode. A poorly-prompted extractor will miss relations, classify them inconsistently (“MANAGES” vs “IS_MANAGER_OF”), or hallucinate edges that aren’t in the source. The mitigation is a small, prescribed ontology (a fixed set of entity types and relation types that the extractor is constrained to use) and an evaluation harness that scores extraction precision against a labeled golden set. Frameworks that let users define custom entity and edge types (Graphiti’s entity_types/edge_types/edge_type_map parameters) are doing this for you; ad-hoc extraction “with whatever the model decides” silently degrades over months.
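A small harness along these lines might look like the following sketch. The ontology, the synonym table, and the sample triples are all hypothetical, and precision is computed per extracted triple with off-ontology relations counted as errors.

```python
# Hypothetical fixed ontology of relation types.
ALLOWED_RELATIONS = {"MANAGES", "WORKS_AT", "USES"}

# Known synonyms mapped to (canonical relation, whether direction flips).
SYNONYMS = {
    "IS_MANAGER_OF": ("MANAGES", False),
    "MANAGED_BY": ("MANAGES", True),  # inverse: swap subject and object
}


def canonicalize(triple):
    """Return the triple in canonical form, or None if it is off-ontology."""
    subject, relation, obj = triple
    relation = relation.upper()
    if relation in SYNONYMS:
        relation, flip = SYNONYMS[relation]
        if flip:
            subject, obj = obj, subject
    if relation not in ALLOWED_RELATIONS:
        return None
    return (subject, relation, obj)


def extraction_precision(extracted, golden):
    """Fraction of extracted triples whose canonical form is in the golden set."""
    if not extracted:
        return 0.0
    correct = sum(1 for t in extracted if canonicalize(t) in golden)
    return correct / len(extracted)


golden = {
    ("Priya", "MANAGES", "user-42"),
    ("Priya", "USES", "forecast-model-v3"),
}
extracted = [
    ("Priya", "IS_MANAGER_OF", "user-42"),   # synonym, same direction
    ("user-42", "MANAGED_BY", "Priya"),      # synonym, inverse direction
    ("Priya", "LIKES", "coffee"),            # off-ontology, counted as an error
    ("Priya", "USES", "forecast-model-v3"),
]
precision = extraction_precision(extracted, golden)
```

Running the same golden set after every prompt or model change turns the slow drift described above into a number that can fail a build.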
The write-latency tax. Every episode now triggers entity extraction, relation extraction, dedup against existing nodes, contradiction detection, and bi-temporal stamping. That’s typically 200-500ms of LLM time per write at production model latencies, on top of the vector write. For chat-paced workloads (one user turn every few seconds) this is invisible; for high-throughput ingest (importing a CRM export, replaying a million-message support archive) it dominates. The mitigation is batch ingestion with parallelism and, for back-fills, a cheaper extraction model — but the underlying cost is real and worth budgeting for.
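For back-fills, the parallelism half of that mitigation can be sketched with a bounded-concurrency batch. The extract coroutine below is a stand-in that sleeps instead of calling a model, and its capitalized-word "entity" rule is purely illustrative; only the semaphore pattern is the point.

```python
import asyncio


async def extract(episode: str) -> dict:
    """Stand-in for the LLM extraction call; sleeps instead of calling a model."""
    await asyncio.sleep(0.01)
    # Toy rule for illustration only: treat capitalized words as entities.
    entities = sorted({word for word in episode.split() if word.istitle()})
    return {"episode": episode, "entities": entities}


async def ingest_batch(episodes, max_concurrency=8):
    """Run extraction over a backlog with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(episode):
        async with semaphore:
            return await extract(episode)

    # gather preserves input order, so results line up with episodes.
    return await asyncio.gather(*(worker(e) for e in episodes))


backlog = ["Priya approved the forecast", "Devansh joined the team"]
results = asyncio.run(ingest_batch(backlog, max_concurrency=2))
```

Extraction parallelizes cleanly because each call only reads its own episode. The graph writes that follow may still need ordering when episodes touch the same entities, since deduplication and contradiction detection depend on what is already in the graph.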
The graph-can-answer-anything illusion. A team that builds a knowledge graph for the first time will often try to make it the only memory tier. This recapitulates the same mistake teams make with vector RAG: trying to force every query through a single index. Graphs cannot answer open-ended fuzzy questions (“have I ever mentioned anything about food preferences?”) because they only index facts the extractor recognized. Run both indexes; route or fuse on the read path. The 2026 consensus across the Mem0 frameworks, Zep, and the December 2025 Memory in the Age of AI Agents survey is uniform on this point.
The tenant-isolation bug, again. Graph memory has the same multi-tenant failure mode as vector memory, and it bites harder because a single missed scope filter on a graph query can leak relational structure (the org chart, the household graph) that is more identifying than any single fact. Graphiti uses group_id, Mem0 uses user_id, and both are enforced at the API level — but custom code that bypasses the helper and goes straight to Cypher (or to the graph driver) can drop the filter. Audit every direct-Cypher query in your code base specifically for the tenant scope before you ship.
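One cheap guard for that audit is a lint that refuses any raw query lacking a bound tenant filter. This is a sketch under stated assumptions: it presumes nodes carry a group_id property filtered through a $group_id parameter, the assert_tenant_scoped helper is hypothetical, and a textual check like this catches omissions without proving that every matched node is scoped.

```python
import re

# Assumption: tenant scope is expressed as `<var>.group_id = $group_id`.
TENANT_FILTER = re.compile(r"\bgroup_id\s*=\s*\$group_id\b")


def assert_tenant_scoped(cypher: str, params: dict) -> None:
    """Raise ValueError unless the query filters on a bound $group_id parameter."""
    if not TENANT_FILTER.search(cypher):
        raise ValueError("query has no group_id filter bound to $group_id")
    if not params.get("group_id"):
        raise ValueError("group_id parameter is missing or empty")


scoped = "MATCH (n:Entity)-[r]->(m) WHERE n.group_id = $group_id RETURN m"
unscoped = "MATCH (n:Entity)-[r]->(m) RETURN m"

assert_tenant_scoped(scoped, {"group_id": "user-42"})  # passes silently


def is_rejected(cypher, params):
    """Helper for demonstration: True if the lint rejects the query."""
    try:
        assert_tenant_scoped(cypher, params)
    except ValueError:
        return True
    return False
```

Calling the check from the one function that executes raw Cypher, and in a test that scans query strings in the code base, covers both the runtime and the review-time version of the audit.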
The embedding-drift problem, now on two surfaces. The embedding-drift bug hits the vector index as before. But the graph also stores embeddings (for entity resolution and for the auxiliary vector index over node/edge content), and those drift too when the embedding model changes. A graph re-embed is more expensive than a vector re-embed because every node and edge text needs to be re-vectorized, not just the raw episode text. Pin your embedding model in the build, treat upgrades as schema migrations, and budget for a full graph re-embed when you change them.
The “graph looks great in demos and worse in production” gotcha. Demos use small, clean datasets with crisp entity types. Production graphs accumulate misclassified nodes, ambiguous relations, contradicting facts from low-confidence extractions, and the long tail of episodes that didn’t quite fit the ontology. The maintenance layer (periodic dedup, ontology evolution, contradiction-detection precision audits, low-confidence-edge pruning) is where the production-grade systems pull ahead and where the MVPs silently degrade. The forthcoming memory evaluation article in this subtree will cover the benchmarks specifically; the rule of thumb today is don’t ship a graph memory without an extraction-quality eval, even a small one.
Further reading
- Zep: A Temporal Knowledge Graph Architecture for Agent Memory (Rasmussen et al., 2025). The paper that defined the bi-temporal model now standard in production graph memory. Reports 94.8% against MemGPT’s 93.4% on DMR and accuracy improvements of up to 18.5% over a full-context baseline on LongMemEval, with P95 read latency ~300ms and no LLM calls in the read path. The clearest published reference for the graph-first hybrid architecture.
- Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory (Chhikara et al., 2025). Includes the Mem0g graph variant and the LOCOMO benchmark numbers (66.9% vector-only → 68.4% Mem0g). The strongest published case for “graph-augmented vector” as the lower-friction path to graph memory.
- Graphiti: Knowledge Graph Memory for an Agentic World (Neo4j developer blog). The cleanest end-to-end walk-through of Graphiti’s write pipeline, including the entity extraction, deduplication, and contradiction-detection passes. Good companion to the API docs.
- Memory in the Age of AI Agents (Hu et al., December 2025). The 47-author survey that catalogs the field at the end of 2025 and proposes the finer-grained factual/experiential/working × token-level/parametric/latent taxonomy. The §4 graph-memory section is the most rigorous published comparison of the production frameworks.
What to read next
- Long-Term Memory: Vector-Backed Episodic Storage — the immediate predecessor and the vector-only substrate this article builds on. Knowledge graphs are the second index over the same store, not a replacement for the first.
- The Cognitive Taxonomy: Semantic, Episodic, Procedural — the upstream framing. Graph memory is the substrate that finally makes semantic memory feel as engineered as episodic — entities, relations, and validity intervals are the structural skeleton of “what is true about the world.”
- The Memory Stack: A Map of AI Memory — the parent map. Graph memory lives on the storage side of the in-context/storage line and the semantic side of the four-type taxonomy; the parent article keeps the whole layout legible.
- Hybrid Search: BM25 Meets Dense Vectors — the retrieval substrate. The graph+vector+BM25 fusion on the read path is exactly the hybrid-search pattern from the RAG subtree, applied one layer up.
- Hierarchical Memory: Working / Episodic / Semantic Tiers — where the graph fits in a tiered architecture. A graph-augmented hierarchical-memory system typically parks the knowledge graph at the cold tier and traverses it via tool calls — the same access pattern as archival_search, with a richer query language.