Jatin Bansal — Backend, Distributed Systems & AI Engineering
Engineering Lead at Pluang, building a crypto exchange, trading, and investment platform. 7+ years in financial services, mostly distributed systems work with Kafka, Redis, and event-driven microservices.
Now learning to be an AI engineer. Notes under /ai-engineering.
Memory Write Policies: What's Worth Remembering
Memory write policies: distillation, write amplification, the journaling-vs-checkpoint trade-off, learned classifiers, and admission control for agents.
Hierarchical Memory: Working / Episodic / Semantic Tiers
Hierarchical memory: MemGPT/Letta's three-tier OS-paging model, what lives in core/recall/archival, and the promotion-demotion policies that bind them.
Knowledge Graphs as Structured Memory
When graphs beat vectors as memory: entities, relations, bi-temporal validity, Graphiti/Zep/Mem0g patterns, and hybrid graph+vector retrieval.
Long-Term Memory: Vector-Backed Episodic Storage
Long-term episodic memory: vector-backed storage, episode boundaries, recency-weighted retrieval, the WAL parallel, and the unit-of-recall problem.
Working Memory: Scratchpads, Blackboards, and Agent Notebooks
Working memory for agents: scratchpads, blackboards, notebooks, and dataflow state — the in-context surface that sits above the conversation buffer.
Short-Term Memory: Managing the Conversation Buffer
Truncation policies for the LLM conversation buffer: sliding windows, token-level vs message-level eviction, system-prompt protection, headroom budgeting.
The Cognitive Taxonomy: Semantic, Episodic, Procedural
A close read of the four cognitive memory types — working, episodic, semantic, procedural — and the CPU cache hierarchy each one maps onto.
The Memory Stack: A Map of AI Memory
A map of AI agent memory: in-context vs storage, the four cognitive types, the write/read/maintain axes, and why memory isn't RAG with a longer leash.
Tool Selection at Scale: MCP and Dynamic Routing
Why tool selection collapses past 30 tools, and how MCP, lazy loading, and retrieval keep accuracy high across thousands of tools without context bloat.
Multi-Agent Orchestration
Supervisor, swarm, and hierarchical multi-agent patterns: the A2A protocol, split-brain failure modes, the 15x token tax, and when not to reach for it.
Planning Agents vs Reactive Agents
When to plan ahead vs react step by step: ReAct vs plan-and-execute vs Tree-of-Thoughts, the cost of replanning, and the speculative-execution parallel.
The Agent Loop: ReAct and Its Descendants
How the agent loop actually works: ReAct's thought/action/observation cycle, plan-and-execute, stopping conditions, and the leader-election parallel.
Constrained Decoding: Grammars, Regex, and FSMs
How constrained decoding works: vocabulary masking, FSMs and pushdown automata, GBNF grammars, XGrammar/Outlines/llama.cpp, and the format tax.
Prompt Caching: Reusing the KV Cache Across Calls
How prompt caching reuses the KV cache across API calls: Anthropic breakpoints, OpenAI's automatic prefix cache, Gemini context cache, and cost math.
Streaming and Backpressure
Token-by-token LLM streaming end to end: SSE vs WebSockets, partial JSON parsing, cancellation with AbortController, and where backpressure actually bites.
Function Calling and Tool Use
Tool use is typed RPC for LLMs: tool schemas, the call-result loop, parallel calls, tool_choice, OpenAI vs Anthropic differences, and failure modes.
Structured Output: JSON Mode and Schema Coercion
Reliable JSON from LLMs: JSON mode vs strict json_schema vs tool use vs retry-on-validate, with Instructor and the Vercel AI SDK in practice.
Context Engineering: JIT vs AOT Context Loading
Context as the scarcest resource in an LLM call: how AOT prepacking and JIT retrieval compose, and the OS prefetch-vs-demand-paging parallel.
RAG Evaluation: Recall, Faithfulness, and Answer Quality
Retrieval metrics, generation metrics, and the judge problem: how to evaluate a RAG pipeline end-to-end with recall@k, faithfulness, and Ragas.
Query Transformations: Rewriting, HyDE, and Multi-Query
The query-side preprocessing layer for RAG: how rewriting, HyDE, multi-query, decomposition, and step-back prompting trade cost for recall.
Reranking: Cross-Encoders and Cascades
Why cross-encoders dominate the precision stage of retrieval, when a reranker pays off, and how to compose cascades that respect the latency budget.
Hybrid Search: BM25 Meets Dense Vectors
Why dense retrieval misses rare terms and exact matches, how BM25 and embeddings fuse via RRF, and the hybrid patterns that ship in production.
Chunking Strategies for Retrieval
Why chunk size is RAG's most undertuned variable, how recursive, semantic, and structural chunking differ, and when parent-document retrieval wins.
Vector Databases & ANN Indexes
How HNSW, IVF, and ScaNN trade recall for speed, why exact KNN doesn't scale, and how to pick between pgvector, Qdrant, and Pinecone in production.
Text Embeddings: Turning Meaning into Geometry
How embedding models encode text as dense vectors, why cosine similarity captures meaning, and how to build semantic search in Python and TypeScript.
LLM Inference: Tokens, Context, and Sampling
How LLMs process text: BPE tokenization, the context window as working memory, KV caching, and sampling parameters that shape output variance.
Writing Event Loops with Java Virtual Threads
A practical guide to writing small event loops in Java 21 and Java 25 using virtual threads, blocking queues, direct control flow, and graceful shutdown.
Context vs Prompt Engineering: The Evolution from Instructions to Intelligence
Exploring the shift from prompt engineering to context engineering in AI systems, understanding context rot, and why managing context is becoming more critical than crafting prompts.
StampedLock: How to Use Locks with Near Lock-Free Reads in Java
Learn how Java’s StampedLock enables near lock-free reads with optimistic locking, why it’s useful for virtual threads and read-heavy workloads, and how to use it safely.