Jatin Bansal — Backend, Distributed Systems & AI Engineering
Engineering Lead at Pluang, building a crypto exchange, trading, and investment platform. 7+ years in financial services, mostly distributed systems work with Kafka, Redis, and event-driven microservices.
Now learning to be an AI engineer. Notes under /ai-engineering.
Human-in-the-Loop Feedback Loops for LLM Systems
Turning thumbs, edits, and re-rolls into a data flywheel: capturing user feedback, sampling traces for review, label hygiene, and selective annotation.
Drift Detection and Regression Testing for LLM Systems
Detecting input and output distribution shift in LLM apps, plus the regression-testing protocol for model upgrades: shadow runs, canaries, judge replays.
Production Tracing and Observability for LLM Systems
Distributed tracing for LLM apps in 2026: span shape, OTel GenAI semantics, OpenInference, sampling, and the LangSmith/Langfuse/Phoenix decision.
LLM-as-Judge: Pointwise and Pairwise
How LLM-as-judge works in production: rubrics, pointwise vs pairwise, position/verbosity/self-preference bias, and how to calibrate against humans.
Eval-Driven Development for LLM Systems
Why evals replace unit tests for LLM systems: error-analysis-first workflow, golden sets, the test pyramid, and CI-gate harnesses in Python and TS.
Production Memory Frameworks: MemGPT/Letta, mem0, Zep, Graphiti
MemGPT/Letta, mem0, Zep, and Graphiti compared on architecture, write/read paths, benchmarks, and the build-versus-buy decision for production memory.
Memory Evaluation: Benchmarks and Custom Evals
Memory evaluation for agents: LoCoMo and LongMemEval, multi-hop recall, contradiction handling, and how to design a custom eval that catches drift.
Conversation Compaction: Keeping Long Sessions Alive
Conversation compaction in long agent sessions: reactive vs preemptive triggers, cache-aware deletion, circuit breakers, snapshot-rollback, journals.
Anatomy of an Agent Harness
Inside the agent harness: context assembly, tool dispatch, streaming, cache management, error recovery, cost accounting, telemetry — and build-vs-buy.
Long-Horizon Task Reliability
Drift, checkpointing, and recovery in long-running agents: the distributed-saga parallel, when to abort, and the METR doubling curve.
Computer Use and Browser Agents
Screenshot-driven and DOM-driven agents in 2026: action grounding, accessibility tree vs pixel input, sandboxing, prompt injection, OSWorld.
Memory Privacy, Isolation, and Multi-Tenancy
Per-tenant memory isolation for LLM agents: namespace discipline, cross-tenant leak modes, prompt-injection-via-memory, and verifiable GDPR deletion.
Multi-Agent Shared Memory
Shared memory across LLM agents: scoping rules, consistency models, blackboard vs shared-block vs cross-thread store patterns, and the split-brain bugs.
Cross-Session Identity and Personalization
Cross-session identity for LLM agents: user profiles, personas, the cold-start staircase, sensitivity-gated writes, and the deletion path.
Procedural Memory and Skill Caching
Procedural memory for AI agents: caching successful action sequences as a JIT-compiled-routine store. Voyager, AWM, LangMem, Agent Skills.
Memory Conflict, Forgetting, and Embedding Drift
Three failures of agent memory at scale: contradiction handling, active forgetting, and embedding drift — with worked patterns and code.
Temporal Reasoning and Memory Provenance
Temporal reasoning and provenance in agent memory: as-of queries, bi-temporal validity, dated claims, staleness gates, and per-fact source audit trails.
Memory Retrieval Policies: Recency, Relevance, Importance
Memory retrieval policies: the recency-relevance-importance rerank, exponential decay, read-time boosts, and the LRU/LFU/ARC cache parallel for agents.
Sleep-Time Compute and Memory Consolidation
Sleep-time compute for AI agents: background consolidation, the VACUUM parallel, Letta's sleep-time agents, Claude Code's auto-dream, and the cost math.
Summarization and Context Compression
Context compression for LLM agents: recursive summarization, structured note-taking, measuring quality loss, and the log-compaction parallel.
Reflection: From Experiences to Beliefs
Memory reflection: write-time enrichment that turns raw episodes into higher-order beliefs, the Generative Agents reflection loop, and its failure modes.
Episode Segmentation and Salience Scoring
Episode segmentation and salience scoring: prediction-error and topic-shift boundaries, anchored 1-10 importance, the event-sourcing aggregate parallel.
Memory Write Policies: What's Worth Remembering
Memory write policies: distillation, write amplification, the journaling-vs-checkpoint trade-off, learned classifiers, and admission control for agents.
Hierarchical Memory: Working / Episodic / Semantic Tiers
Hierarchical memory: MemGPT/Letta's three-tier OS-paging model, what lives in core/recall/archival, and the promotion-demotion policies that bind them.
Knowledge Graphs as Structured Memory
When graphs beat vectors as memory: entities, relations, bi-temporal validity, Graphiti/Zep/Mem0g patterns, and hybrid graph+vector retrieval.
Long-Term Memory: Vector-Backed Episodic Storage
Long-term episodic memory: vector-backed storage, episode boundaries, recency-weighted retrieval, the WAL parallel, and the unit-of-recall problem.
Working Memory: Scratchpads, Blackboards, and Agent Notebooks
Working memory for agents: scratchpads, blackboards, notebooks, and dataflow state — the in-context surface that sits above the conversation buffer.
Short-Term Memory: Managing the Conversation Buffer
Truncation policies for the LLM conversation buffer: sliding windows, token-level vs message-level eviction, system-prompt protection, headroom budgeting.
The Cognitive Taxonomy: Semantic, Episodic, Procedural
A close read of the four cognitive memory types — working, episodic, semantic, procedural — and the CPU cache hierarchy each one maps onto.
The Memory Stack: A Map of AI Memory
A map of AI agent memory: in-context vs storage, the four cognitive types, the write/read/maintain axes, and why memory isn't RAG with a longer leash.
Tool Selection at Scale: MCP and Dynamic Routing
Why tool selection collapses past 30 tools, and how MCP, lazy loading, and retrieval keep accuracy high across thousands of tools without context bloat.
Multi-Agent Orchestration
Supervisor, swarm, and hierarchical multi-agent patterns: the A2A protocol, split-brain failure modes, the 15x token tax, and when not to reach for it.
Planning Agents vs Reactive Agents
When to plan ahead vs react step by step: ReAct vs plan-and-execute vs Tree-of-Thoughts, the cost of replanning, and the speculative-execution parallel.
The Agent Loop: ReAct and Its Descendants
How the agent loop actually works: ReAct's thought/action/observation cycle, plan-and-execute, stopping conditions, and the leader-election parallel.
Constrained Decoding: Grammars, Regex, and FSMs
How constrained decoding works: vocabulary masking, FSMs and pushdown automata, GBNF grammars, XGrammar/Outlines/llama.cpp, and the format tax.
Prompt Caching: Reusing the KV Cache Across Calls
How prompt caching reuses the KV cache across API calls: Anthropic breakpoints, OpenAI's automatic prefix cache, Gemini context cache, and cost math.
Streaming and Backpressure
Token-by-token LLM streaming end to end: SSE vs WebSockets, partial JSON parsing, cancellation with AbortController, and where backpressure actually bites.
Function Calling and Tool Use
Tool use is typed RPC for LLMs: tool schemas, the call-result loop, parallel calls, tool_choice, OpenAI vs Anthropic differences, and failure modes.
Structured Output: JSON Mode and Schema Coercion
Reliable JSON from LLMs: JSON mode vs strict json_schema vs tool use vs retry-on-validate, with Instructor and the Vercel AI SDK in practice.
Context Engineering: JIT vs AOT Context Loading
Context as the scarcest resource in an LLM call: how AOT prepacking and JIT retrieval compose, and the OS prefetch-vs-demand-paging parallel.
RAG Evaluation: Recall, Faithfulness, and Answer Quality
Retrieval metrics, generation metrics, and the judge problem: how to evaluate a RAG pipeline end-to-end with recall@k, faithfulness, and Ragas.
Query Transformations: Rewriting, HyDE, and Multi-Query
The query-side preprocessing layer for RAG: how rewriting, HyDE, multi-query, decomposition, and step-back prompting trade cost for recall.
Reranking: Cross-Encoders and Cascades
Why cross-encoders dominate the precision stage of retrieval, when a reranker pays off, and how to compose cascades that respect the latency budget.
Hybrid Search: BM25 Meets Dense Vectors
Why dense retrieval misses rare terms and exact matches, how BM25 and embeddings fuse via RRF, and the hybrid patterns that ship in production.
Chunking Strategies for Retrieval
Why chunk size is RAG's most undertuned variable, how recursive, semantic, and structural chunking differ, and when parent-document retrieval wins.
Vector Databases & ANN Indexes
How HNSW, IVF, and ScaNN trade recall for speed, why exact KNN doesn't scale, and how to pick between pgvector, Qdrant, and Pinecone in production.
Text Embeddings: Turning Meaning into Geometry
How embedding models encode text as dense vectors, why cosine similarity captures meaning, and how to build semantic search in Python and TypeScript.
LLM Inference: Tokens, Context, and Sampling
How LLMs process text: BPE tokenization, the context window as working memory, KV caching, and sampling parameters that shape output variance.
Writing Event Loops with Java Virtual Threads
A practical guide to writing small event loops in Java 21 and Java 25 using virtual threads, blocking queues, direct control flow, and graceful shutdown.
Context vs Prompt Engineering: The Evolution from Instructions to Intelligence
The shift from prompt engineering to context engineering in AI systems: what context rot is, and why managing context is becoming more critical than crafting prompts.
StampedLock: How to Use Locks with Near Lock-Free Reads in Java
Learn how Java’s StampedLock enables near lock-free reads with optimistic locking, why it’s useful for virtual threads and read-heavy workloads, and how to use it safely.