// ls categories/ai-engineering/

AI Engineering

59 entries

last write 2026-05-26

2026 (59 entries)

2026-05-26
Quantization and Distillation: Compression for Inference
How big models get cheap: the math behind INT8/INT4/FP8 quantization, GPTQ/AWQ/SmoothQuant, soft-target distillation, and the 2026 production stack.
ai-engineering
25 min
2026-05-26
LoRA and Parameter-Efficient Fine-Tuning
LoRA, QLoRA, DoRA, and the PEFT stack in 2026: the math, the production defaults (rank, alpha, target modules), and the multi-tenant serving pattern.
ai-engineering
23 min
2026-05-26
DPO and Modern Alignment
DPO derivation, the IPO/KTO/ORPO/SimPO variants, the 2026 production stack, and the failure modes (length bias, distribution shift, mode collapse).
ai-engineering
21 min
2026-05-26
From Pre-Training to RLHF
The three-stage LLM training pipeline — pretraining, SFT, preference optimization — what each step changes, why each exists, and the 2026 reality.
ai-engineering
18 min
2026-05-26
Agent Budgets and Runaway Prevention
Step caps, deadlines, token and dollar ceilings, oscillation detection — the OS and distributed-systems primitives every agent harness ports.
ai-engineering
32 min
2026-05-26
PII Detection and Data Privacy for LLM Systems
PII detection and data residency for LLM systems: Presidio cascades, OpenAI Privacy Filter, GDPR deletion pipelines, EU residency, and on-device inference.
ai-engineering
29 min
2026-05-26
Guardrails: Input and Output Safety Layers for LLM Systems
Input and output guardrails for LLM apps: prompt-injection defense, Llama Guard 4, NeMo Guardrails, LlamaFirewall, and the WAF defense-in-depth parallel.
ai-engineering
30 min
2026-05-26
Fine-Tuning vs RAG: When to Choose Which
A decision tree for fine-tuning vs RAG: what each tool actually changes, the cost model, where each fails, and why most 2026 production stacks ship both.
ai-engineering
27 min
2026-05-26
Cost Optimization and Model Routing
Tiered model routing, cascades, and learned routers — RouteLLM, Martian, NotDiamond, OpenRouter, LiteLLM — plus the cost math that tells you when to route.
ai-engineering
24 min
2026-05-26
Speculative Decoding and Draft Models
Draft-and-verify decoding: how speculative sampling, Medusa, EAGLE-3, and ngram methods turn one forward pass into many tokens — and when it pays.
ai-engineering
25 min
2026-05-26
Inference Latency: Prefill, Decode, and Batching
Inside the inference server: prefill vs decode, continuous batching, chunked prefill, prefill/decode disaggregation, TTFT/TPOT — and the dials.
ai-engineering
25 min
2026-05-19
Human-in-the-Loop Feedback Loops for LLM Systems
Turning thumbs, edits, and re-rolls into a data flywheel: capturing user feedback, sampling traces for review, label hygiene, and selective annotation.
ai-engineering
28 min
2026-05-19
Drift Detection and Regression Testing for LLM Systems
Detecting input and output distribution shift in LLM apps, plus the regression-testing protocol for model upgrades: shadow runs, canaries, judge replays.
ai-engineering
29 min
2026-05-19
Production Tracing and Observability for LLM Systems
Distributed tracing for LLM apps in 2026: span shape, OTel GenAI semantics, OpenInference, sampling, and the LangSmith/Langfuse/Phoenix decision.
ai-engineering
25 min
2026-05-19
LLM-as-Judge: Pointwise and Pairwise
How LLM-as-judge works in production: rubrics, pointwise vs pairwise, position/verbosity/self-preference bias, and how to calibrate against humans.
ai-engineering
22 min
2026-05-19
Eval-Driven Development for LLM Systems
Why evals replace unit tests for LLM systems: error-analysis-first workflow, golden sets, the test pyramid, and CI-gate harnesses in Python and TS.
ai-engineering
20 min
2026-05-19
Production Memory Frameworks: MemGPT/Letta, mem0, Zep, Graphiti
MemGPT/Letta, mem0, Zep, and Graphiti compared on architecture, write/read paths, benchmarks, and the build-versus-buy decision for production memory.
ai-engineering
14 min
2026-05-19
Memory Evaluation: Benchmarks and Custom Evals
Memory evaluation for agents: LoCoMo and LongMemEval, multi-hop recall, contradiction handling, and how to design a custom eval that catches drift.
ai-engineering
17 min
2026-05-19
Conversation Compaction: Keeping Long Sessions Alive
Conversation compaction in long agent sessions: reactive vs preemptive triggers, cache-aware deletion, circuit breakers, snapshot-rollback, journals.
ai-engineering
23 min
2026-05-19
Anatomy of an Agent Harness
Inside the agent harness: context assembly, tool dispatch, streaming, cache management, error recovery, cost accounting, telemetry — and build-vs-buy.
ai-engineering
18 min
2026-05-19
Long-Horizon Task Reliability
Drift, checkpointing, and recovery in long-running agents: the distributed-saga parallel, when to abort, and the METR doubling curve.
ai-engineering
19 min
2026-05-19
Computer Use and Browser Agents
Screenshot-driven and DOM-driven agents in 2026: action grounding, accessibility tree vs pixel input, sandboxing, prompt injection, OSWorld.
ai-engineering
23 min
2026-05-19
Memory Privacy, Isolation, and Multi-Tenancy
Per-tenant memory isolation for LLM agents: namespace discipline, cross-tenant leak modes, prompt-injection-via-memory, and verifiable GDPR deletion.
ai-engineering
37 min
2026-05-18
Multi-Agent Shared Memory
Shared memory across LLM agents: scoping rules, consistency models, blackboard vs shared-block vs cross-thread store patterns, and the split-brain bugs.
ai-engineering
34 min
2026-05-18
Cross-Session Identity and Personalization
Cross-session identity for LLM agents: user profiles, personas, the cold-start staircase, sensitivity-gated writes, and the deletion path.
ai-engineering
32 min
2026-05-18
Procedural Memory and Skill Caching
Procedural memory for AI agents: caching successful action sequences as a JIT-compiled-routine store. Voyager, AWM, LangMem, Agent Skills.
ai-engineering
40 min
2026-05-18
Memory Conflict, Forgetting, and Embedding Drift
Three failures of agent memory at scale: contradiction handling, active forgetting, and embedding drift — with worked patterns and code.
ai-engineering
36 min
2026-05-18
Temporal Reasoning and Memory Provenance
Temporal reasoning and provenance in agent memory: as-of queries, bi-temporal validity, dated claims, staleness gates, and per-fact source audit trails.
ai-engineering
35 min
2026-05-18
Memory Retrieval Policies: Recency, Relevance, Importance
Memory retrieval policies: the recency-relevance-importance rerank, exponential decay, read-time boosts, and the LRU/LFU/ARC cache parallel for agents.
ai-engineering
34 min
2026-05-18
Sleep-Time Compute and Memory Consolidation
Sleep-time compute for AI agents: background consolidation, the VACUUM parallel, Letta's sleep-time agents, Claude Code's auto-dream, and the cost math.
ai-engineering
32 min
2026-05-18
Summarization and Context Compression
Context compression for LLM agents: recursive summarization, structured note-taking, measuring quality loss, and the log-compaction parallel.
ai-engineering
28 min
2026-05-18
Reflection: From Experiences to Beliefs
Memory reflection: write-time enrichment that turns raw episodes into higher-order beliefs, the Generative Agents reflection loop, and its failure modes.
ai-engineering
32 min
2026-05-18
Episode Segmentation and Salience Scoring
Episode segmentation and salience scoring: prediction-error and topic-shift boundaries, anchored 1-10 importance, the event-sourcing aggregate parallel.
ai-engineering
33 min
2026-05-18
Memory Write Policies: What's Worth Remembering
Memory write policies: distillation, write amplification, the journaling-vs-checkpoint trade-off, learned classifiers, and admission control for agents.
ai-engineering
34 min
2026-05-18
Hierarchical Memory: Working / Episodic / Semantic Tiers
Hierarchical memory: MemGPT/Letta's three-tier OS-paging model, what lives in core/recall/archival, and the promotion-demotion policies that bind them.
ai-engineering
29 min
2026-05-18
Knowledge Graphs as Structured Memory
When graphs beat vectors as memory: entities, relations, bi-temporal validity, Graphiti/Zep/Mem0g patterns, and hybrid graph+vector retrieval.
ai-engineering
27 min
2026-05-18
Long-Term Memory: Vector-Backed Episodic Storage
Long-term episodic memory: vector-backed storage, episode boundaries, recency-weighted retrieval, the WAL parallel, and the unit-of-recall problem.
ai-engineering
26 min
2026-05-18
Working Memory: Scratchpads, Blackboards, and Agent Notebooks
Working memory for agents: scratchpads, blackboards, notebooks, and dataflow state — the in-context surface that sits above the conversation buffer.
ai-engineering
20 min
2026-05-18
Short-Term Memory: Managing the Conversation Buffer
Truncation policies for the LLM conversation buffer: sliding windows, token-level vs message-level eviction, system-prompt protection, headroom budgeting.
ai-engineering
23 min
2026-05-18
The Cognitive Taxonomy: Semantic, Episodic, Procedural
A close read of the four cognitive memory types — working, episodic, semantic, procedural — and the CPU cache hierarchy each one maps onto.
ai-engineering
31 min
2026-05-18
The Memory Stack: A Map of AI Memory
A map of AI agent memory: in-context vs storage, the four cognitive types, the write/read/maintain axes, and why memory isn't RAG with a longer leash.
ai-engineering
24 min
2026-05-18
Tool Selection at Scale: MCP and Dynamic Routing
Why tool selection collapses past 30 tools, and how MCP, lazy loading, and retrieval keep accuracy high across thousands of tools without context bloat.
ai-engineering
22 min
2026-05-18
Multi-Agent Orchestration
Supervisor, swarm, and hierarchical multi-agent patterns: the A2A protocol, split-brain failure modes, the 15x token tax, and when not to reach for it.
ai-engineering
20 min
2026-05-18
Planning Agents vs Reactive Agents
When to plan ahead vs react step by step: ReAct vs plan-and-execute vs Tree-of-Thoughts, the cost of replanning, and the speculative-execution parallel.
ai-engineering
21 min
2026-05-18
The Agent Loop: ReAct and Its Descendants
How the agent loop actually works: ReAct's thought/action/observation cycle, plan-and-execute, stopping conditions, and the leader-election parallel.
ai-engineering
22 min
2026-05-18
Constrained Decoding: Grammars, Regex, and FSMs
How constrained decoding works: vocabulary masking, FSMs and pushdown automata, GBNF grammars, XGrammar/Outlines/llama.cpp, and the format tax.
ai-engineering
17 min
2026-05-18
Prompt Caching: Reusing the KV Cache Across Calls
How prompt caching reuses the KV cache across API calls: Anthropic breakpoints, OpenAI's automatic prefix cache, Gemini context cache, and cost math.
ai-engineering
17 min
2026-05-18
Streaming and Backpressure
Token-by-token LLM streaming end to end: SSE vs WebSockets, partial JSON parsing, cancellation with AbortController, and where backpressure actually bites.
ai-engineering
16 min
2026-05-18
Function Calling and Tool Use
Tool use is typed RPC for LLMs: tool schemas, the call-result loop, parallel calls, tool_choice, OpenAI vs Anthropic differences, and failure modes.
ai-engineering
14 min
2026-05-17
Structured Output: JSON Mode and Schema Coercion
Reliable JSON from LLMs: JSON mode vs strict json_schema vs tool use vs retry-on-validate, with Instructor and the Vercel AI SDK in practice.
ai-engineering
13 min
2026-05-17
Context Engineering: JIT vs AOT Context Loading
Context as the scarcest resource in an LLM call: how AOT prepacking and JIT retrieval compose, and the OS prefetch-vs-demand-paging parallel.
ai-engineering
15 min
2026-05-17
RAG Evaluation: Recall, Faithfulness, and Answer Quality
Retrieval metrics, generation metrics, and the judge problem: how to evaluate a RAG pipeline end-to-end with recall@k, faithfulness, and Ragas.
ai-engineering
17 min
2026-05-16
Query Transformations: Rewriting, HyDE, and Multi-Query
The query-side preprocessing layer for RAG: how rewriting, HyDE, multi-query, decomposition, and step-back prompting trade cost for recall.
ai-engineering
13 min
2026-05-14
Reranking: Cross-Encoders and Cascades
Why cross-encoders dominate the precision stage of retrieval, when a reranker pays off, and how to compose cascades that respect the latency budget.
ai-engineering
13 min
2026-05-13
Hybrid Search: BM25 Meets Dense Vectors
Why dense retrieval misses rare terms and exact matches, how BM25 and embeddings fuse via RRF, and the hybrid patterns that ship in production.
ai-engineering
12 min
2026-05-12
Chunking Strategies for Retrieval
Why chunk size is RAG's most undertuned variable, how recursive, semantic, and structural chunking differ, and when parent-document retrieval wins.
ai-engineering
12 min
2026-05-11
Vector Databases & ANN Indexes
How HNSW, IVF, and ScaNN trade recall for speed, why exact KNN doesn't scale, and how to pick between pgvector, Qdrant, and Pinecone in production.
ai-engineering
12 min
2026-05-11
Text Embeddings: Turning Meaning into Geometry
How embedding models encode text as dense vectors, why cosine similarity captures meaning, and how to build semantic search in Python and TypeScript.
ai-engineering
10 min
2026-05-11
LLM Inference: Tokens, Context, and Sampling
How LLMs process text: BPE tokenization, the context window as working memory, KV caching, and sampling parameters that shape output variance.
ai-engineering
10 min
