
Text Embeddings: Turning Meaning into Geometry

How embedding models encode text as dense vectors, why cosine similarity measures semantic distance, and how to build semantic search in Python and TypeScript.


Embeddings are the lingua franca of modern AI systems. Every RAG pipeline, semantic cache, recommendation engine, and anomaly detector operating on text depends on one assumption: that meaning can be represented as a point in a high-dimensional space, and that similar meanings land near each other. If you are building systems that retrieve, compare, or cluster text at any scale, you cannot treat this layer as a black box.

From context windows to retrieval

In the previous article we established that an LLM’s context window is its working memory — a finite space into which all relevant information must fit before the model can reason. The natural follow-up question is: how do you decide which information to load? That is the retrieval problem. Embeddings are the primitive that makes retrieval semantic rather than lexical — they let you find documents by what they mean, not just which words they contain. Everything in the RAG pipeline sits on top of this layer.

What an embedding is

An embedding is a function f: text → ℝⁿ — it maps arbitrary text to a fixed-size vector of floating-point numbers. The model is trained so that semantically similar inputs produce geometrically proximate outputs.

“The server is unreachable” and “the host is not responding” → land near each other. “The server is unreachable” and “my cat knocked over the espresso machine” → land far apart.

The dimensionality n depends on the model: OpenAI’s text-embedding-3-small produces 1536-dimensional vectors; text-embedding-3-large produces 3072. These numbers are large enough to encode fine-grained semantic distinctions, small enough for practical indexing.

The geometry of meaning

In keyword search, you find documents by matching the exact words in the query. In semantic search, you ask for “distributed consensus protocols” and surface papers on Paxos, Raft, and ZooKeeper — none of which need contain that exact phrase.

Embeddings achieve this by mapping text into a vector space where the geometry encodes semantics. Instead of asking “are these strings identical?”, you ask “how small is the angle between their vectors?”

Cosine similarity is the standard metric:

sim(a, b) = (a · b) / (‖a‖ · ‖b‖)

It measures the cosine of the angle between two vectors, ignoring magnitude. Range is [-1, 1]: 1 means identical direction (maximum similarity), 0 means orthogonal (unrelated), -1 means opposite direction.

Why cosine over Euclidean distance? Because magnitude in embedding space does not encode semantic intensity — a longer document does not produce a “bigger” vector. L2-normalizing all vectors first makes cosine similarity equivalent to dot product, which vector databases optimize aggressively.
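
A quick numeric check of that equivalence, using toy vectors in place of real embeddings:

python
import numpy as np

# Toy vectors standing in for embeddings.
a = np.array([0.3, -1.2, 0.7])
b = np.array([0.1, -0.9, 1.1])

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# L2-normalize once up front...
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)

# ...and a plain dot product now gives the same number.
assert np.isclose(np.dot(a_hat, b_hat), cosine)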

The distributed systems parallel

Git and IPFS use content-addressable storage (CAS): each object is addressed by a cryptographic hash of its content. Same content → same address; one bit flipped → completely different address. The hash function is deliberately collision-resistant — similar inputs must produce distant hashes.

Embedding models invert the design goal. They are trained to act like locality-sensitive hashing (LSH): similar inputs should land close together, not far apart. LSH is used in database systems for approximate nearest-neighbor (ANN) queries — find all points within distance ε of a query without a full table scan.

The practical consequence: exact-match caching returns a hit only on identical queries. Semantic caching (e.g., GPTCache) uses embedding similarity to return cached responses for semantically equivalent queries — the infrastructure analogue of an LSH-bucketed cache.

Systems concept           Embedding analogue
CAS hash (SHA-256)        Token ID — exact, deterministic, brittle to changes
Locality-sensitive hash   Embedding — approximate, tolerant to paraphrase
ANN index (HNSW, IVF)     Vector database — nearest-neighbor search at scale
Exact-match cache key     Query string — misses on any paraphrase
Semantic cache key        Query embedding — hits on equivalent intent
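
To make the semantic-cache row concrete, here is a minimal in-memory sketch. It is an illustration, not GPTCache's actual API; the embed function and the 0.9 threshold are stand-ins you would replace and calibrate.

python
import numpy as np

class SemanticCache:
    """Toy semantic cache: the lookup key is a query embedding, not the query string."""

    def __init__(self, embed_fn, threshold: float = 0.9):
        self.embed_fn = embed_fn    # any function: str -> list[float]
        self.threshold = threshold  # calibrate against real traffic
        self.entries: list[tuple[np.ndarray, str]] = []

    def _normalize(self, text: str) -> np.ndarray:
        v = np.asarray(self.embed_fn(text), dtype=float)
        return v / np.linalg.norm(v)

    def get(self, query: str) -> str | None:
        q = self._normalize(query)
        for key, response in self.entries:
            if float(np.dot(q, key)) >= self.threshold:
                return response  # hit on a semantically equivalent query
        return None  # miss: call the LLM, then put() the result

    def put(self, query: str, response: str) -> None:
        self.entries.append((self._normalize(query), response))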

Mechanics: inside an embedding model

Most embedding models are encoder-only transformers — the same BERT architecture, not GPT-style decoder models. The distinction matters: encoder models process the full input with bidirectional attention in one forward pass (every token attends to every other token), rather than left-to-right autoregressive generation.

Steps:

  1. Tokenize the input using the same BPE tokenization described in the previous article.
  2. Run the full sequence through the transformer stack with bidirectional attention.
  3. Pool the per-token output hidden states into a single vector. Two common strategies:
    • [CLS] token pooling: a special classification token prepended to the input; its output representation summarizes the full sequence.
    • Mean pooling: average all token hidden states — generally preferred for retrieval tasks.
  4. L2-normalize the result so that dot product equals cosine similarity.

Output: a single fixed-size vector regardless of input length — from one word to the model’s full context limit.
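
Steps 3 and 4 are simple enough to sketch directly in numpy, assuming you already have per-token hidden states and an attention mask from the encoder:

python
import numpy as np

def mean_pool_and_normalize(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """hidden_states: (seq_len, hidden_dim). attention_mask: (seq_len,), 1 for real tokens, 0 for padding."""
    mask = attention_mask[:, None].astype(hidden_states.dtype)
    pooled = (hidden_states * mask).sum(axis=0) / mask.sum()  # average over real tokens only
    return pooled / np.linalg.norm(pooled)                    # L2-normalize so dot product equals cosine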

Matryoshka Representation Learning (MRL): OpenAI’s text-embedding-3 models support a dimensions parameter that lets you truncate the vector to a smaller size (e.g., 256 or 512) without re-running the model. The model is trained with a Matryoshka objective: the first d dimensions are themselves a valid embedding at dimension d. Halving dimensions roughly halves storage and ANN query cost with modest quality degradation — worth measuring for high-throughput applications.
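
A short sketch of the dimensions parameter; treat the 256 here as an example size to benchmark, not a recommendation:

python
from openai import OpenAI
import numpy as np

client = OpenAI()

# Ask the API for a truncated 256-dimensional embedding directly.
short = client.embeddings.create(
    model="text-embedding-3-small",
    input="The database is unreachable after a network partition.",
    dimensions=256,
).data[0].embedding
print(len(short))  # 256

# If you truncate a full-size vector yourself instead, re-normalize it
# afterwards so dot products still behave like cosine similarity.
full = np.array(
    client.embeddings.create(
        model="text-embedding-3-small",
        input="The database is unreachable after a network partition.",
    ).data[0].embedding
)
truncated = full[:256]
truncated /= np.linalg.norm(truncated)

The full end-to-end example below builds a tiny incident-log corpus, embeds it in one batch call, and ranks it against a natural-language query: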

python
# pip install openai numpy
from openai import OpenAI
import numpy as np

client = OpenAI()  # reads OPENAI_API_KEY from environment

def embed(text: str, model: str = "text-embedding-3-small") -> list[float]:
    response = client.embeddings.create(model=model, input=text)
    return response.data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    va, vb = np.array(a), np.array(b)
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

# --- build a tiny corpus ---
documents = [
    "The database is unreachable after a network partition.",
    "Deployment failed because the health check timed out.",
    "Users are seeing 503 errors on the checkout page.",
    "Memory usage climbed to 98% before the OOM kill.",
    "A new release was tagged and pushed to the registry.",
]

# embed in one batch call — far cheaper than N single calls
batch_response = client.embeddings.create(
    model="text-embedding-3-small",
    input=documents,
)
doc_embeddings = [item.embedding for item in batch_response.data]

# --- query ---
query = "service is down and not responding to requests"
query_embedding = embed(query)

scores = [
    (doc, cosine_similarity(query_embedding, emb))
    for doc, emb in zip(documents, doc_embeddings)
]
scores.sort(key=lambda x: x[1], reverse=True)

for doc, score in scores:
    print(f"{score:.3f}  {doc}")
# 0.521  The database is unreachable after a network partition.
# 0.461  Users are seeing 503 errors on the checkout page.
# 0.449  Deployment failed because the health check timed out.
# ...

typescript
// npm install openai
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from environment

async function embed(
  text: string,
  model = "text-embedding-3-small"
): Promise<number[]> {
  const response = await client.embeddings.create({ model, input: text });
  return response.data[0].embedding;
}

function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, ai, i) => sum + ai * b[i], 0);
  const normA = Math.sqrt(a.reduce((sum, ai) => sum + ai * ai, 0));
  const normB = Math.sqrt(b.reduce((sum, bi) => sum + bi * bi, 0));
  return dot / (normA * normB);
}

// --- build a tiny corpus ---
const documents = [
  "The database is unreachable after a network partition.",
  "Deployment failed because the health check timed out.",
  "Users are seeing 503 errors on the checkout page.",
  "Memory usage climbed to 98% before the OOM kill.",
  "A new release was tagged and pushed to the registry.",
];

// embed all documents in one batch call
const batchResponse = await client.embeddings.create({
  model: "text-embedding-3-small",
  input: documents,
});
const docEmbeddings = batchResponse.data.map((item) => item.embedding);

// --- query ---
const queryEmbedding = await embed(
  "service is down and not responding to requests"
);

const scores = documents.map((doc, i) => ({
  doc,
  score: cosineSimilarity(queryEmbedding, docEmbeddings[i]),
}));
scores.sort((a, b) => b.score - a.score);
scores.forEach(({ doc, score }) =>
  console.log(`${score.toFixed(3)}  ${doc}`)
);

The OpenAI Node.js SDK and OpenAI Python SDK both support batch embedding via an array input — always prefer batching over per-document API calls to reduce latency and cost.

For local embedding without API calls, sentence-transformers (pip install sentence-transformers) provides a large catalogue of models compatible with the same cosine similarity workflow. The tradeoff: you need GPU memory to run them at speed, and model quality varies significantly by task. Check the MTEB leaderboard for task-specific rankings.
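
A minimal local version of the same workflow, using one small, widely used model as an example (swap in whatever the MTEB rankings suggest for your task):

python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model; runs on CPU, faster on GPU

documents = [
    "The database is unreachable after a network partition.",
    "A new release was tagged and pushed to the registry.",
]
query = "service is down and not responding to requests"

# normalize_embeddings=True makes a plain dot product equal cosine similarity
doc_vecs = model.encode(documents, normalize_embeddings=True)
query_vec = model.encode(query, normalize_embeddings=True)

for doc, score in sorted(zip(documents, doc_vecs @ query_vec), key=lambda x: -x[1]):
    print(f"{score:.3f}  {doc}")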

Trade-offs, failure modes, gotchas

Context length limits mean silent truncation. text-embedding-3-small has an 8191-token limit. Text beyond that is silently dropped — you get an embedding for a truncated document, with no error or warning. For documents longer than ~6k tokens, chunk first, embed each chunk separately, and store chunk-level embeddings. The retrieval granularity should match the context unit you intend to inject into the LLM.
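
A sketch of token-based chunking with tiktoken; the 512-token window and 64-token overlap are illustrative defaults, not a recommendation from this article:

python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by the text-embedding-3-* models

def chunk_by_tokens(text: str, max_tokens: int = 512, overlap: int = 64) -> list[str]:
    """Split text into overlapping token windows sized for chunk-level embedding."""
    tokens = enc.encode(text)
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + max_tokens]
        chunks.append(enc.decode(window))
        if start + max_tokens >= len(tokens):
            break
    return chunks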

Asymmetric vs. symmetric retrieval. An embedding model optimized for retrieving long documents from short queries (asymmetric) performs differently from one optimized for comparing sentence pairs (symmetric). Using a symmetric model for asymmetric retrieval degrades recall measurably. Check the MTEB “Retrieval” category for models suited to your use case, not just overall leaderboard rank.

Embedding drift on model upgrades. Stored vectors and query vectors must come from the same model version. Switching to a newer model invalidates every existing vector — cosine similarity across different model versions is meaningless. Version-lock your embedding model in production. If you must migrate, re-embed the full corpus in one operation and swap atomically.

Cosine similarity scores have no calibrated absolute meaning. A score of 0.85 is not “85% similar” in any interpretable sense. Thresholds are dataset- and model-specific. Calibrate against labeled pairs from your actual domain before setting cutoffs. A common mistake: copying a threshold from a paper or blog post written for a different dataset and domain.
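
One way to calibrate, sketched here: score a set of labeled pairs from your own domain, then sweep candidate thresholds and keep the one with the best F1.

python
import numpy as np

def pick_threshold(scores: np.ndarray, labels: np.ndarray) -> float:
    """scores: cosine similarities for labeled pairs; labels: True where the pair really is a match."""
    best_t, best_f1 = 0.0, 0.0
    for t in np.linspace(0.0, 1.0, 101):
        pred = scores >= t
        tp = np.sum(pred & labels)
        precision = tp / max(pred.sum(), 1)
        recall = tp / max(labels.sum(), 1)
        f1 = 2 * precision * recall / max(precision + recall, 1e-9)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t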

Batch everything. Single-document API calls accumulate per-request overhead fast. The batch form (input: [list of strings]) runs at nearly the same latency for 10 documents as for 1, and most providers charge identical per-token rates regardless. There is almost no reason to embed documents one at a time.

Expensive at scale. Generating embeddings costs API calls; storing and searching millions of vectors requires a dedicated ANN index. Both are solvable — batch embedding, managed vector databases (Pinecone, Weaviate, pgvector, Qdrant) — but neither is free. Profile before over-building, and keep the corpus size in mind when choosing dimensionality.

Further reading

  • Lilian Weng — Learning Word Embedding — dense treatment of word2vec, GloVe, and the linear algebraic structure of learned semantic spaces. The algebra behind “king − man + woman ≈ queen” is explained rigorously here.
  • Jay Alammar — The Illustrated Word2Vec — the most intuitive visual walk-through of how vectors acquire semantic relationships during training.
  • OpenAI Embeddings Guide — official documentation covering model comparison, use-case recommendations, the MRL dimensions parameter, and frequently asked questions from production deployments.
  • Chip Huyen — Building LLM Applications for Production — a survey of production LLM infrastructure, with substantial discussion of retrieval, embedding serving, and the cost model that shapes architectural decisions.