$ cat ai-engineering/constrained-decoding.md

Constrained Decoding: Grammars, Regex, and FSMs

How constrained decoding works: vocabulary masking, FSMs and pushdown automata, GBNF grammars, XGrammar/Outlines/llama.cpp, and the format tax.

Jatin Bansal@blog:~/ai-engineering$ open constrained-decoding

A team replaces their retry-on-validate JSON pipeline with strict schema-constrained decoding on a self-hosted Qwen model. Validation failures fall from 4% to 0%. Two weeks later a downstream analyst notices answer quality has slipped — the model is choosing safer, blander values for an open-ended summary field, and a sibling reasoning field that used to contain genuinely useful step-by-step prose now reads like a robot reciting a checklist. Nothing in the schema changed. Nothing in the prompt changed. What changed is that every decode step is now sampling from a masked distribution, and on the borderline tokens where natural-language thinking lives, the mask is biting. Welcome to constrained decoding: 100% structural correctness on one axis, a quietly different distribution on the other.

Opening bridge

Yesterday’s piece on prompt caching was about reusing the prefix of inference work — the K/V tensors computed during prefill — across calls. Today’s piece is about the opposite end of the same loop: what the decoder is allowed to emit, token by token, after prefill finishes. The structured-output article earlier in this subtree introduced the idea — “sampling with a vocabulary mask that changes every step” — and used it through the high-level wrappers (Instructor, generateObject, strict json_schema mode). Today we open the box: what’s the mask made of, how is it compiled, where do the costs sit, and when does the constraint hurt the model’s reasoning?

What constrained decoding actually is

Constrained decoding is sampling with a per-step vocabulary mask. At every decode step, before the sampler picks a token, an external automaton — usually a finite-state machine for regex/JSON schema, or a pushdown automaton for full context-free grammars — is consulted: “given the tokens we’ve emitted so far, which of the 200,000 tokens in the vocabulary are legal next?” Every illegal token has its logit set to -inf, the softmax assigns it zero probability, and the sampler can no longer pick it.

That single mechanism — logits := logits + mask where mask[i] ∈ {0, -inf} — covers everything in this article. Logit biasing is the special case where the mask is constant across decode steps. JSON Schema is the case where the mask is computed from an FSM over a regex derived from the schema. A full grammar (GBNF, EBNF, a Lark spec) is the case where the mask is computed from a pushdown automaton walking the grammar’s productions. The shapes differ; the primitive is identical.

The LLM inference fundamentals article framed sampling as softmax(logits / T) with optional top-k or top-p truncation. Constrained decoding is that same step with a third truncation: a structural one applied before temperature kicks in. The model still chooses among legal tokens by probability — constrained decoding does not select for you, it only forbids the impossible.

Three layers of constraint, from weakest to strongest

Logit bias (constant mask). The oldest and weakest form. You pass a token-ID-to-bias map; the provider adds the bias to the matching logits on every step. Set bias to -100 to effectively forbid a token, +10 to strongly encourage one. OpenAI exposes logit_bias on the chat and completions APIs. Anthropic does not. The mechanism is too coarse to enforce structure — you can ban “I cannot” or force the model toward “yes” or “no”, but you cannot express “must be a valid date” because that requires the mask to change depending on what’s been emitted so far. Logit bias is still useful: profanity blocklists, A/B-test answer keys, banning a known hallucination pattern. It is not constrained decoding in the structural sense; it is structural decoding’s degenerate one-state case.

FSM-based constraint (regex / JSON Schema). A regular expression compiles to a deterministic finite automaton: a finite set of states, a transition function state × character → state, an accept set. JSON Schema in strict mode (objects with fixed properties, enums, format validators, primitive types) is regular too — you can compile a strict schema to a single regex and from there to an FSM. The trick that makes this fast at LLM serving time is a precomputed index: for every state of the FSM and every token in the vocabulary, you precompute “if I’m in state s and I emit token t, am I still in the language? and what state do I land in?” Store the result as a sparse matrix and the per-token cost becomes a constant-time lookup plus a vocabulary-sized mask write. Outlines was the canonical implementation of this idea; the Rust core outlines-core ships the FSM compiler with worst-case microsecond-per-token overhead.

Pushdown automaton (context-free grammars). Regex isn’t powerful enough for full JSON (nested objects of arbitrary depth) or any real programming-language grammar. Context-free grammars are. The corresponding machine is a pushdown automaton: an FSM plus a stack. To match { "a": { "b": [...] } } the automaton pushes a frame for each { or [ and pops it on the matching } or ]. Compiling a grammar to a PDA and then masking the vocabulary per step is the technique XGrammar pioneered, now integrated into vLLM, SGLang, and TensorRT-LLM. llama.cpp’s GBNF grammars and guidance-ai’s llguidance implement the same primitive with different syntax. The “G” in GBNF is “GGML”; the BNF is real BNF — same notation that defines C and SQL grammars. You can constrain a model’s output to literally any context-free language. That’s the killer feature: SQL queries, valid Python ASTs, formal proof languages, custom DSLs.

The relationship between these three is a Chomsky hierarchy: every regular language is context-free, every context-free language permits pushdown recognition. Pick the cheapest tool that expresses your constraint. Don’t reach for a grammar to enforce ["USD", "EUR", "GBP"] — an FSM-backed enum is faster and trivially equivalent.

The distributed-systems parallel

The closest parallel is a firewall sitting between two services: the model is the upstream producer, the sampler/runtime is the downstream consumer, and the constraint engine is a stateful packet filter in the middle that drops anything not matching its rule set. Like a firewall, it has two failure modes — false positives (legitimate output the constraint rejects, forcing the model into less-natural continuations) and state explosion (rules so expressive the matcher can’t keep up at line rate). Both translate directly to constrained decoding: an over-narrow schema is a too-aggressive firewall rule; an unbounded grammar is a regex that backtracks forever.

The other parallel is a Coordinator Transactions style approach in distributed databases — except inverted. A 2PC coordinator vetoes a transaction after the participants have prepared; constrained decoding vetoes a token before it is sampled. It’s “schema-on-write at the wire format,” extended to a wire format that’s being generated token by token. The structured-output article walked through the Protobuf/Avro analogy at the API boundary. Constrained decoding is the lower stratum: enforcement at the byte stream, not the deserialized object.

Mechanics: how a JSON Schema becomes a vocabulary mask

Walk through one concrete pipeline end to end.

Schema → regex (or pseudo-regex). A strict JSON Schema is compiled into a structural regex: \{ followed by "name": followed by [A-Za-z0-9 ]{1,32} followed by ,"age": followed by \d{1,3} followed by \}. Real implementations don’t emit literal regex strings; they emit an intermediate AST that gets compiled to an FSM directly. But thinking of it as a regex is the right mental model.
Regex → FSM. Standard Thompson/Glushkov construction. Each character class becomes a transition; each {n,m} repetition becomes a counter; the final state is the accept set.
FSM × vocabulary → mask index. For every state s in the FSM and every token t in the vocabulary, compute whether the byte sequence of t can be appended to a string that ends in state s and still potentially complete to a string in the language. If yes, store the resulting next-state; if no, mark (s, t) as forbidden. This is the expensive precomputation step — for a 100k-token vocabulary and a moderately complex schema, it takes 50–500ms on first compile. Most production stacks cache the result keyed on (schema_hash, tokenizer_hash).
At decode time. The runtime tracks the current FSM state. Before sampling, it looks up the row for that state in the mask index and adds -inf to every forbidden token’s logit. The sampler picks a legal token; the runtime advances the FSM state; repeat until the accept state is reached or max_tokens exhausts.

The whole machine runs entirely on the host side of the inference loop — no model retraining, no architectural change, no quality cost from the structural constraint itself (we’ll get to the behavioral cost later). XGrammar’s contribution was figuring out how to do the per-step PDA computation in <40 microseconds even for grammars with hundreds of productions; its trick is overlapping the mask computation with the GPU’s logit computation so the structural overhead becomes invisible at the throughput level.

Code: Python with Outlines on a self-hosted model

Outlines is the workhorse on the Python side for local-model constrained decoding. It supports JSON Schema, regex, and context-free grammars (EBNF / Lark) through a unified surface. Install: pip install outlines transformers torch.

python

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
# pip install outlines transformers torch
import outlines
from transformers import AutoModelForCausalLM, AutoTokenizer
from pydantic import BaseModel, Field
from typing import Literal

MODEL_NAME = "Qwen/Qwen3-4B-Instruct"

model = outlines.from_transformers(
    AutoModelForCausalLM.from_pretrained(MODEL_NAME),
    AutoTokenizer.from_pretrained(MODEL_NAME),
)

# 1) JSON Schema constraint via Pydantic
class Triage(BaseModel):
    category: Literal["billing", "technical", "account", "other"]
    severity: Literal["low", "medium", "high", "critical"]
    needs_human: bool
    summary: str = Field(min_length=10, max_length=200)

prompt = "Customer email: 'My card was charged twice for the same order #4421. Refund please.'"
result = model(prompt, Triage)
print(result.model_dump())  # guaranteed-conforming Triage object

# 2) Regex constraint — phone number in a specific format
phone_regex = r"\+\d{1,3}-\d{3}-\d{3}-\d{4}"
prompt = "Extract the phone number from: 'Call us at +1-415-555-2671'"
phone = model(prompt, outlines.Regex(phone_regex))
print(phone)  # +1-415-555-2671 — pattern is enforced at decode time

# 3) Context-free grammar — a tiny arithmetic-expression language
arithmetic_grammar = """
?start: expr
?expr: term (("+" | "-") term)*
?term: factor (("*" | "/") factor)*
?factor: NUMBER | "(" expr ")"
NUMBER: /[0-9]+/
%ignore " "
"""
prompt = "Write an arithmetic expression that evaluates to about 100, using only +,-,*,/,(),digits:"
expr = model(prompt, outlines.CFG(arithmetic_grammar))
print(expr)  # always parses as a valid expression in our grammar

Three patterns worth flagging. The Pydantic path is the 80% case — what was previously “ask nicely, parse, retry on failure” is now structurally guaranteed. The Regex path is useful for things JSON Schema can’t express cleanly (phone-number patterns, ISBN check digits, currency strings). The CFG path is the escape hatch into arbitrary languages — give it a SQL grammar and you have a model that emits only syntactically valid SQL, no matter how the prompt is phrased.

A small wrinkle the Outlines docs call out: the model’s reasoning quality tends to be best when the schema is loose where reasoning happens and tight where structure matters. If you’re using a summary field that the model needs to think into, don’t max_length: 30 it — give it room to breathe. The format tax (below) is real but mostly avoidable.

Code: TypeScript with vLLM’s OpenAI-compatible server

Production deployments typically expose constrained decoding through vLLM’s structured outputs, which speaks the OpenAI Chat Completions wire format with xgrammar as the default backend. The client side stays in the OpenAI SDK; the constraint goes in the extra_body. Install: npm install openai zod zod-to-json-schema.

typescript

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
// npm install openai zod zod-to-json-schema
import OpenAI from "openai";
import { z } from "zod";
import { zodToJsonSchema } from "zod-to-json-schema";

// vLLM running locally with --model Qwen/Qwen3-4B-Instruct
const client = new OpenAI({
  baseURL: "http://localhost:8000/v1",
  apiKey: "EMPTY",
});

// --- JSON-schema constraint (xgrammar backend) ---
const Triage = z.object({
  category: z.enum(["billing", "technical", "account", "other"]),
  severity: z.enum(["low", "medium", "high", "critical"]),
  needs_human: z.boolean(),
  summary: z.string().min(10).max(200),
});

async function triageEmail(email: string) {
  const resp = await client.chat.completions.create({
    model: "Qwen/Qwen3-4B-Instruct",
    messages: [{ role: "user", content: `Triage this email: ${email}` }],
    // vLLM-specific: pass the constraint via extra_body
    // (the SDK forwards unknown fields to the server)
    // @ts-expect-error: vLLM extension
    extra_body: {
      guided_json: zodToJsonSchema(Triage),
      guided_decoding_backend: "xgrammar",
    },
  });
  return Triage.parse(JSON.parse(resp.choices[0].message.content ?? "{}"));
}

// --- Grammar constraint: a SQL SELECT subset ---
const SQL_GRAMMAR = `
start: select_stmt
select_stmt: "SELECT" columns "FROM" table where? ";"
columns: column ("," column)*
column: /[a-z_]+/
table: /[a-z_]+/
where: "WHERE" condition
condition: column op value
op: "=" | "!=" | ">" | "<"
value: /'[^']*'/ | /[0-9]+/
%ignore " "
`;

async function nlToSql(question: string, schema: string) {
  const resp = await client.chat.completions.create({
    model: "Qwen/Qwen3-4B-Instruct",
    messages: [
      {
        role: "system",
        content: `Translate natural language to SQL.\nSchema:\n${schema}`,
      },
      { role: "user", content: question },
    ],
    // @ts-expect-error: vLLM extension
    extra_body: {
      guided_grammar: SQL_GRAMMAR,
      guided_decoding_backend: "xgrammar",
    },
  });
  return resp.choices[0].message.content;
}

Two operational notes. First, vLLM has been migrating from the legacy guided_* keys to a unified structured_outputs field — check the current docs for whichever version you’re running. Both backends (xgrammar and guidance) accept the same schema/grammar inputs. Second, the SQL grammar above is a toy; production text-to-SQL systems pair grammar constraints with a database-schema-aware FSM that knows valid column and table identifiers, so the model can’t hallucinate a column name that compiles. That’s the killer combination: structurally valid syntax and semantically valid identifiers, both guaranteed before the SQL ever hits your database.

Provider-side constrained decoding vs running it yourself

The cloud frontier providers expose constrained decoding through the strict-schema modes covered in the structured-output article — OpenAI’s response_format: {type: "json_schema", strict: true}, Anthropic’s strict tool use, Gemini’s response_schema. These are FSM-based JSON Schema constraints behind a managed API; you don’t see the FSM, you don’t get to pick the backend, you can’t supply a custom grammar. They cover the 80% case (typed objects, enums, structured payloads) and are the right default whenever you’re calling a hosted model.

You drop down to a self-hosted stack — Outlines, vLLM + XGrammar, llama.cpp + GBNF, llguidance — when you need one of:

Arbitrary CFGs. SQL, custom DSLs, programming-language ASTs. The hosted strict modes only accept JSON Schema.
Open-weights models. Llama, Qwen, Mistral, DeepSeek. The model lives on your hardware; the constraint engine lives next to it.
Cost-sensitive bulk extraction. A 100M-document extraction pass against a 4B-parameter model on your own GPUs can be an order of magnitude cheaper than the same workload through a frontier API, and constrained decoding lets you eat the quality gap on structure.
Determinism and reproducibility. Frontier providers occasionally update their strict-mode compilers; self-hosting pins the constraint behavior to a specific software version, useful for compliance-relevant pipelines.

The line between the two is moving fast. As of mid-2026 XGrammar-2 has been adopted by leading frontier labs in their products, so the engine behind “strict mode” on hosted APIs is increasingly the same code you’d run yourself. The decision boundary is operational (hosted vs self-hosted), not algorithmic.

Trade-offs, failure modes, gotchas

The format tax is real but model-dependent. Recent work, including the “Format Tax” study and “The Hidden Cost of Structure” at RANLP 2025, breaks structured-output degradation into two components — prompt-level (the model behaves differently when it knows it must be structured) and decoder-level (constrained sampling pushes the distribution off-manifold). The decoder-level cost is usually small on frontier models (Claude Haiku, GPT-5.4-nano, Grok 4.1-fast all show near-zero format tax in those studies); open-weights models can show meaningfully worse reasoning under strict constraints. The mitigation, repeated from the structured-output article but worth restating: leave space for the model to think. An unconstrained reasoning field before the structured fields, or the CRANE pattern of alternating constrained/unconstrained windows, recovers most of the lost quality.

Tokenization seams break naïve constraints. A regex like \d{3}-\d{4} looks crisp. But the tokenizer doesn’t know about your regex — it emits tokens whose byte boundaries don’t align to your character boundaries. The constraint engine has to ask “does any prefix of this token’s bytes leave me in a valid state, and if so what state?” — the answer can split a single legal character into multiple tokens or merge two character classes into one token. Outlines and XGrammar handle this for you; if you’re writing a constraint compiler yourself, the Lost in Space paper is the recent reference on how token-vocabulary irregularity changes constrained-decoding efficiency.

Unbounded grammars can deadlock. A CFG that allows arbitrary nesting (expr := expr "+" expr with no depth limit) lets the model open (((((... forever. Constrained decoding will not save you from this — it will happily emit valid-but-deep tokens until max_tokens runs out. Always set a depth cap in the grammar (expr := expr_d3 etc.) or accept that max_tokens is your only backstop.

The constraint cannot enforce semantics. A JSON Schema can require quantity > 0, but can it ensure total == sum(line_items)? No. Cross-field invariants, business rules, foreign-key validity — anything that requires reasoning across multiple fields — has to live in application-layer validation, not the decoder. The hybrid pattern from the structured-output article still applies: schema-constrained decoding for shape, retry-on-validate for semantics. Constrained decoding makes the retry loop disappear for the schema layer; it does not make the semantic layer disappear.

Hot-swapping the grammar mid-conversation is expensive. The compiled FSM/PDA is cached by (grammar_hash, tokenizer_hash). If your agent loop sends different schemas on each turn, you pay the compile cost on every cold schema. Pin the grammars you reuse to a stable set, or pre-warm the cache at deploy time the same way you would for prompt caching.

Strict-mode JSON Schema subsets bite open-weight users too. vLLM’s xgrammar backend is more permissive than OpenAI’s strict mode, but still has limits — unbounded oneOf over open-vocabulary strings, recursive $ref chains, regex patterns inside format. Test your schema end-to-end before assuming it’ll compile. The failure mode is typically a fall-back to unconstrained generation, not an error.

Logit bias is a different tool, not a worse tool. A common temptation is to “almost constrain” output with a few -100 biases on bad tokens. This is fine for blocklists and degenerate cases, but it cannot enforce structure — the mask is constant, so as soon as the model emits a { you can’t make the bias depend on “we’re now inside an object.” Use logit bias for content-policy nudges (banning the literal string “I cannot”), constrained decoding for shape. Mixing them is fine; substituting one for the other usually isn’t.

Constrained decoding helps small models more than large ones. A 4B-parameter model with strict JSON Schema constraints often outperforms an unconstrained 70B model on extraction tasks — the structural guarantee plus the small-model’s cheap inference is a Pareto improvement. The frontier-model side of the curve is where the format tax bites: constraining a 400B+ model that was already nearly perfect on structure can give you a worse output distribution for negligible reliability gain. Run the eval, both ways, before you commit.

Determinism is not free either. Even with a strict schema and temperature=0, two providers running the same model can produce different outputs because their FSM compilers handle a given schema differently — different tokenizer normalizations, different float arithmetic in the masking step. If you’re building an eval harness against a hosted strict mode, treat the constraint engine as part of the model under test.