Structured Output: JSON Mode and Schema Coercion
Reliable JSON from LLMs: JSON mode vs strict json_schema vs tool use vs retry-on-validate, with Instructor and the Vercel AI SDK in practice.
An invoice parser ships on a Friday. It works perfectly on the 50-document dev set and the 200-document staging set. Monday morning, the on-call gets paged: 4% of production invoices are flaming out downstream with Invalid date: 'Apr 12 2026'. The model didn’t return malformed JSON — every payload parsed. It returned a string in the wrong format for a field declared as date. The contract between the LLM and the downstream code was prose, not types. JSON mode would not have caught it; only a format: "date" constraint with schema-validated decoding would. Welcome to the structured-output problem: parseability is necessary, not sufficient.
Opening bridge
Yesterday’s piece on context engineering ended on a small but pivotal claim — structured payloads beat prose for facts the model needs to use deterministically. That argument was about the input side: how you frame the context block so attention can fetch facts by key. The mirror image is today. Once the model has produced an answer, how do you get it out in a shape your downstream code can trust without parsing prose? This is the first article in the Generation Control subtree, which returns to the inference parent (LLM Inference: Tokens, Context, and Sampling) and works through everything that happens between “the model knows the answer” and “the answer is usable as data.”
What “structured output” actually means
Structured output is the discipline of constraining the model’s response so that it conforms to a predeclared schema — usually expressed as a JSON Schema, a Pydantic model, or a Zod schema — before it reaches your code. “Conforms” is doing serious work in that sentence. It can mean four different things, in roughly increasing order of strength:
- JSON mode — the response is guaranteed to be syntactically valid JSON. Nothing about the keys, types, or values is guaranteed. This is what OpenAI’s
response_format: { type: "json_object" }and the older Gemini equivalents shipped first, and it is now considered legacy. - Schema-constrained decoding — at every decode step, the model’s logit distribution is masked so that only tokens compatible with the supplied schema are sampled. The output is guaranteed to parse and to match the schema’s structure (keys, types, enums, required fields). This is what OpenAI’s
response_format: { type: "json_schema", strict: true }(introduced August 2024), Anthropic’s Structured Outputs (public beta on the Claude Developer Platform), and the local-model libraries built on grammar-constrained generation do. - Tool-use coercion — instead of asking the model to produce JSON in its assistant message, you define a single tool whose
input_schemais your output schema, and force the model to call that tool. The tool arguments are the structured output. This pattern long predates first-class JSON-schema modes and is still the path of least resistance on providers that have schema-validated tool calls but not (yet) schema-validated free-form output. - Retry-on-validate — let the model produce free-form output, parse it, validate it against your schema in application code, and on failure feed the validation error back into a follow-up call (“you returned
Apr 12 2026; the field expects ISO-8601 date”). This is the original Instructor pattern, and it remains the only path on providers without schema-validated decoding, or when your schema exceeds the strict-mode subset.
In production, you almost always end up running a hybrid: schema-constrained decoding where the provider supports it, retry-on-validate for the semantic constraints that JSON Schema can’t express (cross-field invariants, “the total must equal the sum of line items,” etc.).
Intuition: typed contracts between two services
Treat the LLM as an upstream service you don’t fully trust. Your application is the downstream consumer. The right way to integrate two services that don’t share a codebase is a schema: a machine-checkable contract that says what fields exist, what types they have, which are required. The contract is enforced at the boundary so the downstream code can write invoice.issue_date.year without runtime checks.
Without a structured-output layer, the LLM-to-app boundary is a string. Your downstream code becomes the validator, the parser, and the error-handler all at once. With a structured-output layer, the boundary moves up the stack: the provider (or a wrapping library) does the validation and your code receives a typed object. The amount of defensive code you delete is the measure of the layer’s value.
The distributed-systems parallel
This is wire-format contracts, the same problem solved by Protobuf, Avro, and OpenAPI. A schema-on-write system validates at production time — Protobuf’s encoder rejects a message that doesn’t match the .proto — and a schema-on-read system validates at consumption time. Schema-constrained decoding is schema-on-write: the LLM cannot emit anything off-schema because the decoder won’t let it sample those tokens. Retry-on-validate is schema-on-read: the LLM emits whatever, your application parses and validates, and a failure triggers either a retry or an error.
The trade-off is identical to the one Avro and Protobuf made decades ago. Schema-on-write is more efficient (no parsing failures, no retries, no defensive code) but requires the producer to know the schema. Schema-on-read is more permissive (the producer can emit anything and the consumer copes) but pushes cost and complexity into every consumer. Production LLM systems almost always end up with schema-on-write at the API boundary and schema-on-read for cross-field semantic checks the schema language can’t express — exactly the way real-world Protobuf services have application-level validation on top of the wire format.
There’s a second parallel worth flagging: finite-state automata as a generation primitive. Schema-constrained decoding compiles the JSON Schema into a finite-state machine over the tokenizer’s vocabulary. At each decode step, the FSM’s current state determines which tokens are legal, and the decoder masks all illegal tokens to -inf logit before sampling. This is the same machinery that compiles a regular expression into a state machine for fast matching — the Outlines library and its dependency outlines-core ship that compilation in Rust precisely so the per-token cost stays in the microseconds. The connection to the sampling step covered in the inference fundamentals article is direct: structured output is sampling with a vocabulary mask that changes every step.
Mechanics: the four layers in detail
JSON mode (legacy). The model is post-trained to emit valid JSON when the response_format flag is set. There is no schema enforcement — the model might return {"answer": "yes"} when you wanted {"verdict": "yes", "confidence": 0.8}, and the API will happily return it. Useful only as a syntax guarantee; treat as a checkbox you used to need and rarely set in 2026 outside legacy code paths.
Schema-constrained decoding (the production default). You send a JSON Schema; the provider compiles it once (the first request with a given schema may add 100–500ms of latency for compilation, subsequent calls hit a cache) and uses it as a decode-time mask. Three constraints to know:
- Strict-mode subset. Both OpenAI and Anthropic’s strict modes only accept a subset of full JSON Schema. The recurring restrictions: every property must be
required(use a union withnullfor optional fields),additionalProperties: falseis mandatory at every object level, nesting depth and total property count are capped (OpenAI’s cap is currently 5 levels and 100 properties; check the docs for current numbers). Schemas that exceed the subset fall back to retry-on-validate or non-strict mode silently — read the error response carefully. formatvalidators.format: "date","date-time","uuid","email"are the load-bearing ones. They turn the prose-vs-types failure mode above into a decode-time impossibility.- Refusals are first-class. When the model refuses a request, OpenAI returns a
refusalfield instead of the parsed object. Treat the refusal as a real signal, not an exception — most production code should branch on it explicitly rather than throw.
Tool-use coercion. Define a single tool, e.g. submit_invoice, whose input_schema is your output schema. Set tool_choice to force the model to call it. The tool’s arguments are the structured output. This pattern works on every provider with schema-validated tool calls and predates the dedicated JSON-schema response format by over a year. It composes naturally with multi-tool agent loops — you can let the model search for context with one set of tools and then submit a structured answer with another. The function-calling and tool-use article goes deep on tool use as a generation primitive.
Retry-on-validate. The fallback. Your library parses the response, runs it through Pydantic/Zod, and on ValidationError makes a second call with the parsed error injected into the prompt: “your previous response had line_items[2].quantity = -1, which violates quantity > 0. Try again.” The cost is one extra round-trip per failure. The win is that you can enforce arbitrary semantic constraints — cross-field invariants, business rules, anything you can write as a Pydantic validator — that JSON Schema can’t express.
Code: Python with Instructor + Pydantic
Instructor is the workhorse library on the Python side — it patches the major SDK clients and routes through schema-constrained decoding where available, falling back to retry-on-validate where it isn’t. The library has several million monthly downloads as of mid-2026 and supports OpenAI, Anthropic, Google, Mistral, Cohere, and 15+ other providers behind a unified response_model= argument. Install: pip install instructor anthropic pydantic.
| |
Two points worth flagging. First, the JSON Schema layer enforces structure and types (date format, currency enum, non-negative integers) at decode time — Apr 12 2026 is not a legal date value, so it cannot be sampled. Second, the model_validator enforces a semantic constraint (line items sum to the total) that no JSON Schema dialect can express. If the model passes the decoder but fails the semantic check, Instructor retries with the validation error in the prompt — a hybrid of schema-on-write at the boundary and schema-on-read for the rest, exactly as the distributed-systems analogue prescribes.
Code: TypeScript with the Vercel AI SDK + Zod
On the TypeScript side, the Vercel AI SDK’s generateObject is the equivalent ergonomic layer. It takes a Zod schema, derives the JSON Schema for the provider, runs schema-constrained decoding where supported, and returns a typed object. Install: npm install ai @ai-sdk/anthropic zod.
| |
The shape mirrors the Python version: z.iso.date() and z.enum(...) are the type-level constraints enforced at decode time; the .refine() clause is the semantic constraint enforced after parsing. generateObject will retry on validation failure, feeding the Zod error back to the model. The returned object is typed as z.infer<typeof Invoice> — your IDE knows the shape, no any anywhere.
Trade-offs, failure modes, gotchas
The strict-mode subset bites silently. A schema with optional fields written the natural way (z.string().optional()) compiles to a JSON Schema the strict decoder may reject — strict mode requires every field be in required and “optional” is expressed as a union with null. Both Instructor and generateObject translate this for you, but if you hand-roll the JSON Schema, expect 400 errors that read like nonsense until you read the spec. Test your schema against a real call early.
Constrained decoding can subtly degrade quality. Forcing the model to sample inside a narrow vocabulary at every step is not free — there’s a small but real distribution shift between “the model’s natural answer” and “the most likely answer the schema allows.” For most production schemas the cost is negligible; for tightly constrained schemas with long enums and many fields, it shows up as worse reasoning (Aidan Cooper’s guide to constrained decoding has good intuition). The mitigation is the standard one: don’t over-constrain. If the model needs to “think” before answering, add an unconstrained reasoning string field before the structured fields.
Schema compilation cost on first call. The first time a provider sees a new schema, it compiles it to an FSM. That can add hundreds of milliseconds. Subsequent calls hit a cache. In a latency-sensitive endpoint, warm the cache with a synthetic call at deploy time; in a batch job, ignore it.
Refusals are not exceptions. When the model refuses, you get a structured refusal field (OpenAI) or an empty response (Anthropic, depending on the variant). Treat refusals as a first-class outcome in your code path, especially in eval harnesses — a refused response and a malformed response are different failure modes and should be counted separately.
Tool-use vs response_format is a real choice. If your call is a single-shot extraction, response_format: json_schema is cleaner. If your call sits inside an agent loop where the model also has retrieval tools, expressing the final answer as a tool — submit_answer(structured_answer) — composes more naturally with the rest of the loop. Both patterns work; pick one per code path and stay consistent.
Open-weights and local inference live on Outlines. For Llama, Mistral, Qwen, or anything served on vLLM/TGI, Outlines and its grammar-constrained generation are the canonical structured-output layer. The grammars extend beyond JSON Schema — you can constrain to a SQL grammar, a Python AST, or any context-free grammar. We’ll come back to this in the Constrained Decoding article later in this subtree.
Evaluate the parser, not just the model. A degrading structured-output pipeline often shows up as a slow rise in retries or validator failures before it shows up as a measurable quality drop. Log the validation error rate as a first-class metric next to answer quality — the RAG evaluation harness discussion of the NaN-on-bad-JSON failure mode in Ragas is the same problem at a different layer of the stack.
Further reading
- OpenAI — Introducing Structured Outputs in the API — the original launch post, August 2024. Read it for the mechanism (schema → FSM → decode-time mask) and for the strict-mode subset rules that every provider has converged on.
- Anthropic — Structured Outputs (Claude API docs) — the current docs for Anthropic’s beta. Covers both the JSON-output mode and the strict tool-use variant, with the schema subset constraints.
- Aidan Cooper — A Guide to Structured Outputs Using Constrained Decoding — a careful walkthrough of how the FSM-over-vocabulary trick actually works, what it costs, and when the constraint hurts model quality.
- Jason Liu — Instructor docs — the Pydantic-first guide to LLM structured output, by the library’s author. The “Tutorials” and “Cookbook” sections double as a survey of common extraction patterns.
What to read next
- Function Calling and Tool Use — the direct sequel. Tool-use coercion is the bridge: same FSM-over-vocabulary machinery, used to ask the runtime for side effects rather than to extract a typed payload.
- LLM Inference: Tokens, Context, and Sampling — the sampling primitives structured output piggybacks on. The vocabulary-mask trick is just
softmax(logits + mask)with most entries set to-inf. - Context Engineering: JIT vs AOT Context Loading — the input-side mirror of this article. Structured payloads beat prose on both ends of the call, for the same positional-attention reasons.
- RAG Evaluation: Recall, Faithfulness, and Answer Quality — uses structured output (Ragas in Python,
generateObjectin TypeScript) to implement LLM-as-judge. The NaN-on-bad-JSON failure mode is exactly what schema-constrained decoding is designed to eliminate.