Get Reliable JSON Output from LLMs: OpenAI, Claude, Gemini, and Llama


Every major LLM provider now ships a structured-output mode that returns JSON conforming to a schema you supply, at the token level, with no markdown fences or trailing prose. OpenAI calls it Structured Outputs and claims 100% schema adherence on gpt-4o-2024-08-06 and later; Anthropic Claude uses tool_use blocks with a forced tool_choice; Google Gemini exposes responseSchema with responseMimeType: application/json; local models like Llama 3 and Mistral get the same guarantee through grammar-constrained decoding in llama.cpp or the Outlines library. The right pattern is the same across providers — define a JSON Schema, pass it to the API, validate the response client-side with Zod or Pydantic, and retry on failure. The wrong patterns — asking for JSON in a system prompt and hoping, or using a regex to strip markdown fences after the fact — work most of the time and fail in production. This guide shows the working patterns for each provider with real code, and the validation and retry layer that should wrap all of them.

Got JSON back from an LLM and it's not parsing? Paste it into Jsonic's JSON Schema Validator to check both syntax and schema conformance in one pass — it shows the exact key path of every violation.


Why LLMs produce malformed JSON (and how each provider fixes it)

Base LLMs are sequence predictors trained on a mixture of natural language and code. When you ask one for JSON in a free-form prompt, it samples tokens from the same distribution it learned from — a distribution dominated by Markdown-fenced code blocks, explanatory prose, and conversational scaffolding like "Here is the JSON you requested:". The model is not consulting an internal JSON grammar; it is choosing the locally most-likely next token. That is why the failure modes are so consistent: triple-backtick fences, leading or trailing English sentences, smart quotes substituted for ASCII quotes, trailing commas in arrays, and occasionally hallucinated keys that looked plausible given the context.

The fix every provider settled on is the same idea implemented in different ways: constrain the token sampler so it can only emit tokens that keep the output conforming to a supplied schema. OpenAI calls this Structured Outputs and ships it as response_format: { type: "json_schema", json_schema: { ..., strict: true } }. Anthropic approaches it through tool_use, where the tool's input_schema describes the output shape. Google Gemini exposes the same capability through responseSchema. Local model runners do it explicitly: llama.cpp takes a GBNF grammar file, and Outlines compiles a JSON Schema into a finite-state machine that masks invalid tokens at each step.
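To make the token-masking idea concrete, here is a toy sketch. It is not how any provider implements it: the vocabulary, the scores, and the two-string "grammar" are all hypothetical stand-ins for a real model and a compiled schema. The point is only that the sampler picks the best-scoring token among those that keep the output a valid prefix.

```typescript
// Toy constrained decoder. Vocabulary, scores, and grammar are invented for illustration.
type Scored = { token: string; score: number };

// The "grammar": the complete set of strings the schema allows.
const allowedOutputs = ['{"ok":true}', '{"ok":false}'];

// A candidate text is legal if it is still a prefix of some allowed output.
function isValidPrefix(text: string): boolean {
  return allowedOutputs.some((full) => full.startsWith(text));
}

// Stand-in for a model's next-token scores: it "prefers" chatty prose and fences.
function fakeScores(_soFar: string): Scored[] {
  return [
    { token: "Sure! ", score: 5 },
    { token: "Here is the JSON:", score: 4 },
    { token: '{"ok":', score: 3 },
    { token: "true", score: 2 },
    { token: "false", score: 1 },
    { token: "}", score: 0.5 },
  ];
}

function constrainedDecode(maxSteps = 10): string {
  let out = "";
  for (let step = 0; step < maxSteps && !allowedOutputs.includes(out); step++) {
    // Mask: drop every token that would take the output outside the grammar.
    const legal = fakeScores(out).filter((c) => isValidPrefix(out + c.token));
    if (legal.length === 0) break;
    legal.sort((a, b) => b.score - a.score);
    out += legal[0].token; // greedy pick among the legal tokens only
  }
  return out;
}

const decoded = constrainedDecode();
console.log(decoded);
```

Even though the fake model scores "Sure! " highest at every step, the mask removes it, so the decoder can only ever complete one of the allowed strings.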

The end result is that, inside a strict mode, schema-invalid JSON is not just unlikely but mechanically impossible for a response that completes normally. Truncation at max_tokens or a refusal can still leave you without a usable object, and if you fall back to free-form prompting for any reason, the failure modes return.

OpenAI: response_format json_object vs json_schema

OpenAI ships two structured modes. JSON mode ({ type: "json_object" }) landed in November 2023 and guarantees that the response is valid, parseable JSON. It does not guarantee anything about shape: the model picks the keys. At least one message in the request (system or user) must mention the word "JSON", or the API returns a 400.
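A small pre-flight guard can catch the missing-"JSON" case before the request is sent. This is a sketch under assumptions: the ChatMessage shape is a simplified, hypothetical stand-in for the SDK's message types, and the case-insensitive match reflects how the API's error message is worded, not a documented contract.

```typescript
// Hypothetical minimal message shape; the real SDK types are richer.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// True when at least one message mentions "json" (case-insensitive, an assumption).
function mentionsJson(messages: ChatMessage[]): boolean {
  return messages.some((m) => /json/i.test(m.content));
}

// Prepend a short system instruction only when the word is missing.
function ensureJsonMention(messages: ChatMessage[]): ChatMessage[] {
  if (mentionsJson(messages)) return messages;
  return [{ role: "system", content: "Respond with a JSON object." }, ...messages];
}

const guarded = ensureJsonMention([{ role: "user", content: "Make me a simple pasta dish." }]);
console.log(guarded.length);
```

Running this before a json_object request turns a 400 from the API into a fix applied locally.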

Structured Outputs ({ type: "json_schema", json_schema: { name, schema, strict: true } }) shipped in August 2024 and adds token-level schema constraints. It requires gpt-4o-2024-08-06 or later (gpt-4o-mini-2024-07-18 and later are also supported) and accepts a subset of JSON Schema: the root must be a plain object rather than a union, every property must be listed in required, and additionalProperties must be false on every object. The payoff for those constraints is 100% schema adherence on OpenAI's internal evals, versus under 40% for the older gpt-4-0613 relying on prompting alone in the same evals.
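The three strict-mode rules named above (object root, every property required, additionalProperties false on every object) are easy to lint locally before a request. The checker below is a minimal sketch: the JsonSchema type and findStrictViolations are hypothetical names, and it covers only these three rules, not the full subset OpenAI enforces.

```typescript
// Simplified schema shape for illustration; real JSON Schema has many more keywords.
type JsonSchema = {
  type?: string;
  properties?: Record<string, JsonSchema>;
  required?: string[];
  additionalProperties?: boolean;
  items?: JsonSchema;
};

function findStrictViolations(schema: JsonSchema, path = "$"): string[] {
  const problems: string[] = [];
  if (path === "$" && schema.type !== "object") {
    problems.push(`${path}: root must be an object`);
  }
  if (schema.type === "object") {
    const props = schema.properties ?? {};
    const required = new Set(schema.required ?? []);
    Object.keys(props).forEach((key) => {
      if (!required.has(key)) problems.push(`${path}.${key}: not listed in required`);
    });
    if (schema.additionalProperties !== false) {
      problems.push(`${path}: additionalProperties must be false`);
    }
    // Recurse into nested property schemas.
    Object.keys(props).forEach((key) => {
      problems.push(...findStrictViolations(props[key], `${path}.${key}`));
    });
  }
  if (schema.items) problems.push(...findStrictViolations(schema.items, `${path}[]`));
  return problems;
}

const looseSchema: JsonSchema = {
  type: "object",
  properties: { name: { type: "string" }, servings: { type: "integer" } },
  required: ["name"],
};
const strictSchema: JsonSchema = { ...looseSchema, required: ["name", "servings"], additionalProperties: false };

const looseProblems = findStrictViolations(looseSchema);
const strictProblems = findStrictViolations(strictSchema);
console.log(looseProblems, strictProblems);
```

A lint like this is cheaper than discovering the same problems as a rejected request.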

import OpenAI from "openai"
import { z } from "zod"
import { zodResponseFormat } from "openai/helpers/zod"

const Recipe = z.object({
  name: z.string(),
  servings: z.number().int().min(1),
  ingredients: z.array(z.object({
    item: z.string(),
    quantity: z.string(),
  })),
})

const openai = new OpenAI()
const response = await openai.chat.completions.parse({
  model: "gpt-4o-2024-08-06",
  messages: [
    { role: "system", content: "Extract a recipe as structured JSON." },
    { role: "user", content: "Make me a simple pasta dish." },
  ],
  response_format: zodResponseFormat(Recipe, "recipe"),
})

// parsed is null if the model refused; otherwise it is typed and schema-validated
const recipe = response.choices[0].message.parsed

The zodResponseFormat helper converts a Zod schema into the JSON Schema shape OpenAI expects and re-applies it for client-side type narrowing — the parsed field comes back fully typed. See our OpenAI Structured Outputs deep dive for the schema subset rules and edge cases, and the OpenAI JSON mode deep dive for when the older mode is still the right tool.

Anthropic Claude: tool_use blocks for structured output

Claude has no response_format flag. The Anthropic-recommended path for structured JSON is tool_use: declare a single tool whose input_schema is the JSON Schema of your desired output, then force the model to call it by setting tool_choice to { type: "tool", name: "..." }. This has been the standard approach since tool use became available for the Claude 3 family in 2024, and it works the same way on Sonnet, Opus, and Haiku.

import Anthropic from "@anthropic-ai/sdk"

const client = new Anthropic()
const response = await client.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 1024,
  tools: [{
    name: "record_recipe",
    description: "Record a structured recipe",
    input_schema: {
      type: "object",
      properties: {
        name: { type: "string" },
        servings: { type: "integer", minimum: 1 },
        ingredients: {
          type: "array",
          items: {
            type: "object",
            properties: {
              item: { type: "string" },
              quantity: { type: "string" },
            },
            required: ["item", "quantity"],
          },
        },
      },
      required: ["name", "servings", "ingredients"],
    },
  }],
  tool_choice: { type: "tool", name: "record_recipe" },
  messages: [{ role: "user", content: "Make me a simple pasta dish." }],
})

const toolUse = response.content.find(b => b.type === "tool_use")
const recipe = toolUse?.input  // your structured payload

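The tool call payload lives in the input field of the first content block whose type is tool_use, which is why the code searches by type instead of assuming a fixed index. A forced tool call reliably returns a parsed object, but conformance to input_schema is best treated as very likely rather than guaranteed (a required field can still go missing), so validate the payload client-side. See our Claude tool_use guide for streaming tool calls, multi-tool selection, and the schema subset Anthropic supports.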
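The unwrapping step can be isolated in a small helper that fails loudly when the expected block is missing. This is a sketch: the block types below are simplified, hypothetical versions of the SDK's types, and sampleContent is hand-written, not real API output.

```typescript
// Simplified content-block shapes modeled on the response structure described above.
type TextBlock = { type: "text"; text: string };
type ToolUseBlock = { type: "tool_use"; id: string; name: string; input: unknown };
type ContentBlock = TextBlock | ToolUseBlock;

// Return the input of the named tool call, or throw if the model did not call it.
function getToolInput(content: ContentBlock[], toolName: string): unknown {
  const block = content.find(
    (b): b is ToolUseBlock => b.type === "tool_use" && b.name === toolName,
  );
  if (!block) throw new Error(`No tool_use block for "${toolName}"`);
  return block.input;
}

// Hand-written stand-in for response.content.
const sampleContent: ContentBlock[] = [
  { type: "text", text: "Recording the recipe." },
  {
    type: "tool_use",
    id: "toolu_example",
    name: "record_recipe",
    input: { name: "Aglio e olio", servings: 2 },
  },
];

const toolInput = getToolInput(sampleContent, "record_recipe") as { name: string; servings: number };
console.log(toolInput.name);
```

The returned value is typed unknown on purpose: it should go through a schema validator before being cast to a domain type.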

Google Gemini: responseSchema and responseMimeType

Gemini exposes structured output through generationConfig. Set responseMimeType to application/json for plain JSON mode, or add responseSchema for schema-constrained output. The responseSchema field went generally available with Gemini 1.5 Pro in May 2024 and is supported on every newer model including the Gemini 2.5 series.

import { GoogleGenerativeAI, SchemaType } from "@google/generative-ai"

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!)
const model = genAI.getGenerativeModel({
  model: "gemini-2.5-pro",
  generationConfig: {
    responseMimeType: "application/json",
    responseSchema: {
      type: SchemaType.OBJECT,
      properties: {
        name: { type: SchemaType.STRING },
        servings: { type: SchemaType.INTEGER },
        ingredients: {
          type: SchemaType.ARRAY,
          items: {
            type: SchemaType.OBJECT,
            properties: {
              item: { type: SchemaType.STRING },
              quantity: { type: SchemaType.STRING },
            },
            required: ["item", "quantity"],
          },
        },
      },
      required: ["name", "servings", "ingredients"],
    },
  },
})

const result = await model.generateContent("Make me a simple pasta dish.")
const recipe = JSON.parse(result.response.text())

The schema uses a Gemini-specific type enum (SchemaType.STRING, etc.) rather than raw JSON Schema strings, though plain strings also work in most SDK versions. Gemini supports a wider schema subset than OpenAI strict mode — including anyOf at the root and optional fields — but the same advice applies: keep schemas tight, mark required fields explicitly, and avoid deeply nested unions. Output comes back as a JSON string in result.response.text() which you JSON.parse on the client.

Local models (Llama 3, Mistral): grammar-constrained decoding with llama.cpp and Outlines

Open-weight models do not ship with a built-in structured-output API, but the same token-masking technique used by the closed providers is available through two libraries.

llama.cpp accepts a GBNF (GGML BNF) grammar file via the --grammar-file flag or the grammar parameter on the HTTP server endpoint. GBNF is a compact context-free grammar format; the project ships a json.gbnf for arbitrary valid JSON and json_arr.gbnf for arrays, plus a Python script that converts a JSON Schema to a tailored GBNF.

# Generate a GBNF grammar from a JSON Schema
python examples/json_schema_to_grammar.py recipe.schema.json > recipe.gbnf

# Run llama.cpp with the grammar
./llama-cli -m llama-3.1-8b-instruct.gguf \
  --grammar-file recipe.gbnf \
  -p "Make me a simple pasta dish. Return JSON only."
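The same grammar can also be sent per request to a running llama.cpp HTTP server through the grammar parameter. The sketch below makes several assumptions: a server listening on http://localhost:8080, the /completion endpoint returning its text in a content field, and a small hand-written GBNF grammar for an object with a single "name" string. The function names are hypothetical.

```typescript
// Hand-written GBNF for {"name": "<string>"}; String.raw keeps the backslashes literal.
const nameOnlyGrammar = String.raw`root   ::= "{" ws "\"name\"" ws ":" ws string ws "}"
string ::= "\"" [^"\\]* "\""
ws     ::= [ \t\n]*`;

// Build the request separately so it can be inspected without a running server.
function buildCompletionRequest(prompt: string, grammar: string) {
  return {
    url: "http://localhost:8080/completion", // assumed local server address
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ prompt, grammar, n_predict: 256 }),
    },
  };
}

// Sends the request; only call this with a llama.cpp server actually running.
async function completeWithGrammar(prompt: string, grammar: string): Promise<unknown> {
  const { url, init } = buildCompletionRequest(prompt, grammar);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`llama.cpp server returned ${res.status}`);
  const data = (await res.json()) as { content: string };
  return JSON.parse(data.content);
}

const previewRequest = buildCompletionRequest("Name a simple pasta dish.", nameOnlyGrammar);
console.log(previewRequest.init.body);
```

Passing the grammar per request lets one server process handle several output shapes without a restart.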

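Outlines (Python) does the same thing at a higher level and works with Hugging Face Transformers, vLLM, llama.cpp, and several other backends. You pass a Pydantic model or a Python type and Outlines compiles it into a finite-state machine that masks the model logits at each generation step. The example uses the models / generate.json API from the 0.x releases; later major versions reorganized the entry points, so check the documentation for the version you have installed.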

from outlines import models, generate
from pydantic import BaseModel

class Ingredient(BaseModel):
    item: str
    quantity: str

class Recipe(BaseModel):
    name: str
    servings: int
    ingredients: list[Ingredient]

model = models.transformers("meta-llama/Llama-3.1-8B-Instruct")
generator = generate.json(model, Recipe)
recipe = generator("Make me a simple pasta dish.")
# recipe is a Recipe instance, already validated

Both approaches give the same hard guarantee as OpenAI strict mode — the model cannot emit a token that would invalidate the schema. The runtime cost is slightly higher than closed-provider strict modes (the FSM update happens on the CPU between forward passes) but for production workloads the throughput hit is usually under 10%.

Validating LLM JSON output: Zod, Pydantic, ajv, schema-guided retry loops

Even with token-level constraints, client-side validation is the defense-in-depth layer that catches the rare miss — a model rolling back, a streaming truncation, or an older non-strict endpoint. Validate every LLM response before letting it flow into business logic.

| Library | Runtime | Schema source | Best for |
| --- | --- | --- | --- |
| zod | TypeScript/JS | Zod schema DSL | Single source of truth: Zod → JSON Schema with z.toJSONSchema |
| ajv | JS | Raw JSON Schema | When schemas come from external sources (OpenAPI, third-party APIs) |
| pydantic | Python | Pydantic models | Same single-source-of-truth pattern in Python; OpenAI/Outlines integrate directly |
| jsonschema | Python | Raw JSON Schema | Pure-Python validator for non-Pydantic codebases |
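For readers who want to see what these libraries do under the hood, here is a dependency-free sketch of the same check written as hand-rolled TypeScript type guards for the recipe shape used throughout this guide. It is illustrative only; a library such as Zod or Ajv gives better error messages and is the recommended route. The function names are hypothetical.

```typescript
type Ingredient = { item: string; quantity: string };
type Recipe = { name: string; servings: number; ingredients: Ingredient[] };

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

function isIngredient(v: unknown): v is Ingredient {
  return isRecord(v) && typeof v.item === "string" && typeof v.quantity === "string";
}

function isRecipe(v: unknown): v is Recipe {
  if (!isRecord(v)) return false;
  const { name, servings, ingredients } = v;
  return (
    typeof name === "string" &&
    typeof servings === "number" &&
    Number.isInteger(servings) &&
    servings >= 1 &&
    Array.isArray(ingredients) &&
    ingredients.every(isIngredient)
  );
}

// Parse and validate in one step; null means "do not let this reach business logic".
function parseRecipe(raw: string): Recipe | null {
  try {
    const value: unknown = JSON.parse(raw);
    return isRecipe(value) ? value : null;
  } catch {
    return null;
  }
}

const goodRecipe = parseRecipe(
  '{"name":"Aglio e olio","servings":2,"ingredients":[{"item":"garlic","quantity":"4 cloves"}]}',
);
const badRecipe = parseRecipe('{"name":"Aglio e olio","servings":"2","ingredients":[]}');
console.log(goodRecipe?.name, badRecipe);
```

The second call fails because servings arrived as a string, which is exactly the kind of near-miss that slips through when only JSON syntax is checked.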

The retry loop pattern: on validation failure, feed the validator error message back to the model as a follow-up turn asking it to fix the listed problems, retry up to N times, then give up and surface a user-facing error.

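import OpenAI from "openai"
import { z } from "zod"
import { zodResponseFormat } from "openai/helpers/zod"

const openai = new OpenAI()

async function callWithRetry<T extends z.ZodTypeAny>(
  schema: T,
  prompt: string,
  maxAttempts = 3,
): Promise<z.infer<T>> {
  // The conversation grows with each failed attempt so the model sees what to fix
  const messages: OpenAI.Chat.ChatCompletionMessageParam[] = [
    { role: "user", content: prompt },
  ]
  let lastError = "no attempts made"
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const response = await openai.chat.completions.parse({
      model: "gpt-4o-2024-08-06",
      messages,
      response_format: zodResponseFormat(schema, "output"),
    })
    const message = response.choices[0].message
    // parsed is null on a refusal, so safeParse fails and we retry
    const result = schema.safeParse(message.parsed)
    if (result.success) return result.data
    lastError = result.error.message
    // Feed back the failed output together with the validator error
    messages.push(
      { role: "assistant", content: message.content ?? "" },
      { role: "user", content: `That response failed validation: ${lastError}. Return a corrected version.` },
    )
  }
  throw new Error(`Failed after ${maxAttempts} attempts: ${lastError}`)
}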
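The retry pattern itself does not depend on any provider. The sketch below expresses it with an injected callModel function and an injected validator, so the loop can be exercised with a fake model. All names here (retryUntilValid, Validation, fakeModel) are hypothetical, and the fake model simply returns a bad answer first and a good one once it receives feedback.

```typescript
type Validation<T> = { ok: true; value: T } | { ok: false; error: string };

async function retryUntilValid<T>(
  callModel: (prompt: string, feedback: string | null) => Promise<string>,
  validate: (raw: string) => Validation<T>,
  prompt: string,
  maxAttempts = 3,
): Promise<T> {
  let feedback: string | null = null;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await callModel(prompt, feedback);
    const result = validate(raw);
    if (result.ok) return result.value;
    // Carry both the failed output and the error into the next attempt.
    feedback = `Previous output: ${raw}\nValidation error: ${result.error}`;
  }
  throw new Error(`No valid output after ${maxAttempts} attempts. ${feedback}`);
}

// Fake model for demonstration: wrong type first, correct once it gets feedback.
let modelCalls = 0;
const fakeModel = async (_prompt: string, feedback: string | null): Promise<string> => {
  modelCalls++;
  return feedback === null ? '{"servings": "two"}' : '{"servings": 2}';
};

const validateServings = (raw: string): Validation<{ servings: number }> => {
  try {
    const v = JSON.parse(raw) as { servings?: unknown };
    return typeof v.servings === "number"
      ? { ok: true, value: { servings: v.servings } }
      : { ok: false, error: "servings must be a number" };
  } catch (e) {
    return { ok: false, error: `invalid JSON: ${String(e)}` };
  }
};

const retried = retryUntilValid(fakeModel, validateServings, "How many servings?");
retried.then((r) => console.log(r.servings, modelCalls));
```

Keeping the model call and the validator as parameters means the same loop can wrap OpenAI, Claude, Gemini, or a local runner.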

For the Zod-to-JSON-Schema conversion pattern and the type-narrowing that follows, see our type-safe parsing with Zod guide.

Common failure modes: markdown fences, trailing prose, partial JSON, hallucinated keys

When you cannot use a strict mode — older model, dynamic schema, batch endpoint that does not accept response_format — these are the failure shapes you will see and the repairs that work.

| Failure | What you see | Repair |
| --- | --- | --- |
| Markdown fences | ```json {...} ``` wrapping the payload | Strip leading ```json or ```, strip trailing ```, then parse |
| Trailing prose | Valid JSON followed by "I hope this helps!" | Parse incrementally; stop at the first balanced top-level object |
| Leading prose | "Sure! Here is the JSON:" then payload | Find the first { or [; trim everything before it |
| Smart quotes | U+201C/U+201D substituted for ASCII " | Replace with ASCII before parsing; better, switch to strict mode |
| Trailing commas | [1, 2, 3,] | JSON5 parser, or regex strip ,(\s*[\]}]) |
| Truncation (max_tokens hit) | Output ends mid-string or mid-array | Check finish_reason before parsing; raise max_tokens or implement continuation |
| Hallucinated keys | Extra fields not in your schema | Schema validation with additionalProperties: false; reject or strip extras |
| Wrong types | Number returned as string, or vice versa | Zod/Pydantic coercion (z.coerce.number()) or strict mode |

The general principle: every failure mode in this list has a regex or post-processing fix that works most of the time, and none of them work all of the time. The only durable fix is structured output mode. Treat post-processing as a temporary bridge while you migrate to a strict endpoint, not as a long-term solution.
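As a stopgap, several of the repairs in the table can be chained into one best-effort function. The sketch below handles fences, leading and trailing prose, smart quotes, and trailing commas, and returns null when no balanced object or array is found (for example after truncation). The function names are hypothetical, and the limitations are real: the quote and comma replacements also touch characters inside string values, so this can corrupt payloads that legitimately contain them.

```typescript
// Slice out the first balanced top-level object or array, ignoring brackets inside strings.
function sliceFirstJsonValue(text: string): string | null {
  const start = text.search(/[{[]/);
  if (start === -1) return null;
  let depth = 0;
  let inString = false;
  let escaped = false;
  for (let i = start; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "{" || ch === "[") depth++;
    else if (ch === "}" || ch === "]") {
      depth--;
      if (depth === 0) return text.slice(start, i + 1);
    }
  }
  return null; // never balanced: most likely truncated output
}

function extractJson(raw: string): unknown {
  const asciiQuotes = raw.replace(/[\u201C\u201D]/g, '"'); // smart quotes to ASCII
  const candidate = sliceFirstJsonValue(asciiQuotes); // drops fences and surrounding prose
  if (candidate === null) return null;
  const noTrailingCommas = candidate.replace(/,(\s*[\]}])/g, "$1");
  try {
    return JSON.parse(noTrailingCommas);
  } catch {
    return null;
  }
}

// Build the fence marker at runtime so this example contains no literal fence line.
const fence = "`".repeat(3);
const messy = [
  "Sure! Here is the JSON:",
  fence + "json",
  '{\u201Cname\u201D: "Pasta", "tags": ["quick", "easy",],}',
  fence,
  "I hope this helps!",
].join("\n");

const repaired = extractJson(messy);
const truncated = extractJson('{"a": [1, 2');
console.log(repaired, truncated);
```

Whatever comes out of a repair function like this still needs schema validation, since a syntactically repaired payload can have the wrong shape.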

Choosing a strategy: JSON mode vs function calling vs grammar constraints

The three approaches solve overlapping problems with different tradeoffs. This decision table covers the common cases.

| Need | Recommended approach | Why |
| --- | --- | --- |
| Static schema, OpenAI | Structured Outputs (json_schema strict) | 100% adherence, typed parse helper, single round-trip |
| Static schema, Claude | tool_use with forced tool_choice | Native structured path; validate the payload client-side |
| Static schema, Gemini | responseSchema | Native, supports broader schema subset than OpenAI strict |
| Static schema, local model | Outlines (Python) or llama.cpp GBNF | Same token-level guarantee, runs anywhere |
| Schema generated at runtime | Function calling with strict: true (OpenAI), tool_use (Claude) | Schema is supplied per request, within the provider's supported subset |
| Truly dynamic shape | Plain JSON mode + client validation | Only option when shape is not known ahead of time |
| Streaming UI updates | Strict mode + partial-json parser | Schema-valid prefixes stream incrementally |
| Multi-step agent | Function/tool calling per step | Each tool definition is a structured-output declaration |
| Batch / async | Structured Outputs (supported in OpenAI Batch API) | Same guarantees as sync endpoint |

See our function calling schemas guide for the schema subset rules that apply when defining tools across providers, and the JSON Schema tutorial for the underlying schema dialect every provider expects (Draft 2020-12 with provider-specific subsets).

Key terms

Structured Outputs
OpenAI feature that constrains the model token sampler to emit only outputs that conform to a supplied JSON Schema. Enabled via response_format.json_schema with strict: true. Requires gpt-4o-2024-08-06 or later. Claims 100% schema adherence on internal evals.
JSON mode
The older OpenAI flag response_format: { type: "json_object" }. Guarantees valid parseable JSON but not schema conformance. Requires the word "JSON" to appear somewhere in the request messages. Useful when the schema is dynamic.
tool_use / function calling
The pattern where the model invokes a declared function whose parameters or input_schema is a JSON Schema describing the desired output. Forcing a single tool with tool_choice makes it equivalent to a structured-output mode. Supported by every major provider.
GBNF (GGML BNF)
A compact Backus-Naur form grammar dialect used by llama.cpp to constrain local-model generation. A Python script in the llama.cpp repo converts a JSON Schema to a tailored GBNF grammar.
Outlines
A Python library (version 0.1+) that compiles a Pydantic model or JSON Schema into a finite-state machine which masks invalid tokens at each generation step. Works with Transformers, vLLM, and llama.cpp backends.
responseSchema
The Google Gemini equivalent of OpenAI Structured Outputs, set under generationConfig alongside responseMimeType: application/json. GA since Gemini 1.5 Pro (May 2024).

Frequently asked questions

Why do LLMs sometimes return JSON wrapped in ```json fences?

Base LLMs were trained on enormous amounts of Markdown — GitHub READMEs, Stack Overflow answers, technical blog posts — where code samples are conventionally fenced with triple backticks and a language hint. When you ask a model for JSON without using a structured output mode, it falls back to that learned formatting habit and wraps the payload. The fix is to switch to the provider native structured mode (json_schema on OpenAI, tool_use on Claude, responseSchema on Gemini) which bypasses free-form generation entirely and emits raw JSON tokens. If you cannot use a structured mode, post-process the response with a regex that strips the leading ```json or ``` and the trailing ```, then run JSON.parse. Add the instruction "Return only JSON, no markdown" to the system prompt as a secondary defense, but treat it as best-effort — instructions alone do not eliminate the failure.

What is the difference between OpenAI JSON mode and Structured Outputs?

JSON mode (response_format: { type: "json_object" }) was launched in November 2023 and guarantees that the response parses as valid JSON. It does not guarantee anything about the shape: the model can return any object with any keys, and you must validate the schema yourself. At least one message in the request must contain the word "JSON" or the API rejects the request. Structured Outputs (response_format: { type: "json_schema", json_schema: { name, schema, strict: true } }) shipped in August 2024 and goes further: the model is constrained at the token level to produce only outputs that satisfy the supplied JSON Schema, with 100% schema adherence according to OpenAI internal evals. It requires gpt-4o-2024-08-06 or later and supports a subset of JSON Schema (no union at the root, all properties required, additionalProperties false). The main latency cost falls on the first request with a new schema, while the API processes and caches it. Use Structured Outputs whenever your schema is known ahead of time.

Does Claude support a native JSON mode like OpenAI?

Claude does not expose a response_format flag. The Anthropic-recommended pattern is tool_use: define a single tool whose input_schema is the JSON Schema you want the model to emit, then force the model to call that tool by setting tool_choice to { type: "tool", name: "your_tool_name" }. The tool call payload in the response is your structured JSON, available as the input field of the tool_use content block (find the block by type instead of assuming it sits at index 0). This has been the standard approach since tool use became available for the Claude 3 family in 2024, and it works the same way across Sonnet, Opus, and Haiku. In practice the result is close to OpenAI Structured Outputs, but treat schema conformance as very likely rather than guaranteed and keep client-side validation in place. The main ergonomic gap is the extra unwrapping step.

Can I use a JSON Schema with function calling?

Yes. Function calling (or tool use, depending on provider terminology) is built on JSON Schema. Every provider that supports tools accepts a parameters or input_schema field holding a provider-specific subset of JSON Schema. OpenAI added strict: true to function definitions in August 2024 alongside Structured Outputs, giving function calls the same 100% schema adherence guarantee. Claude tool_use takes the schema as input_schema. Gemini function declarations take a schema in the parameters field, based on a subset of the OpenAPI schema object. The practical implication: if you want structured output, define it as a function or tool with a JSON Schema for parameters and force the model to invoke it. The function arguments object becomes your structured payload. This works in much the same way across providers and is the most portable structured-output pattern. See our function calling schemas guide for the schema subset each provider supports.

How do I prevent an LLM from hallucinating extra JSON keys?

Use a strict structured-output mode and set additionalProperties to false in your JSON Schema. With OpenAI Structured Outputs in strict mode, additionalProperties: false is required at every object level (the API rejects schemas that omit it), and the token-level constrained decoder cannot emit keys outside the schema. With Claude tool_use, set additionalProperties: false in input_schema as well, but still check for extra keys client-side. Without strict mode, validation must happen client-side: parse the response, run it through a schema validator (Zod, Ajv, Pydantic), and reject or repair anything with extra keys. The retry pattern is to feed the validator error back into the model with a fresh request asking it to fix the listed problems. For the few stubborn cases, usually older models or providers without strict mode, temperature: 0 reduces variability and explicit examples of correct output in the system prompt help anchor the format.

Which is more reliable: JSON mode or tool/function calling?

Tool calling with a strict JSON Schema is more reliable than free-form JSON mode in nearly every case. A token-level constrained decoder, as used by strict tool calls and Structured Outputs, cannot produce malformed JSON or schema violations: it is a hard constraint, not a probabilistic preference. JSON mode (OpenAI plain json_object, Gemini responseMimeType without responseSchema) guarantees parseable JSON but not schema adherence: extra keys, wrong types, and missing required fields all still happen. In OpenAI's published evals, Structured Outputs scored 100% on complex schema following, while the older gpt-4-0613 without it scored under 40%. The added latency is mostly a one-time cost when a new schema is first processed. The exception is when your schema is genuinely dynamic (generated from user input at runtime), in which case JSON mode plus client-side validation may be the only option. For static or semi-static schemas, default to the strict structured mode every time.

How do I handle streaming JSON output from an LLM?

Streaming structured JSON requires a partial JSON parser that tolerates incomplete input. The standard JSON.parse will throw on every chunk until the final token arrives. Use a streaming-aware library: partial-json (npm, ~3KB) parses incomplete JSON by closing open strings, arrays, and objects with sane defaults; best-effort-json-parser does the same with slightly different heuristics. OpenAI Structured Outputs streams tokens that are guaranteed valid prefixes of a schema-conforming object, so as more tokens arrive the parser progressively reveals more fields. The Vercel AI SDK ships a streamObject helper that handles this end-to-end with a Zod schema, emitting incremental updates as keys complete. For tool calls, Anthropic streams tool_use blocks with input_json_delta events — concatenate the deltas, parse on each tick with a partial parser to update UI optimistically, then validate the final accumulated string with strict JSON.parse and your schema validator.
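The core trick behind a partial JSON parser can be shown in a few lines: track which strings, arrays, and objects are still open, append the matching closers, and try to parse. The sketch below is deliberately naive and is not how partial-json or best-effort-json-parser are implemented; parsePartialJson is a hypothetical name, and it returns undefined for prefixes it cannot complete, such as one ending right after a colon or a comma.

```typescript
function parsePartialJson(prefix: string): unknown {
  const closers: string[] = [];
  let inString = false;
  let escaped = false;
  for (let i = 0; i < prefix.length; i++) {
    const ch = prefix[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
    } else if (ch === '"') inString = true;
    else if (ch === "{") closers.push("}");
    else if (ch === "[") closers.push("]");
    else if (ch === "}" || ch === "]") closers.pop();
  }
  // Close an open string first, then close containers from innermost to outermost.
  const completed = prefix + (inString ? '"' : "") + closers.reverse().join("");
  try {
    return JSON.parse(completed);
  } catch {
    return undefined; // prefix ends at a point this naive closer cannot repair
  }
}

const midString = parsePartialJson('{"name": "Spaghe');
const midArray = parsePartialJson('{"name": "Pasta", "ingredients": [{"item": "garlic"');
const afterColon = parsePartialJson('{"name":');
console.log(midString, midArray, afterColon);
```

In a streaming UI, a caller would keep the last defined result and simply skip ticks where the function returns undefined.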

What happens if the LLM exceeds max_tokens mid-JSON?

The response is truncated wherever the token budget runs out (usually mid-string, mid-array, or mid-object) and the partial output will not parse. Each provider reports this in a differently named field: finish_reason is length on OpenAI, stop_reason is max_tokens on Anthropic, and finishReason is MAX_TOKENS on Gemini, instead of the provider's normal completion value. Always check that field before parsing. Three mitigations: (1) raise max_tokens to a generous ceiling that comfortably exceeds your largest expected payload, since token cost only applies to actual generated tokens, not the limit; (2) keep the schema minimal by stripping optional descriptive fields, using short property names, and returning references or IDs instead of full nested objects; (3) implement a continuation strategy where, on a truncated finish, you make a second call with the partial output as context and ask the model to complete from where it stopped. For Structured Outputs the partial-JSON parser approach works for graceful degradation, but the result is not schema-valid until the full response arrives.
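The truncation check can be normalized across providers with a small helper. The sketch below assumes you have already pulled the relevant field out of each response (finish_reason on OpenAI, stop_reason on Anthropic, finishReason on Gemini); the StopInfo type and function names are hypothetical.

```typescript
// One variant per provider, each carrying that provider's own stop field.
type StopInfo =
  | { provider: "openai"; finish_reason: string | null }
  | { provider: "anthropic"; stop_reason: string | null }
  | { provider: "gemini"; finishReason: string | undefined };

function wasTruncated(info: StopInfo): boolean {
  switch (info.provider) {
    case "openai":
      return info.finish_reason === "length";
    case "anthropic":
      return info.stop_reason === "max_tokens";
    case "gemini":
      return info.finishReason === "MAX_TOKENS";
  }
}

// Refuse to parse output that is known to be cut off.
function parseIfComplete(info: StopInfo, text: string): unknown {
  if (wasTruncated(info)) {
    throw new Error(`Output truncated by token limit (${info.provider}); raise max_tokens or continue`);
  }
  return JSON.parse(text);
}

const complete = parseIfComplete({ provider: "openai", finish_reason: "stop" }, '{"servings": 2}');
const cutOff = wasTruncated({ provider: "anthropic", stop_reason: "max_tokens" });
console.log(complete, cutOff);
```

Throwing a dedicated error here gives the caller a clear signal to raise the limit or start a continuation call, instead of a confusing JSON syntax error.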

Further reading and primary sources