JSON to DuckDB: read_json(), JSON Type, and Fast Local Analytics

Last updated:

DuckDB is the in-process analytical database that turned JSON into a first-class analytical citizen — its built-in JSON extension parses files at 100k to 1M records per second on commodity hardware, infers schemas automatically, and exposes the same -> and ->> operators PostgreSQL users already know. You can query a local .json file, a gzipped NDJSON stream, or an entire S3 prefix with the same read_json function and no separate load step. For anything beyond a one-shot jq filter, DuckDB is the fastest way to get from raw JSON to an answer. This page is a practical reference: how read_json infers schemas, when to declare columns explicitly, the difference between the JSON type and STRUCT, the path operators, NDJSON handling, exporting back to JSON, and how DuckDB compares to PostgreSQL JSONB and SQLite JSON.

Hit a parse error mid-load? Paste the offending JSON into Jsonic's JSON Validator — it pinpoints the exact line, the unbalanced bracket, or the trailing comma that broke read_json.

Validate JSON before loading

DuckDB JSON extension: read_json, read_json_auto, json_each

The JSON extension ships built-in with DuckDB since version 0.8 and is loaded automatically on first use — there is no INSTALL json; LOAD json; step on a default build. It adds three families of capability: file readers (read_json, read_json_auto), a JSON column type with path operators, and table-producing functions for iterating nested data (json_each, json_tree).

read_json and read_json_auto are aliases — the _auto name predates 0.8 and is kept for backward compatibility. Both functions sample the input, infer a schema, and return a relation. The key arguments worth knowing:

  • format'array' (default for files starting with [), 'newline_delimited' for NDJSON, or 'auto' to detect
  • columns — a struct literal declaring the schema explicitly: {id: 'BIGINT', name: 'VARCHAR'}
  • sample_size — number of records to sample for inference; -1 scans the whole file
  • maximum_depth — how deep to expand nested objects into STRUCT (defaults to -1, no limit)
  • union_by_name — when reading multiple files via a glob, union columns by name rather than position

json_each takes a JSON value and produces one row per element of the top-level array or object — useful when a single column contains an array you want to unnest into rows. For deep, recursive walks, json_tree emits every node with its path. Both are common building blocks for log-parsing pipelines.

Loading JSON: auto-detect schema vs explicit columns

The fastest way to start is auto-detect. Open the CLI, point at a file, and select:

-- DuckDB CLI: query a JSON file directly
SELECT *
FROM read_json('users.json')
LIMIT 5;

-- Inspect the inferred schema
DESCRIBE SELECT * FROM read_json('users.json');

DuckDB samples ~20,480 records, infers types, and returns columns. That works well when every record has the same shape and types are consistent. When records drift — a field that is sometimes a string and sometimes a number, or sometimes missing entirely — auto-detect can produce surprising results. Two failure modes show up in practice: a column that is always null in the sample gets dropped or typed as NULL, and a column that switches between integer and string collapses to VARCHAR or JSON.

For production loads, declare the schema explicitly. The cost is a few lines; the payoff is predictable, type-safe, faster parsing (no sampling pass) and clear errors when input drifts:

-- Explicit columns + format for an NDJSON event stream
CREATE TABLE events AS
SELECT *
FROM read_json(
  'events/*.jsonl',
  format = 'newline_delimited',
  columns = {
    id: 'BIGINT',
    ts: 'TIMESTAMP',
    user_id: 'BIGINT',
    event_type: 'VARCHAR',
    payload: 'JSON'
  }
);

Notice payload: 'JSON' — the column keeps its raw JSON shape for later inspection while every other field is typed. This is the pattern most analytical loads end up at: typed top-level columns plus a JSON catchall for the variable payload.

JSON type vs STRUCT type: when to use each

DuckDB gives you two storage choices for nested data: the JSON logical type (parsed JSON kept in a string-like representation) and the STRUCT physical type (each sub-field stored as its own typed column). The difference is the same as the PostgreSQL JSONB vs typed-column choice, and the trade-off matters for both query speed and storage size.

PropertyJSON typeSTRUCT type
Schema known up frontNo — shape can vary per rowYes — declared at column-define time
Access syntaxcol->'name', json_extract(col, '$.name')col.name (dot notation)
StorageOne blob per row (parsed JSON)One typed array per sub-field
Column pruningNo — reads full blobYes — reads only referenced sub-fields
CompressionPer-row blob compressionPer-column type-aware compression
Best forVariable shape, audit logs, raw payloadKnown-shape analytics columns

The practical rule: if the shape is stable across rows, promote to STRUCT immediately and address sub-fields with dot notation. Queries get the full benefit of DuckDB's vectorized column store. Keep JSONfor genuinely variable data — raw webhook payloads, mixed-schema logs, anything where today's new field is tomorrow's removed field. You can mix the two freely in one table.

JSON path queries: -> and ->> operators (PostgreSQL-compatible)

DuckDB borrows PostgreSQL's arrow operators verbatim, which means most JSONB query patterns you already know transfer directly. -> extracts a value and returns it as JSON (so you can keep chaining); ->> extracts and casts to TEXT (the final hop).

-- Sample data: events.payload is JSON
-- { "user": { "id": 42, "email": "a@b.com" }, "tags": ["alpha", "beta"] }

-- Drill in with -> (result stays JSON, supports further chaining)
SELECT payload -> 'user' -> 'id' AS user_id_json
FROM events;

-- Final hop with ->> (result is TEXT — ready for comparison and output)
SELECT payload -> 'user' ->> 'email' AS email
FROM events
WHERE payload -> 'user' ->> 'email' LIKE '%@example.com';

-- Numeric keys index arrays
SELECT payload -> 'tags' ->> 0 AS first_tag
FROM events;

-- Combine with json_array_length for shape checks
SELECT id
FROM events
WHERE json_array_length(payload -> 'tags') > 3;

Both operators work against the JSON type and against VARCHAR columns that happen to contain JSON text. The VARCHAR case re-parses on every call — fine for one-off exploration, expensive in a hot loop. If you need to navigate the same VARCHAR column many times, cast it once to JSON in a CTE or materialized view.

The mental rule: use -> while you are still digging into nested structure, switch to ->> on the final hop where you want a plain string. Mixing them up is the most common source of confusing type errors in WHERE clauses.

json_extract, json_extract_string, and casting

The arrow operators are compact but get awkward past two levels. json_extract takes a JSONPath string and walks arbitrary depth in one call — the readability trade-off swings to json_extract once you are three levels deep or want to use array ranges.

-- Equivalent extractions
SELECT
  payload -> 'user' -> 'address' ->> 'city'    AS arrow_form,
  json_extract_string(payload, '$.user.address.city') AS extract_form
FROM events;

-- json_extract returns JSON; json_extract_string returns VARCHAR
-- For numeric output, cast explicitly
SELECT CAST(json_extract(payload, '$.user.id') AS BIGINT) AS user_id
FROM events
WHERE CAST(json_extract(payload, '$.user.id') AS BIGINT) > 1000;

-- JSONPath also supports arrays
SELECT json_extract(payload, '$.tags[0]')      AS first_tag,
       json_extract(payload, '$.tags[*]')      AS all_tags,
       json_extract(payload, '$.tags[0:3]')    AS first_three;

-- Cast a JSON value to a typed scalar in one step
SELECT (payload -> 'user' ->> 'id')::BIGINT AS user_id
FROM events;

json_extract_string is a convenience that combines extraction and TEXT casting — equivalent to ->> applied at the end of a path. For typed scalar outputs (BIGINT, DOUBLE, BOOLEAN), wrap the extraction in a CAST or use the :: shorthand. DuckDB parses the value once and emits the typed result — no string round-trip.

For exploration, json_structure(json_col) returns a JSON-shaped description of the inferred schema for a JSON value — the easiest way to learn the shape of a column you did not ingest yourself.

Loading NDJSON / JSON Lines with format='newline_delimited'

NDJSON (also called NDJSON/JSONL format) is the dominant export format for log streams, BigQuery, Kafka, and most streaming pipelines: one self-contained JSON object per line, no surrounding array. DuckDB handles it natively — pass format='newline_delimited' and every line becomes a row.

-- Single NDJSON file
SELECT count(*)
FROM read_json('events.jsonl', format = 'newline_delimited');

-- Glob across a partitioned export — all files become one logical relation
SELECT count(*) AS rows, count(DISTINCT user_id) AS uniq_users
FROM read_json(
  'logs/dt=2026-05-*/part-*.jsonl.gz',
  format = 'newline_delimited',
  union_by_name = true
);

-- Hive-style partition columns extracted from path
SELECT dt, event_type, count(*)
FROM read_json(
  'logs/dt=*/part-*.jsonl',
  format = 'newline_delimited',
  hive_partitioning = true
)
GROUP BY dt, event_type
ORDER BY dt;

Gzip (.gz) and zstd (.zst) files are decompressed transparently — no preprocessing or temp files needed. union_by_name matters when the export schema drifts across days: it merges columns by name across files and fills missing ones with NULL, instead of failing on the first shape mismatch.

hive_partitioning = true tells DuckDB to parse key=value path segments as virtual columns — so a query like WHERE dt = '2026-05-20' prunes entire files from the scan before reading a byte. This is the same partition-pruning behavior Spark and Presto users expect, available here without a metastore.

Writing JSON: COPY ... TO format='json'

Round-tripping out of DuckDB to JSON uses the same COPY ... TO statement that handles Parquet and CSV. The default emits a single JSON array (one giant array of row objects); pass ARRAY false for NDJSON instead.

-- Export a query as a JSON array (one file = one array)
COPY (
  SELECT id, name, tags FROM users WHERE active
) TO 'active-users.json' (FORMAT JSON);

-- Export as NDJSON — one object per line, friendly for streaming consumers
COPY (
  SELECT id, name, tags FROM users WHERE active
) TO 'active-users.jsonl' (FORMAT JSON, ARRAY false);

-- Compressed NDJSON, written straight to S3 via httpfs
COPY events
TO 's3://my-bucket/exports/events.jsonl.gz'
(FORMAT JSON, ARRAY false, COMPRESSION 'gzip');

-- Single-row JSON via the to_json scalar function
SELECT to_json(t) AS row_json
FROM users t
WHERE id = 42;

DuckDB serializes nested STRUCT and LIST columns natively — STRUCT becomes a JSON object, LIST becomes a JSON array — so a round trip through Parquet, then back to JSON, preserves shape exactly. The JSON typed columns pass through unchanged, written as parsed JSON values rather than re-escaped strings.

to_json(row) is the scalar form, useful when you want a single column of JSON-serialized rows alongside other columns. to_json(row, indent => 2) pretty-prints; skip the option for compact output. For exporting a whole table as one JSON document, the COPY form is faster and more memory-friendly.

DuckDB JSON vs PostgreSQL JSONB vs SQLite JSON

All three databases offer JSON support, and the surface looks similar — same arrow operators, same json_extract idea — but the storage model and intended workload differ enough that picking the right tool matters.

PropertyDuckDBPostgreSQL JSONBSQLite JSON
WorkloadOLAP — column scans, aggregationsOLTP — row-level reads/writesOLTP — embedded, small dataset
StorageColumnar (per-field arrays)Row-based (binary JSON blob)Row-based (text by default; JSONB since 3.45)
Path operators->, ->>, json_extract->, ->>, jsonb_extract_path->, ->>, json_extract
JSON indexingZone maps + bloom on STRUCT fieldsGIN index on JSONB pathsExpression indexes on extracted paths
File-as-table readsread_json on local files, S3, HTTPNo — must INSERT firstNo — must INSERT first
NDJSON supportNative via format flagManual line-by-line loadManual
STRUCT promotionCast JSON to STRUCT directlyNot applicable — JSONB stays a blobNot applicable

The headline difference: DuckDB treats JSON files as queryable tables without any load step, and once loaded it stores nested data column-wise so analytical queries scan only the fields they reference. PostgreSQL JSONB is faster for point lookups inside a JSONB document and integrates with transactional workloads, but it stores each row's JSON as a single blob — so scanning a million rows to compute an aggregate over one nested field still reads every blob. SQLite sits between the two on scale; great for embedded use but not built for terabytes.

Sibling guides for warehouse-scale workloads: see JSON to BigQuery, JSON to Snowflake, and JSON to ClickHouse. DuckDB is the local, free, no-server option in that family — same OLAP shape, no infrastructure.

Key terms

read_json / read_json_auto
DuckDB table-producing functions that parse a JSON file (or glob of files) and return a typed relation. The two names are aliases — _auto is kept for backward compatibility with pre-0.8 code.
JSON type
DuckDB's logical type for storing parsed JSON values inside a column. Supports path operators (->, ->>) and conversion to other types via CAST or json_extract. Best for variable-shape data.
STRUCT type
DuckDB's physical nested type — each sub-field is stored as its own typed column, addressable with dot notation. Faster than JSON for known-shape data because the column store can prune and vectorize per sub-field.
NDJSON / JSON Lines
A JSON format where each line is one self-contained JSON object, with no surrounding array. DuckDB reads it via format = 'newline_delimited' and treats every line as a row.
httpfs extension
DuckDB extension that adds support for http://, https://, s3://, and gs:// paths to all file-reading functions, including read_json. Install once and remote URIs work everywhere local paths do.
hive_partitioning
A reading mode where key=value segments in a file path are exposed as virtual columns on the relation. Enables partition pruning so WHERE clauses skip entire files before any bytes are read.

Frequently asked questions

How do I load a JSON file into DuckDB?

The shortest path is SELECT * FROM read_json('file.json'). The JSON extension ships built-in with DuckDB since version 0.8, so no INSTALL step is needed on a default build. read_json scans the file, samples records to infer a schema, and returns a relation you can query immediately or materialize with CREATE TABLE t AS SELECT * FROM read_json('file.json'). If the file is a JSON array of objects, every object becomes a row; if it is NDJSON (one object per line), pass format='newline_delimited'. For one-off ad-hoc queries against a remote URL, install the httpfs extension once and read_json accepts http://, https://, s3://, and gs:// paths directly — no download step required. For very large files, the COPY ... FROM 'file.json' syntax is also supported and is sometimes faster than SELECT for pure-ingest workloads.

Does DuckDB auto-detect a JSON file's schema?

Yes. read_json (and the alias read_json_auto) samples the first N records — by default 20,480 — and infers column names, types, and nesting from what it sees. The result is a typed relation: integers become BIGINT, decimals become DOUBLE, strings become VARCHAR, arrays become LIST, and nested objects become STRUCT with sub-columns you can reference by dot notation. Auto-detection is fast and accurate for well-shaped data, but two gotchas apply. First, columns that are absent or null in every sampled record may be missing or typed as NULL — increase the sample with sample_size parameter or pass an explicit columns map. Second, columns that switch types between records (sometimes integer, sometimes string) collapse to JSON or VARCHAR depending on the union_by_name and maximum_object_size settings. For predictable production loads, declare columns explicitly.

What's the difference between -> and ->> in DuckDB?

DuckDB borrows the PostgreSQL JSON path operators verbatim. The single arrow -> extracts a value and returns it as JSON — useful when you want to keep navigating, e.g., data->'user'->'address'->'city' all chain because every intermediate result is still JSON. The double arrow ->> extracts a value and casts it to TEXT (VARCHAR) — useful when the next thing you need is a string for comparison, output, or grouping: data->>'email' yields a plain string you can WHERE against without quoting. Numeric keys index arrays: data->0 returns the first element as JSON, data->>0 returns it as text. The mental rule: use -> while you are still digging, use ->> on the final hop. Both operators work on the JSON type and on VARCHAR columns that contain JSON text; the latter is parsed implicitly on each call, so for hot paths cast once to JSON or use json_extract with a JSONPath.

How do I load NDJSON / JSON Lines into DuckDB?

Pass format='newline_delimited' (or format='nd') to read_json: SELECT * FROM read_json('events.jsonl', format='newline_delimited'). DuckDB then treats each line as an independent JSON object and stitches them into a single relation with one row per line — the natural shape for log streams, Kafka exports, and BigQuery JSON exports. If you do not pass the format argument, DuckDB tries to auto-detect: a file starting with [ is treated as a JSON array, while a file starting with { on every line is treated as NDJSON. Auto-detection is usually right but explicit is faster and more reliable for scripts. For gzipped NDJSON, DuckDB transparently decompresses .gz and .zst files — no preprocessing needed. Glob patterns work too: read_json('logs/*.jsonl') reads every matching file as one logical relation, which is the easiest way to consume a partitioned export.

Can DuckDB query JSON directly from S3 or HTTP without loading?

Yes. Install and load the httpfs extension once (INSTALL httpfs; LOAD httpfs;) and read_json accepts s3://, gs://, https://, and http:// URIs as if they were local paths. For S3, configure credentials via the standard AWS environment variables or with CREATE SECRET — DuckDB picks them up automatically. The query runs streaming: only the bytes needed to satisfy the query are pulled across the network, so SELECT count(*) FROM read_json('s3://bucket/data/*.json') is feasible against terabytes of data without downloading anything. Range requests on Parquet are smarter than on JSON (JSON has no column index), but DuckDB can still prune entire files by filename when you use Hive-style partitioning. For repeated queries against the same remote data, materialize a local copy with CREATE TABLE t AS SELECT * FROM read_json('s3://...') — subsequent queries hit the local column store at full speed.

How do I convert a JSON column to STRUCT?

Two paths. If you know the shape, cast directly: SELECT CAST(payload AS STRUCT(id INTEGER, name VARCHAR, tags VARCHAR[])) FROM events. DuckDB parses the JSON value and produces a typed STRUCT you can address with dot notation (struct_col.id) — much faster on repeated access than calling -> on every row. If you do not know the shape, use json_structure to read the inferred schema as a string and json_transform to convert in one pass: SELECT json_transform(payload, '{"id": "INTEGER", "name": "VARCHAR"}') FROM events. The result is a STRUCT with typed sub-columns. STRUCT is the preferred storage form for known-shape data because it stores each sub-column as its own array — DuckDB skips parsing on every query and gets full column-store benefits (compression, vectorization, late materialization). Use the JSON type only when the shape is genuinely variable or when you need json_each-style iteration.

How do I export a DuckDB table as JSON?

Use COPY ... TO with format='json'. The default writes a JSON array (one giant array of row objects) which is friendly for tools that expect file-as-JSON: COPY users TO 'out.json' (FORMAT JSON). For streaming downstream consumers — log shippers, Kafka producers, downstream DuckDB loads — pass ARRAY false to emit NDJSON instead: COPY users TO 'out.jsonl' (FORMAT JSON, ARRAY false). DuckDB handles nested STRUCT and LIST columns natively — they become JSON objects and arrays in the output, so a round trip through Parquet or DuckDB preserves nested shape. Combine with COMPRESSION 'gzip' for compressed output, or write directly to s3:// when the httpfs extension is loaded. For one-row, one-column extracts (e.g., serializing a single row as JSON), the to_json scalar function works inside SELECT: SELECT to_json(t) FROM users t LIMIT 1.

Is DuckDB JSON faster than jq for analytics?

For anything beyond a one-off transform, yes — by one to two orders of magnitude on commodity hardware. jq is a streaming text processor: it parses, transforms, and emits text in a single pass with limited parallelism. DuckDB parses JSON into a columnar in-memory representation, vectorizes operators (filters, aggregations, joins) across batches of rows, and uses every available CPU core. On a modern laptop, read_json sustains 100k to 1M records per second for typical NDJSON workloads, and follow-up aggregations run at near-native columnar speeds. Where jq still wins: pipeline-friendly one-shot transforms (jq '.users[] | select(.active)' file.jsonl is easier than the DuckDB equivalent for a single throwaway query), and environments without DuckDB installed. For repeated queries, joins, or anything aggregating more than ~10 MB of JSON, switch to DuckDB and feel the difference immediately.

Further reading and primary sources