JSON to ClickHouse: New JSON Type, JSONEachRow Format, and Fast Ingestion

Last updated:

ClickHouse handles JSON better than most analytics engines on the market — and after the 24.10 release in late 2024, the story changed materially. The JSON column type is production-ready, the JSONEachRow format sustains 1–2 million rows per second of insert throughput on a single node, and schema inference via clickhouse-localturns "here is a JSON file" into "here is a ready-to-paste CREATE TABLE" in one command. This guide covers the three formats you actually use (JSONEachRow, JSONCompactEachRow, JSONColumns), the new dynamic JSON type and its alternatives (Map(String, String), Tuple), inserting from files and streams, dot-path querying, performance tuning, and streaming from Kafka through the Kafka engine. Examples target ClickHouse 25.x and assume clickhouse-client or clickhouse-local on the path.

Loading a JSON file and the parser is rejecting it? Paste the source into Jsonic's JSON Validator first — it flags trailing commas, single quotes, unescaped newlines, and the other small issues that make JSONEachRow fail with a cryptic position offset.

Validate JSON before ingest

ClickHouse JSON formats: JSONEachRow vs JSONCompact vs JSONColumns

ClickHouse supports more than a dozen JSON input formats, but three carry nearly all the traffic in practice. The right choice depends on whether the producer controls column order, how much wire overhead you can tolerate, and whether the source is row-oriented or column-oriented.

JSONEachRow is the default and what you reach for first. One JSON object per line, named keys, identical to JSONL/NDJSON. Resilient to column reordering, tolerant of missing fields (they become defaults or NULL), and parseable in parallel via input_format_parallel_parsing=1.

JSONCompactEachRow trades robustness for size. One JSON array per line, values in positional column order — 30–40 percent smaller on the wire and a touch faster to parse. Useful when you control both ends and care about throughput; fragile across schema changes because reordering a column silently corrupts the data.

JSONColumnsflips the layout. Instead of one object per row, the payload is one object per column, each holding an array of values. This matches ClickHouse's columnar internal layout and parses with near-zero transformation overhead — the format of choice when you control the exporter and want the absolute maximum ingest rate.

FormatShapeWire sizeRobust to column reorderBest for
JSONEachRowOne object per line, named keysBaselineYesExternal sources, logs, evolving schemas
JSONCompactEachRowOne array per line, positional~30–40% smallerNoInternal pipelines with stable schema
JSONColumnsOne object total, arrays per columnSmallestYes (by key)Bulk loads from controlled exporters
JSONStringsEachRowLike JSONEachRow but all values are stringsLargerYesSchema-on-read, untyped sources
JSONAsStringOne JSON value per row, stored as StringN/AN/ARaw payload archival, schema-later workflows

If you are arriving from BigQuery, Snowflake, Redshift, or DuckDB, the closest analog to JSONEachRow is NDJSON — the same on-the-wire layout. See the DuckDB JSON comparison for a format-by-format mapping, and the sibling guides for BigQuery, Snowflake, and Redshift.

The new JSON type (production-ready in 24.10, GA in 25.x)

For most of ClickHouse's history, "JSON column" meant either String with JSONExtract functions at query time, or the experimental Object('json') type behind a feature flag. The 24.10 release shipped a redesigned JSON type as production-ready, and the 25.x line consolidated it as the recommended path for semi-structured data.

Under the hood, a JSON column stores each distinct path as its own subcolumn — written columnar, compressed independently, and read on demand. The first time ClickHouse sees payload.user.id, it materializes a subcolumn; the second time, it appends. Paths that change type across rows (sometimes a string, sometimes an integer) are stored as Dynamic, which records the value plus its actual type per row.

-- Pure JSON column
CREATE TABLE events (
  ts        DateTime64(3),
  payload   JSON
) ENGINE = MergeTree
ORDER BY ts;

-- Mixed: typed columns for the hot path, JSON for the rest
CREATE TABLE events_hybrid (
  event_id    UInt64,
  user_id     UInt64,
  event_name  LowCardinality(String),
  ts          DateTime64(3),
  payload     JSON(max_dynamic_paths = 1024, max_dynamic_types = 16)
) ENGINE = MergeTree
ORDER BY (event_name, ts, user_id);

The max_dynamic_paths hint caps how many distinct paths get their own materialized subcolumn before the rest share a fallback store — a safety net for wildly heterogeneous documents where the long tail of paths would otherwise blow up the part count.

Map(String, String) and Tuple alternatives for semi-structured data

The JSON type is the right answer most of the time, but two older options remain useful for specific shapes.

Map(String, String) fits flat key-value bags — HTTP headers, event attributes, environment metadata — where every value is text and you never need numeric or boolean fidelity. Maps support direct subscript access (headers['user-agent']) and bloom filter indexes for fast membership queries. They do not nest, so any structure deeper than one level forces you to flatten or encode as a string.

CREATE TABLE http_logs (
  ts        DateTime64(3),
  method    LowCardinality(String),
  path      String,
  status    UInt16,
  headers   Map(String, String),
  tags      Map(LowCardinality(String), String)
) ENGINE = MergeTree
ORDER BY (ts, status);

-- Query via subscript
SELECT headers['user-agent'], count()
FROM http_logs
WHERE ts > now() - INTERVAL 1 HOUR
GROUP BY 1;

Tuple(...) and named tuples model fixed-shape nested objects — a coordinate pair, an address, a metric snapshot — where every row has the same fields and the same types. Tuples are pure structural typing: no per-row type overhead, columnar storage on each member, and dot-name access in queries.

CREATE TABLE locations (
  user_id    UInt64,
  coord      Tuple(lat Float64, lon Float64),
  address    Tuple(city LowCardinality(String), country FixedString(2))
) ENGINE = MergeTree
ORDER BY user_id;

SELECT coord.lat, address.country FROM locations LIMIT 10;

Picking between them: use a Tuple when the shape is fixed and known at table-creation time, a Map when keys are open but values are uniformly typed, and the JSON type when both keys and value types vary across rows.

INSERT INTO ... FORMAT JSONEachRow from a file or stream

The canonical loader is clickhouse-client with a piped file and a query that ends in FORMAT JSONEachRow. The client streams the input straight to the server without buffering, so file size is bounded by disk and network, not client memory.

# Local file -> table
cat events.jsonl | clickhouse-client \
  --query="INSERT INTO events FORMAT JSONEachRow"

# Gzipped file
gunzip -c events.jsonl.gz | clickhouse-client \
  --query="INSERT INTO events FORMAT JSONEachRow"

# Forgiving load: drop unknown fields, treat empty strings as NULL
cat events.jsonl | clickhouse-client --query="
  INSERT INTO events
  SETTINGS
    input_format_skip_unknown_fields = 1,
    input_format_null_as_default = 1,
    input_format_parallel_parsing = 1
  FORMAT JSONEachRow
"

# Directly from S3 (no intermediate file)
clickhouse-client --query="
  INSERT INTO events
  SELECT * FROM s3('s3://bucket/events/*.jsonl', 'JSONEachRow')
"

The input_format_skip_unknown_fields setting matters more than any other when ingesting from a source you do not fully control. Without it, one new field added upstream breaks the entire INSERT. With it, the field is dropped silently — log the warning if you need to know, but ingestion keeps running.

For one-off transformations without a running server, clickhouse-local runs the same engine in-process against the same set of table functions (file, url, s3, gcs).

Schema inference via desc table + clickhouse-local

Before you write CREATE TABLE by hand, ask ClickHouse what it sees. TheDESC command works against format functions, file paths, URLs, and S3 objects — anywhere the engine can read a sample.

# Infer schema from an inline JSON sample
clickhouse-local --query "
  DESC FORMAT('JSONEachRow', '
    {"id":1,"name":"Alice","ts":"2026-05-23T10:00:00Z","tags":["a","b"]}
    {"id":2,"name":"Bob","ts":"2026-05-23T10:00:01Z","tags":[]}
  ')
"

# Infer from a file
clickhouse-local --query "
  DESC TABLE file('sample.jsonl', 'JSONEachRow')
  SETTINGS
    input_format_try_infer_integers = 1,
    input_format_try_infer_dates = 1,
    input_format_try_infer_datetimes = 1
"

# Infer from S3
clickhouse-local --query "
  DESC TABLE s3('s3://bucket/sample.jsonl', 'JSONEachRow')
"

The inference settings flip several defaults. With input_format_try_infer_integers=1, numeric fields become UInt64/Int64 instead of Float64. With input_format_try_infer_dates=1 and input_format_try_infer_datetimes=1, ISO-8601 strings become typedDate/DateTime64 columns. These defaults are off because they are heuristics — a column that looks numeric in the sample might contain a string later in the file — but for clean sources they save real time.

Use the inferred output as a starting point, not a final answer. Inference cannot know which columns belong in the sort key, which strings should be LowCardinality, or which integers fit in UInt32 instead ofInt64. Tighten by hand before the table goes to production.

Querying JSON columns: paths, subcolumns, materialized views

Reading from a JSON-typed column uses dot-path syntax — every nested field is accessible as if it were a regular column.

-- Direct path access on a JSON column
SELECT
  payload.user.id        AS user_id,
  payload.user.email     AS email,
  payload.event.name     AS event_name,
  count()                AS n
FROM events
WHERE payload.event.name = 'signup'
  AND ts > now() - INTERVAL 1 DAY
GROUP BY user_id, email, event_name
ORDER BY n DESC
LIMIT 100;

-- Cast a dynamic field when the type varies across rows
SELECT
  payload.amount.:Int64    AS amount_int,
  payload.amount.:String   AS amount_str
FROM events
WHERE payload.amount IS NOT NULL;

For columns stored as plain String (legacy tables, raw payload archives), the JSONExtract* family parses on every query. JSONExtractString, JSONExtractInt, JSONExtractFloat, JSONExtractBool, and JSONExtractArrayRaw cover most cases, and simpleJSONExtractString is a faster non-validating variant for shallow keys when you trust the input.

For hot paths, lock in the work with a materialized view. The view runs the JSON-to-columns extraction once at insert time and stores typed values in a regular MergeTree table — queries against the view are pure columnar with zero JSON parsing.

-- Source: raw events with JSON payload
-- Target: pre-extracted typed columns for the dashboard
CREATE MATERIALIZED VIEW events_signup_mv
ENGINE = MergeTree
ORDER BY (ts, user_id)
AS SELECT
  ts,
  payload.user.id        AS user_id,
  payload.user.email     AS email,
  payload.event.source   AS source,
  payload.utm.campaign   AS campaign
FROM events
WHERE payload.event.name = 'signup';

Performance: dynamic JSON vs explicit columns

On a clean benchmark, explicit typed columns beat the JSON type by roughly 2–3x on insert throughput and 1.5–2x on scan-heavy queries against fields that exist in both layouts. That gap shrinks fast once the workload becomes realistic: queries touch only a few paths, the JSON type materializes those paths as subcolumns, and the per-query work converges with the explicit-column case.

WorkloadExplicit columnsJSON typeString + JSONExtract
Bulk insert (1B rows)1.5–2M rows/sec1–1.5M rows/sec1–1.2M rows/sec
Single-field scanBaseline1.2–1.5x slower3–5x slower
Wide projection (10+ fields)Baseline1.5–2x slower5–10x slower
Schema evolution frictionHigh (ALTER per field)None (auto-materialize)None
Storage overhead per rowLowest~10–20% higher~30–50% higher

Tuning levers for fast JSON ingest:

  • Batch size: 10k–100k rows per INSERT. Anything smaller starves the MergeTree merger and tanks throughput.
  • async_insert: server-side batching when the client cannot batch. Set async_insert=1 and wait_for_async_insert=0 for fire-and-forget; the server buffers and flushes in optimal batches.
  • input_format_parallel_parsing: default 1 in modern versions — confirm it is on for large rows. Single-threaded parsing of fat documents is the silent throughput killer on otherwise idle clusters.
  • Native TCP > HTTP: clickhouse-client uses the native protocol by default and parses faster than the HTTP interface for large payloads.
  • max_dynamic_paths: cap heterogeneous JSON to keep the part count bounded. The default works for most workloads; lower it when you see slow merges.

Streaming JSON from Kafka with Kafka engine

The native pattern for high-volume JSON ingestion is the Kafka engine. ClickHouse acts as a consumer, reads JSONEachRow-encoded messages from a topic, batches them into a materialized view, and writes to a MergeTree table. No external connector, no separate streaming runtime — the database is the consumer.

-- 1. Storage table (where the data actually lives)
CREATE TABLE events (
  ts          DateTime64(3),
  event_id    UInt64,
  user_id     UInt64,
  event_name  LowCardinality(String),
  payload     JSON
) ENGINE = MergeTree
ORDER BY (event_name, ts);

-- 2. Kafka engine table (the consumer)
CREATE TABLE events_kafka (
  ts          DateTime64(3),
  event_id    UInt64,
  user_id     UInt64,
  event_name  String,
  payload     JSON
) ENGINE = Kafka
SETTINGS
  kafka_broker_list = 'broker1:9092,broker2:9092',
  kafka_topic_list = 'events',
  kafka_group_name = 'clickhouse-events-prod',
  kafka_format = 'JSONEachRow',
  kafka_num_consumers = 4,
  kafka_max_block_size = 65536,
  kafka_skip_broken_messages = 100,
  kafka_handle_error_mode = 'stream';

-- 3. Materialized view (wires consumer -> storage)
CREATE MATERIALIZED VIEW events_kafka_mv TO events AS
SELECT
  ts, event_id, user_id, event_name, payload
FROM events_kafka;

The engine accumulates messages until either kafka_max_block_size rows or stream_flush_interval_ms milliseconds pass, then flushes the batch in one bulk write — the same insert path that hits 1–2M rows/sec on local files.

Error handling: kafka_skip_broken_messages = N tolerates N malformed messages per block before failing. Set kafka_handle_error_mode = 'stream' to route broken messages to a virtual _error column instead of failing — wire a second materialized view to a dead-letter table that selects on length(_error) > 0 for inspection.

Offsets commit to Kafka only after the write to the storage table succeeds, so the guarantee is at-least-once delivery — a crashed consumer resumes from the last committed offset and may re-deliver the in-flight batch. For exactly-once semantics, use a deterministic event_id and the ReplacingMergeTree or a ReplicatedMergeTree with deduplication windows.

Key terms

JSONEachRow
ClickHouse's row-oriented JSON format: one JSON object per line, named keys. Wire-compatible with NDJSON and JSONL. The default choice for ingest when the source schema may shift.
JSON type
The production-ready dynamic column type introduced in 24.10. Stores semi-structured documents per row and materializes each path as a typed subcolumn. Supersedes the experimental Object('json') type.
Dynamic type
The underlying type used by JSON to hold values whose type varies across rows. Stores the value plus its actual type, and supports cast-at-query-time via .:TypeName suffix syntax.
clickhouse-local
An in-process build of the ClickHouse engine — no server, no daemon, no port. Runs the same SQL and the same table functions (file, url, s3) as a full server. The go-to tool for one-off JSON transformations and schema inference.
Kafka engine
A table engine that turns ClickHouse into a Kafka consumer. Paired with a materialized view, it streams messages from a topic into a MergeTree storage table with at-least-once delivery and automatic batching.
MergeTree
ClickHouse's primary storage engine family. Columnar, ordered by a sort key, with background merging of inserted parts. The destination table for nearly all JSON ingestion pipelines.
input_format_skip_unknown_fields
The insert-time setting that tells the parser to silently drop JSON fields that do not match a table column. Critical for surviving upstream schema additions without breaking ingestion.

Frequently asked questions

Does ClickHouse have a native JSON type in 2026?

Yes. The JSON column type went production-ready in ClickHouse 24.10 (October 2024) and reached GA across the 25.x line. Before 24.10 it shipped as an experimental Object(Nullable(Json)) type that you had to opt into with allow_experimental_object_type — that older type is now superseded and not recommended for new tables. The current JSON type stores semi-structured documents per row, automatically materializes paths into typed subcolumns, and supports dot-path access in SELECT queries. Each path is stored columnar under the hood, so queries that touch a single nested field read only that field from disk rather than parsing the whole document. You declare the column with the type JSON, optionally with hints like JSON(max_dynamic_paths=1024) to cap how many distinct paths get their own subcolumn before the rest fall back to a shared dynamic store. For workloads where shape varies row to row, the JSON type avoids the alternative of wide tables with thousands of mostly-null columns.

What's the difference between JSONEachRow and JSONCompactEachRow?

JSONEachRow expects one JSON object per line, where each object has named keys matching the table columns: {"id": 1, "name": "Alice"}. It is the same wire format as NDJSON or JSONL and is the default choice for ingesting documents where field order is not stable. JSONCompactEachRow expects one JSON array per line with values in positional column order: [1, "Alice"]. Compact form is roughly 30–40 percent smaller on the wire because it drops the field names, and it parses slightly faster because the parser does not look up column indices by name — but it requires that the producer and consumer agree on column order, which is brittle across schema migrations. Use JSONEachRow when documents come from external systems, logs, or anywhere the field set might shift; use JSONCompactEachRow when you control both ends and want maximum throughput. Both formats accept the input_format_skip_unknown_fields=1 setting to silently drop unrecognized keys.

How do I load a JSON file into ClickHouse via clickhouse-client?

Pipe the file through clickhouse-client with a query that ends in FORMAT JSONEachRow: cat events.jsonl | clickhouse-client --query="INSERT INTO events FORMAT JSONEachRow". The client streams the file to the server without buffering it all in memory, which means multi-gigabyte files load without RAM pressure. For gzipped inputs, pipe through gunzip first or use the --compression argument. For loading directly from a URL or S3, use the url() or s3() table functions inside the INSERT: INSERT INTO events SELECT * FROM s3('s3://bucket/events.jsonl', 'JSONEachRow'). To make ingestion forgiving when source documents have extra fields the table does not know about, add SETTINGS input_format_skip_unknown_fields=1 to the INSERT. For one-off transformations without spinning up a server, clickhouse-local reads the same formats with the same syntax against an embedded in-process engine.

Should I use the JSON type or split into typed columns?

Split into typed columns when the schema is stable, the field set is small (under a few hundred), and you know which fields drive the hot queries — explicit columns let ClickHouse build sort keys and skip indexes that the JSON type cannot match. Use the JSON type when documents vary row to row, the field set is large or unbounded, or you need to ingest now and figure out the schema later. A practical middle ground works well: declare the hot fields (user_id, event_name, timestamp) as explicit typed columns and put the rest of the payload in a single JSON column called raw. Queries on the hot path read typed columns directly; ad-hoc queries reach into raw via dot-path access. This pattern gives you the performance of explicit columns where it matters and the flexibility of dynamic JSON for everything else, without forcing you to choose one all-or-nothing approach at table creation time.

How do I query nested JSON paths in ClickHouse?

For a column declared as the JSON type, use dot-path syntax directly in SELECT: SELECT payload.user.id, payload.user.email FROM events. ClickHouse materializes each path as a subcolumn the first time it sees data for that path, so subsequent reads are pure columnar — only the bytes for that one path are loaded. For a column declared as String that happens to contain JSON, use JSON functions: JSONExtractString(payload, 'user', 'id'), JSONExtractInt(payload, 'count'), or simpleJSONExtractString(payload, 'user') for shallow keys. The JSON type is strictly faster than String + JSONExtract for repeated queries on the same paths because the work of parsing happens once at ingest time, not on every query. For semi-structured fields that vary in type across rows, the Dynamic type (used internally by JSON) is exposed directly — it stores the underlying value plus its type and lets you cast at query time.

Why is my JSON insert slower than expected?

The most common cause is small batches. ClickHouse is built for bulk inserts — a healthy JSONEachRow insert sustains 1–2 million rows per second on a single node, but only when each INSERT contains tens of thousands of rows. If your producer sends one row per INSERT, you will see throughput collapse to a few thousand per second because every insert creates a new MergeTree part that the background merger then has to consolidate. Fixes in order of effort: batch on the client side (buffer 10k+ rows before flushing), use the async_insert setting (server-side batching with wait_for_async_insert=0 for fire-and-forget), or front the cluster with a Kafka engine table that batches naturally. Other slowdowns: HTTP interface is slower than the native TCP protocol for large payloads, gzip compression on the wire helps if your CPU has headroom, and input_format_parallel_parsing=1 keeps the parser from being a single-thread bottleneck on large rows.

How do I stream JSON from Kafka into ClickHouse?

Use the Kafka engine. You create three things: a Kafka engine table that ClickHouse uses as a consumer, a regular MergeTree table that stores the data persistently, and a materialized view that reads from the Kafka table and writes to the MergeTree. The Kafka table is configured with the broker list, topic, consumer group, and the same format names used elsewhere (JSONEachRow is the typical choice). ClickHouse handles batching automatically — it accumulates messages until either kafka_max_block_size rows or kafka_poll_timeout_ms milliseconds pass, then flushes to the materialized view in one bulk write. This pattern sustains hundreds of thousands of messages per second per consumer and survives broker restarts without data loss because Kafka commits offsets only after the write to ClickHouse succeeds. For higher throughput, run multiple consumer threads with kafka_num_consumers, and for resilience set kafka_handle_error_mode='stream' so bad messages route to a side stream instead of stalling the pipeline.

Can ClickHouse infer a schema from a JSON sample?

Yes, in two ways. The first is DESC against a format function: DESC FORMAT('JSONEachRow', '{"id":1,"name":"Alice","ts":"2026-05-23T10:00:00Z"}') returns a column list with inferred types ready to paste into CREATE TABLE. The second is clickhouse-local, which runs the same engine in-process without a server: clickhouse-local --query="DESC TABLE file('sample.jsonl', 'JSONEachRow')" reads a sample file and prints the inferred schema. Both routes use the same inference logic — numeric fields become Float64 by default (UInt64/Int64 if you set input_format_try_infer_integers=1), date-like strings become DateTime64 if input_format_try_infer_dates=1, and missing fields become Nullable. For production tables, use inference as a starting point and then tighten types by hand: switch Float64 to UInt32 where you know the range, set LowCardinality(String) on enum-like fields, and decide which columns belong in the sort key.

Further reading and primary sources