JSON to BigQuery: Loading, Querying, and the Native JSON Type

Last updated:

BigQuery accepts JSON in one specific shape: NDJSON (newline-delimited JSON), one object per line, loaded with bq load --source_format=NEWLINE_DELIMITED_JSON or the equivalent Storage API call. A native JSON column type has been GA since October 2022 — it stores semi-structured data with the same columnar pruning as typed columns, so querying one field of a nested payload scans only that field, not the whole document. Choosing among JSON, STRING, and STRUCT is the main modeling decision: STRUCT wins on stable schemas, JSON wins on schema churn, and STRINGis rarely the right answer because queries on it scan and re-parse every byte. Below covers the load formats, when each storage type fits, schema autodetect's failure modes, the JSON query function family (JSON_VALUE, JSON_QUERY, the LAX casts), and how to avoid the 10× cost penalty of querying JSON stored as STRING.

Got a JSON file that bq load keeps rejecting? Paste it into Jsonic's JSON Validatorfirst — it pinpoints trailing commas, BOMs, and stray array wrappers with exact line numbers, and it'll flag if you have a JSON array where BigQuery wants NDJSON.

Validate JSON before loading

BigQuery JSON loading: NDJSON vs JSON array (NDJSON wins)

BigQuery does not load JSON arrays. The supported wire format is NDJSON — one complete JSON object per line, no enclosing [, no commas between records, no pretty-printing across lines. This is the same JSON Lines / NDJSON format used by most analytical warehouses, log aggregators, and ML training pipelines, because it streams record-by-record without parsing the whole file into memory.

{"id": 1, "user": {"email": "alice@example.com"}, "tags": ["beta", "paid"], "ts": "2026-05-23T10:00:00Z"}
{"id": 2, "user": {"email": "bob@example.com"},   "tags": ["free"],         "ts": "2026-05-23T10:00:01Z"}
{"id": 3, "user": {"email": "carol@example.com"}, "tags": ["beta"],         "ts": "2026-05-23T10:00:02Z"}

If your source is a JSON array, convert it before loading. The two reliable one-liners:

# With jq (handles arbitrary nesting):
jq -c '.[]' input.json > input.ndjson

# In Python (when records come from an API or generator):
import json
with open('input.ndjson', 'w') as f:
    for record in records:
        f.write(json.dumps(record) + '\n')

The 20 MB row size limit applies per line, not per file — a 200 GB NDJSON file with average 1 KB rows loads fine. Files over 4 GB must come from Cloud Storage; the bq loadCLI's direct-upload path caps at 4 GB. For pipelines, write NDJSON to GCS and trigger a load job — that path scales horizontally and is what the Storage Write API uses internally for batch ingestion.

Native JSON type vs STRING vs STRUCT: when to use each

BigQuery offers three ways to store JSON, and the choice drives both query cost and schema evolution flexibility. The trade-off is not subtle: pick wrong and you pay 10× the bytes on every query, or you ALTER TABLE every time a producer adds a field.

TypeStorageQuery cost (one field)Schema evolutionBest for
STRUCTColumnar per fieldLowest — only the field scannedALTER TABLE ADD COLUMN; no removesStable analytics events, dimensional tables
JSONColumnar per leaf path (shredded)Near-STRUCT — only the path scannedFree — new keys appear automaticallyWebhooks, third-party payloads, IoT telemetry
STRINGOpaque bytes~10× higher — whole column scanned, re-parsed per rowFree, but no enforcementArchival, byte-exact preservation, audit logs

STRUCT is the right default when the producer is under your control and the contract is stable. You get type checking at write time, the cleanest query syntax (event.user.email rather than a function call), and the smallest storage footprint because BigQuery stores each field in its own typed columnar block. The cost is migrations: every new producer field requires ALTER TABLE, and you cannot remove fields once added.

The native JSON type, GA since October 2022, gives you schema flexibility without the cost. Internally BigQuery shreds the JSON into typed columns per leaf path — so SELECT JSON_VALUE(payload, '$.user.email') reads only the payload.user.email column, not the whole document. Repeated keys are dictionary-encoded, and the format degrades gracefully when types vary across rows.

STRING is almost never the right choice for queryable data. The bytes are opaque to the engine: every query that references the column scans all of it and re-parses each row inside JSON_VALUE. Reserve STRING for cold archival of original payloads where you need byte-exact preservation (compliance, signatures), and pair it with a JSON column carrying the queryable projection.

Schema autodetect: when it works and when it fails

bq load --autodetect samples up to the first 100 rows (capped at 100 MB) and infers a schema: types per field, nullability, and nested STRUCT shapes for objects. It is convenient for one-off exploration and dangerous for production.

bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  --autodetect \
  --max_bad_records=10 \
  my_dataset.exploratory_table \
  gs://my-bucket/sample.ndjson

Autodetect works well when:

  • Every row has the same field set (no optional or sparse keys)
  • Each field has one consistent type across the sample and the rest of the file
  • Arrays contain elements of one type only (no mixed [1, "two"])
  • The data is small enough that 100 rows is representative

The failure modes are predictable. Integer overflow: the sample shows values that fit in INT64, then row 500,000 has a value larger than 2^63-1 and the load fails. Null-only sample columns: a field that is null in the first 100 rows gets typed as STRING; later rows with numbers or booleans then either get rejected or coerced to strings, breaking downstream queries. Sparse fields: a field present in 30% of rows might not appear in the sample at all and gets dropped from the schema, silently discarding data.

For production loads, write a schema.json by hand or generate it from a type definition. The file is a flat array of column descriptors with name, type, mode, and optional fields for STRUCT. For variable-shape data, declare one JSON column and skip the type inference entirely — the JSON type has no schema to mis-infer.

Querying JSON: JSON_VALUE, JSON_QUERY, JSON_EXTRACT

BigQuery has two generations of JSON functions. The modern family — JSON_VALUE, JSON_QUERY, JSON_VALUE_ARRAY, JSON_QUERY_ARRAY, and the LAX_* casts — follows the SQL/JSON standard. The legacy family — JSON_EXTRACT, JSON_EXTRACT_SCALAR — predates the JSON type and remains for backward compatibility. New code should use the modern names.

-- Scalar extraction: returns STRING (or NULL if path missing)
SELECT
  id,
  JSON_VALUE(payload, '$.user.email')       AS email,
  JSON_VALUE(payload, '$.user.profile.age') AS age_str,
  SAFE_CAST(JSON_VALUE(payload, '$.user.profile.age') AS INT64) AS age
FROM my_dataset.events
WHERE JSON_VALUE(payload, '$.user.country') = 'US';

-- Sub-JSON extraction: returns JSON, keep navigating
SELECT
  id,
  JSON_QUERY(payload, '$.user.addresses') AS addresses_json
FROM my_dataset.events;

-- Type-coercion casts that tolerate inconsistent input
SELECT
  LAX_BOOL(JSON_VALUE(payload, '$.is_active')) AS is_active,
  LAX_INT64(JSON_VALUE(payload, '$.count'))    AS count,
  LAX_STRING(JSON_VALUE(payload, '$.label'))   AS label
FROM my_dataset.events;

JSON_VALUE returns a STRING; cast it with SAFE_CAST when you need a number, or use LAX_INT64 / LAX_BOOL / LAX_STRING when the source type varies across rows (a field that is sometimes "42", sometimes 42). The LAX functions attempt sensible coercions and return NULL when no coercion fits, instead of failing the query.

For arrays, use JSON_QUERY_ARRAY with UNNEST to flatten elements into rows:

SELECT
  e.id,
  JSON_VALUE(tag, '$.name')  AS tag_name,
  JSON_VALUE(tag, '$.value') AS tag_value
FROM my_dataset.events e,
UNNEST(JSON_QUERY_ARRAY(e.payload, '$.tags')) AS tag
WHERE JSON_VALUE(tag, '$.name') = 'experiment';

JSONPath in BigQuery supports dot notation ($.user.email), bracket notation for keys with special characters ($["weird key"]), and numeric array indexing ($.items[0]). It does not support filter expressions, slice notation, or recursive descent — push those into SQL WHERE clauses on the extracted values. See JSON database queries for the cross-engine comparison of JSONPath dialects.

Cost optimization: querying STRING JSON costs 10× more than typed columns

BigQuery bills on bytes scanned. Storage type determines how many bytes a query has to read, and the gap between STRING JSON and JSON/STRUCT is order-of-magnitude. A concrete example: a 1 TB events table with a payload column averaging 4 KB per row.

QuerySTRING payloadJSON payloadSTRUCT payload
SELECT JSON_VALUE(payload, '$.user.email')~1 TB scanned~25 GB scanned (email column only)~25 GB scanned (email column only)
SELECT payload.user.email (STRUCT only)n/a — must use JSON_VALUEn/a~25 GB scanned
Cost at $6.25 / TB on-demand$6.25$0.16$0.16

The 40× gap is not a typo. Storing JSON as STRING forces BigQuery to scan and parse every byte of the column for every query, because the bytes are opaque to the engine's columnar pruning. The JSON type shreds the document into typed columns per leaf path during ingest, so a query that reads one field reads exactly one shredded column.

Migration math: if you have a STRING column storing JSON and your workload runs even a few queries per day, the rewrite to the JSON type pays for itself in days, not months. The migration is a single CREATE TABLE AS SELECT with SAFE.PARSE_JSON, plus a rename.

-- One-shot migration from STRING to JSON
CREATE TABLE my_dataset.events_v2
PARTITION BY DATE(ts)
CLUSTER BY user_id
AS
SELECT
  * EXCEPT(payload),
  SAFE.PARSE_JSON(payload) AS payload
FROM my_dataset.events;

-- Verify, then swap
DROP TABLE my_dataset.events;
ALTER TABLE my_dataset.events_v2 RENAME TO events;

For warehouse comparisons, see Snowflake comparison, Redshift comparison, and ClickHouse comparison. The cost dynamics differ — Snowflake's VARIANT pre-shreds similarly, Redshift SUPER charges differently — but the principle holds: opaque text storage costs more than typed-column storage for any non-trivial query workload.

Streaming inserts with JSON payloads

For real-time ingestion, the Storage Write API replaces the older tabledata.insertAll streaming endpoint — it is cheaper (free for the first 2 TB/month, then $0.025/GB vs $0.05/GB for legacy streaming), supports exactly-once semantics with stream offsets, and accepts JSON payloads directly. The Python client wraps it as BigQueryWriteClient with protobuf serialization under the hood; for simple ingestion, the higher-level insert_rows_json helper still works and handles the protobuf details for you.

# pip install google-cloud-bigquery
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.events"

rows = [
    {"id": 1, "payload": {"user": {"email": "alice@example.com"}, "tags": ["beta"]}},
    {"id": 2, "payload": {"user": {"email": "bob@example.com"},   "tags": ["free"]}},
]

errors = client.insert_rows_json(table_id, rows)
if errors:
    print(f"Insert errors: {errors}")
else:
    print(f"Inserted {len(rows)} rows")

For a payload JSON column, pass nested Python dicts directly — the client serializes them to the JSON column format. For STRUCT columns, the dict keys must match the schema field names exactly, and missing required fields fail the insert.

Latency: streamed rows are queryable within seconds, but they sit in a streaming buffer for up to 90 minutes before being moved to permanent partition storage. DML operations (UPDATE, DELETE) cannot touch rows still in the buffer. For pipelines that need DML, batch the writes into 1–5 minute windows and use bq load on small NDJSON files instead — same cost profile, immediate DML support.

Loading from Cloud Storage with bq load

For batch ingestion above a few hundred MB, stage to Cloud Storage and run a load job. GCS-sourced loads parallelize across slots automatically, support files up to 5 TB, and the load itself is free (you pay only GCS storage and any cross-region egress).

# Upload to GCS in the same region as your dataset
gsutil cp events.ndjson gs://my-loads/2026/05/23/events.ndjson

# Run the load with an explicit schema and partitioning
bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  --max_bad_records=10 \
  --time_partitioning_field=ts \
  --time_partitioning_type=DAY \
  --clustering_fields=user_id,event_type \
  --replace=false \
  my_dataset.events \
  gs://my-loads/2026/05/23/events.ndjson \
  schema.json

# Wildcard for many files in one job:
bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  my_dataset.events \
  'gs://my-loads/2026/05/23/*.ndjson' \
  schema.json

Key flags:

  • --max_bad_records=N — tolerate up to N malformed rows before failing the job. Default is 0 (one bad row aborts the load).
  • --replace=false (default is true with no --noreplace) — append rather than overwrite. For partitioned tables, the WRITE_APPEND pattern is the safe default.
  • --time_partitioning_field — partition the destination on a TIMESTAMP/DATE column. Required for partitioned tables on first load.
  • --clustering_fields — comma-separated list of up to four columns to cluster by. Improves filter performance for queries that include those columns in WHERE.
  • --ignore_unknown_values — silently drop fields not declared in the schema. Useful when producers add fields you don't care about yet.

Wildcards in the source URI expand to all matching files in one load job, which is both faster and cheaper than serial loads. For daily pipelines, write to a date-partitioned GCS prefix and load with a date-stamped wildcard.

Exporting query results as JSON for downstream pipelines

The reverse direction — getting JSON out of BigQuery — has two paths: bq extract for full-table or query-result exports to GCS, and bq query --format=json for small result sets straight to stdout. Both produce NDJSON by default, which is exactly what downstream tools (jq, Spark, streaming consumers) expect.

# Export a full table to NDJSON on GCS
bq extract \
  --destination_format=NEWLINE_DELIMITED_JSON \
  my_dataset.events \
  'gs://my-exports/events/events-*.ndjson'

# Stream a query result as JSON to stdout
bq query --use_legacy_sql=false --format=json --max_rows=1000 \
  'SELECT id, JSON_VALUE(payload, "$.user.email") AS email FROM my_dataset.events LIMIT 1000'

A small bq query --format=json output looks like a JSON array of objects (because the CLI buffers and prints), not NDJSON. For NDJSON output from a query, use the API directly or pipe through jq:

bq query --use_legacy_sql=false --format=json \
  'SELECT id, payload FROM my_dataset.events LIMIT 100' \
  | jq -c '.[]'

For JSON columns in the result, BigQuery emits the JSON value as a JSON literal (not a string), so the downstream consumer gets nested structures directly without having to JSON.parse. For STRUCT columns, the export nests them as JSON objects matching the schema shape. STRING columns containing JSON come out as escaped strings — parse them on the consumer side or migrate to the JSON type to avoid the double-encoding.

Key terms

NDJSON
Newline-delimited JSON — one complete JSON object per line, no enclosing array, no inter-record commas. The only JSON shape BigQuery's loader accepts. Same format as JSON Lines (.jsonl).
JSON type
BigQuery's native semi-structured column type, GA since October 2022. Stores JSON values with columnar shredding per leaf path, so queries that read one field scan only that field. Max value size 20 MB.
JSON_VALUE
SQL/JSON function that extracts a scalar from a JSON value using a JSONPath expression. Returns STRING (or NULL when the path is missing). Cast the result with SAFE_CAST or the LAX functions when you need a typed value.
JSON_QUERY
SQL/JSON function that extracts a sub-document (object, array, or scalar) from a JSON value as a JSON value, so you can keep navigating. Use this when the target is not a scalar or when you want to keep the JSON shape for further processing.
schema autodetect
The --autodetect flag for bq load that samples the first 100 rows (up to 100 MB) and infers types and nullability. Suitable for exploration; unreliable for production because integer overflow, null-only sample columns, and sparse fields all silently break.
Storage Write API
BigQuery's modern streaming ingest API — cheaper than legacy insertAll, supports exactly-once with stream offsets, and accepts JSON payloads. Rows become queryable within seconds but sit in a streaming buffer for up to 90 minutes before DML can touch them.

Frequently asked questions

How do I load a JSON array into BigQuery?

BigQuery does not load JSON arrays directly — the bq load command and the storage API both expect NEWLINE_DELIMITED_JSON (NDJSON), one JSON object per line with no enclosing array brackets and no commas between records. The standard workflow is to convert your array into NDJSON first: in Python, json.dumps(record) inside a loop writing to a .ndjson file; with jq, jq -c ".[]" input.json > output.ndjson. Then run bq load --source_format=NEWLINE_DELIMITED_JSON dataset.table gs://bucket/file.ndjson schema.json. The 20 MB row size limit applies per line, not per file — large arrays split cleanly across many lines. If you must keep the array shape, load the whole file into a single JSON column with a STRING source and parse it with PARSE_JSON in a subsequent query, but this wastes storage and query bytes because BigQuery cannot prune individual records.

What's the difference between JSON, STRING, and STRUCT for JSON data?

STRING stores the raw JSON text — BigQuery treats it as opaque bytes, every query that touches a field scans the entire column, and JSON_VALUE has to re-parse on each row. STRUCT (with nested and repeated fields) gives you a fixed schema: BigQuery stores each field in its own columnar block, queries that read one field scan only that field, and the type system catches errors at write time. The native JSON type (GA since October 2022) is the middle ground — schema-flexible like STRING but columnar like STRUCT. Internally BigQuery shreds the JSON into typed columns per leaf path, so SELECT JSON_VALUE(payload, "$.user.email") scans only the email column, not the whole payload. Use STRUCT when the schema is stable, JSON when fields vary across rows or change over time, and STRING only for archival or when you need byte-exact preservation of the original document.

Why does my JSON load fail with 'Error while reading data'?

The most common cause is loading a JSON array instead of NDJSON. bq load reports "Error while reading data, error message: JSON parsing error in row starting at position 0" when the first byte is [ rather than {. Convert with jq -c ".[]" file.json > file.ndjson and retry. Other frequent causes: a UTF-8 BOM at the start of the file (strip with sed -i "1s/^\xEF\xBB\xBF//"), unescaped newlines inside string values (each JSON record must be on exactly one physical line), and trailing commas — strict JSON forbids them and BigQuery rejects the row. If the schema is declared, a type mismatch like a number in a STRING field or a null in a REQUIRED field also fails the load. Run bq load with --max_bad_records=10 to surface multiple errors at once instead of stopping at the first one, then inspect the error stream in the job metadata.

How do I query a nested JSON field in BigQuery?

For scalar values use JSON_VALUE with a JSONPath expression: SELECT JSON_VALUE(payload, "$.user.profile.email") FROM events. JSON_VALUE returns a STRING (or NULL if the path does not exist), so cast it with SAFE_CAST or the LAX functions when you need a number or boolean. For sub-objects or arrays use JSON_QUERY, which returns a JSON value you can keep navigating: SELECT JSON_QUERY(payload, "$.user.addresses") FROM events. To explode a JSON array into rows, combine JSON_QUERY_ARRAY with UNNEST: SELECT addr FROM events, UNNEST(JSON_QUERY_ARRAY(payload, "$.addresses")) AS addr. JSONPath in BigQuery supports dot notation, bracket notation for keys with special characters ($["weird key"]), and array index ($.items[0]); it does not support filter expressions or wildcards beyond simple traversal, so complex selections still belong in WHERE clauses on extracted values.

Should I use the native JSON type or STRUCT?

Use STRUCT when the schema is stable, every row has the same fields, and you control the producer — analytics events with a fixed contract, dimensional tables, anything modeled upstream. STRUCT gives you compile-time type checking, the cleanest query syntax (event.user.email, not JSON_VALUE), and the smallest storage footprint. Use the native JSON type when fields vary by row, the producer adds new keys over time without coordination, or you ingest from many sources with different shapes — webhook payloads, third-party events, IoT telemetry with mixed device firmware. JSON columns store each observed leaf path as its own typed column under the hood, so query cost is comparable to STRUCT when you read a small subset of fields, and storage is competitive because repeated keys are dictionary-encoded. The deciding factor is schema churn, not data volume. If you only add fields and never remove them, STRUCT plus ALTER TABLE ADD COLUMN works fine; if shapes mutate freely, JSON saves the migration overhead.

How do I handle JSON arrays in BigQuery queries?

Three patterns cover most cases. To extract an array as JSON for further navigation, use JSON_QUERY: SELECT JSON_QUERY(payload, "$.tags") tags FROM events. To get a SQL ARRAY of typed values, use JSON_VALUE_ARRAY (scalars) or JSON_QUERY_ARRAY (sub-JSON): SELECT JSON_VALUE_ARRAY(payload, "$.tags") tags FROM events returns ARRAY<STRING>. To flatten an array into rows for joins or aggregation, combine JSON_QUERY_ARRAY with UNNEST: SELECT e.id, tag FROM events e, UNNEST(JSON_QUERY_ARRAY(e.payload, "$.tags")) AS tag. For arrays of objects, UNNEST gives you each element as a JSON value, and you keep extracting fields with JSON_VALUE inside the SELECT. Watch the cost: UNNEST multiplies row count, so a 1M-row table with an average of 5 tags becomes 5M rows after flattening — fine for analysis, expensive if you forget a WHERE clause on the outer table.

Can BigQuery autodetect a schema from JSON?

Yes, with caveats. Pass --autodetect to bq load and BigQuery samples the first 100 rows (up to 100 MB) to infer types and nullability. It works well for flat, homogeneous data where every row has the same fields and the type of each field is unambiguous. It fails in three common ways: integers that exceed INT64 in some rows get inferred as INT64 from the sample and then the actual load row rejects them; fields that are null in every sampled row are inferred as STRING and downstream typed code breaks; and arrays of mixed scalar types (e.g., [1, "two", 3]) get rejected entirely because BigQuery has no union types. For production loads, write an explicit schema.json file — it is more work once and removes the entire class of "worked in dev, failed in prod" surprises. Autodetect is fine for one-off exploratory loads where you will discard the table.

How do I cast a STRING column to JSON in an existing table?

BigQuery does not let you change a STRING column to JSON in place — column types are immutable. The standard migration is a rewrite: CREATE TABLE dataset.events_v2 AS SELECT * EXCEPT(payload), SAFE.PARSE_JSON(payload) AS payload FROM dataset.events. Use SAFE.PARSE_JSON instead of PARSE_JSON so malformed rows return NULL rather than failing the whole job — log them and decide whether to drop or fix. After verifying row counts and a few sample queries, drop the old table and rename the new one. If the table is partitioned or clustered, declare the partition/cluster spec on the CREATE TABLE; otherwise you lose the layout. For very large tables, do the rewrite partition by partition with MERGE statements to keep job sizes manageable and to preserve the partition-level access controls. The opposite direction (JSON to STRING) is a simple CAST: TO_JSON_STRING(payload).

Further reading and primary sources