JSON to Parquet: Convert with Python (pandas & PyArrow)

Converting JSON to Parquet with Python reduces file size by 5–10× and speeds up analytical queries by 10–100× — Parquet's columnar storage and dictionary encoding compress repetitive string values far better than JSON's verbose text format. The fastest conversion path is pandas: pd.read_json('data.json').to_parquet('data.parquet', compression='snappy') — Snappy compression gives 3–5× size reduction with fast decompression. For files larger than available RAM, use PyArrow's streaming writer with batch sizes of 100,000 rows to keep memory constant. This guide covers single-file conversion with pandas, streaming large JSON files with PyArrow, nested JSON flattening before conversion, schema inference vs explicit schema, compression options (Snappy, gzip, zstd), and reading Parquet back to JSON. Before converting, use Jsonic to validate and inspect your JSON to catch structural issues early.

Need to inspect or format your JSON before converting to Parquet? Jsonic's JSON formatter validates and prettifies any JSON instantly.

Open JSON Formatter

Why convert JSON to Parquet — 5 numbers that matter

Parquet is a columnar binary format designed for analytical workloads. JSON is a row-oriented text format designed for data interchange. The performance difference is dramatic for the right workloads: a 1 GB JSON file storing 10 million records with 20 columns becomes roughly 80–200 MB as Parquet with Snappy, and a query that reads 2 of those 20 columns only scans 10% of the Parquet file — the other 18 columns are never touched. Here are the 5 numbers to know before choosing a format:

Format | 1 GB JSON equivalent | Column scan (2 of 20 cols) | Best for
JSON | 1 GB | Full file (1 GB) | APIs, human reading, streaming records
Parquet + Snappy | 100–200 MB | 10–20 MB (columnar skip) | Analytics, Spark, DuckDB, Athena
Parquet + Gzip | 80–130 MB | 10–20 MB (2× slower decomp) | Cold storage, archival
Parquet + Zstd | 90–150 MB | 10–20 MB (near-Snappy speed) | Balance of size and speed
JSONL (uncompressed) | ~1 GB | Full file | Streaming ingestion, append-only logs

Dictionary encoding is why Parquet is so effective on string columns: if a country column repeats "US" across 9 million of 10 million rows, Parquet stores the string once in a dictionary and uses a 1-byte integer index for each row. JSON repeats the full string 9 million times. For high-cardinality columns (UUIDs, free-text), the gain is smaller — Parquet still wins on columnar skip but not on dictionary encoding. See the JSONL format guide for the append-only streaming alternative that keeps the JSON text format.
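
To see which encodings the writer actually chose, you can inspect the file's metadata with PyArrow — a quick sanity check, with 'data.parquet' standing in for your own file:

import pyarrow.parquet as pq

# Print the encodings and compressed size of each column in the first row group;
# dictionary-encoded string columns show PLAIN_DICTIONARY or RLE_DICTIONARY
meta = pq.ParquetFile('data.parquet').metadata
row_group = meta.row_group(0)
for i in range(row_group.num_columns):
    col = row_group.column(i)
    print(col.path_in_schema, col.encodings, f"{col.total_compressed_size:,} bytes")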

Convert JSON to Parquet with pandas (single file, fits in RAM)

pandas is the fastest path for files under a few hundred MB. It reads JSON into a DataFrame, infers column types, and delegates Parquet writing to PyArrow under the hood. Install dependencies once:

pip install pandas pyarrow

For a standard JSON array file (an array of objects at the top level):

import pandas as pd

# Read JSON array → DataFrame → Parquet with Snappy compression
df = pd.read_json('data.json')
df.to_parquet('data.parquet', compression='snappy')

print(f"Rows: {len(df):,}, Columns: {len(df.columns)}")
print(df.dtypes)  # verify inferred types

For a JSON Lines file (.jsonl — one JSON object per line), add lines=True:

df = pd.read_json('data.jsonl', lines=True)
df.to_parquet('data.parquet', compression='snappy')

To write multiple row groups (useful for large DataFrames and better query pushdown), set row_group_size:

df.to_parquet(
    'data.parquet',
    compression='snappy',
    row_group_size=500_000,   # 500k rows per row group
    engine='pyarrow',
)

Row groups allow query engines to skip entire row groups using Parquet's min/max statistics — if you query WHERE date > '2024-01-01' and a row group's max date is before that threshold, the engine skips the entire group without decompressing it. Sorting your DataFrame by the most common filter column before writing dramatically improves this skipping. To convert the result back to a pandas DataFrame, use pd.read_parquet('data.parquet').
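
A sketch of the effect, continuing with the df from above (the 'date' column name is a placeholder): sort on the filter column before writing, then read back with a row-group filter so the reader can use the min/max statistics.

import pandas as pd

# Sorting by the common filter column tightens each row group's min/max range,
# so predicate pushdown can skip whole groups
df = df.sort_values('date')
df.to_parquet('data.parquet', compression='snappy', row_group_size=500_000)

# filters= pushes the predicate down to the Parquet reader (pyarrow engine)
recent = pd.read_parquet(
    'data.parquet',
    filters=[('date', '>', pd.Timestamp('2024-01-01'))],
)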

Stream large JSON files to Parquet with PyArrow (out-of-core conversion)

For JSON Lines files larger than available RAM, use PyArrow's ParquetWriter with batched reads. Peak memory stays at roughly BATCH_SIZE × average_row_bytes regardless of total file size — for 100,000 rows at 500 bytes each, that is 50 MB. Install PyArrow:

pip install pyarrow

import json
import pyarrow as pa
import pyarrow.parquet as pq

BATCH_SIZE = 100_000   # rows per write batch — tune to available RAM
INPUT_FILE = 'large_data.jsonl'
OUTPUT_FILE = 'large_data.parquet'

writer = None
batch = []

with open(INPUT_FILE, 'r', encoding='utf-8') as f:
    for line in f:
        batch.append(json.loads(line))

        if len(batch) >= BATCH_SIZE:
            table = pa.Table.from_pylist(batch)
            if writer is None:
                writer = pq.ParquetWriter(OUTPUT_FILE, table.schema, compression='snappy')
            writer.write_table(table)
            batch.clear()

    # flush remaining rows
    if batch:
        table = pa.Table.from_pylist(batch)
        if writer is None:
            writer = pq.ParquetWriter(OUTPUT_FILE, table.schema, compression='snappy')
        writer.write_table(table)

if writer:
    writer.close()   # writes Parquet footer — required for valid file

print(f"Written to {OUTPUT_FILE}")

The first batch determines the schema — if later batches have different types on the same column (e.g., sometimes an integer, sometimes a string), PyArrow raises an ArrowInvalid error. Prevent this with explicit schema inference on a sample before writing (see the schema-sampling sketch at the end of this section and the explicit-schema section below), or cast inconsistent columns to string in the batch processing loop. For regular JSON arrays (not JSONL), use ijson for streaming:

pip install ijson

import ijson
import pyarrow as pa
import pyarrow.parquet as pq

writer = None
batch = []
with open('large_array.json', 'rb') as f:
    # ijson.items yields each top-level array element lazily
    for item in ijson.items(f, 'item'):
        batch.append(item)
        if len(batch) >= BATCH_SIZE:   # same flush logic as the JSONL loop
            table = pa.Table.from_pylist(batch)
            if writer is None:
                writer = pq.ParquetWriter('large_array.parquet', table.schema, compression='snappy')
            writer.write_table(table)
            batch.clear()
# then flush the final partial batch and close the writer, as in the JSONL example above

Batch sizes between 100,000 and 500,000 rows balance memory pressure and write throughput — smaller batches create more row groups (better pushdown) but higher per-batch overhead; larger batches reduce overhead but increase peak memory. For a deeper look at reading JSON into Python data structures, see parsing JSON in Python.
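
To guard against the schema-drift failure described above, one option (a sketch, not the only approach) is to infer a schema from a sample of rows and lock every batch to it:

import json
import pyarrow as pa

# Hypothetical helper: infer the schema from the first N rows of a JSONL file,
# then pass it to every from_pylist call so the writer's schema never drifts
def infer_schema(path, sample_rows=10_000):
    sample = []
    with open(path, 'r', encoding='utf-8') as f:
        for i, line in enumerate(f):
            if i >= sample_rows:
                break
            sample.append(json.loads(line))
    return pa.Table.from_pylist(sample).schema

schema = infer_schema('large_data.jsonl')
# in the write loop: table = pa.Table.from_pylist(batch, schema=schema)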

Flatten nested JSON before converting to Parquet

Parquet supports nested structures (structs and lists), but most SQL query engines work best with flat schemas. pandas.json_normalize() flattens nested objects into separator-joined column names in one line, saving you from writing a recursive flattening function by hand. A nested record like {"user": {"id": 1, "name": "Alice"}, "event": "click"} becomes columns user_id, user_name, event with sep='_' (the default separator is a dot).

import json
import pandas as pd
from pandas import json_normalize

# Load JSON array
with open('nested.json') as f:
    records = json.load(f)

# Flatten nested objects — sep="_" avoids dot in column names
df = json_normalize(records, sep='_')

print(df.columns.tolist())
# ['user_id', 'user_name', 'address_city', 'address_zip', 'event']

df.to_parquet('flat.parquet', compression='snappy')

For nested arrays (a field that is a list of objects), use record_path to explode the array into rows and meta to keep parent fields as columns:

# Each record: {"order_id": 1, "date": "...", "items": [{"sku": "A", "qty": 2}, ...]}
df = json_normalize(
    records,
    record_path='items',         # explode this list into rows
    meta=['order_id', 'date'],   # keep these parent fields per row
    sep='_',
)
# Columns: sku, qty, order_id, date
df.to_parquet('orders_exploded.parquet', compression='snappy')

Fields with mixed types (a column that is sometimes a string, sometimes a dict) must be normalized before Parquet conversion. Cast them with df['field'] = df['field'].astype(str) or use a pre-processing step to serialize the dicts to JSON strings. See the flatten JSON Python guide for detailed strategies covering deep nesting, null handling, and performance at scale.
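
A minimal sketch of that pre-processing step, assuming a column named field that mixes plain strings with dicts:

import json
import pandas as pd

# Serialize dict/list values to JSON strings so the column holds one consistent type
df = pd.DataFrame({'field': ['plain string', {'nested': True}]})
df['field'] = df['field'].apply(
    lambda v: json.dumps(v) if isinstance(v, (dict, list)) else v
)
print(df['field'].tolist())  # ['plain string', '{"nested": true}']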

Schema inference vs explicit PyArrow schema

pandas and PyArrow both infer column types automatically from the data. Inference works well for clean, consistent JSON but breaks down on mixed-type columns (e.g., a field that is sometimes "123" as a string and sometimes 123 as an integer) — pandas loads such a column as object dtype, which either fails at Parquet write time or, depending on version, ends up stored as an untyped binary column rather than a typed one, losing query performance. An explicit schema prevents this at the cost of more upfront code.

import pyarrow as pa
import pyarrow.parquet as pq

# Define explicit schema — all types declared upfront
schema = pa.schema([
    pa.field('user_id',    pa.int64()),
    pa.field('name',       pa.string()),
    pa.field('score',      pa.float64()),
    pa.field('active',     pa.bool_()),
    pa.field('created_at', pa.timestamp('ms')),
    pa.field('tags',       pa.list_(pa.string())),
])

# Convert records using explicit schema
import pandas as pd
df = pd.read_json('data.json')

# Cast DataFrame to match schema before writing
table = pa.Table.from_pandas(df, schema=schema)
pq.write_table(table, 'typed.parquet', compression='snappy')

PyArrow raises ArrowInvalid immediately if a column value cannot be cast to the declared type — this surfaces data quality issues at write time rather than at query time. Common type pitfalls from JSON: integers read as float64 when any value in the column is null (pandas promotes the column to float64 so nulls can be stored as NaN — cast to the nullable 'Int64' dtype to keep integers); ISO 8601 date strings not automatically parsed as timestamps (pass convert_dates=['col'] to pd.read_json() or cast with pd.to_datetime() before converting to PyArrow). To inspect your JSON structure and types before writing conversion code, use Python JSON parsing to load a sample and check field types interactively.
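
A short sketch of both fixes, with user_id and created_at as placeholder column names:

import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 2, None],          # None forces float64 under default inference
    'created_at': ['2024-01-01T00:00:00Z', '2024-01-02T09:30:00Z', None],
})
df['user_id'] = df['user_id'].astype('Int64')         # nullable integer → Parquet int64
df['created_at'] = pd.to_datetime(df['created_at'])  # ISO 8601 strings → timestamps
print(df.dtypes)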

Parquet compression options: Snappy, Gzip, Zstd, LZ4

Parquet compression is applied per column chunk, not per file, so different columns can use different codecs — pandas applies one codec to every column via the compression parameter, while PyArrow's write_table also accepts a per-column mapping (example below). The compression parameter in pandas/PyArrow accepts 'snappy', 'gzip', 'zstd', 'lz4', or 'none'.

Codec | Size vs JSON | Read speed | Write speed | Use case
Snappy | 3–5× smaller | Fastest | Fast | Default — hot/warm analytical data
Gzip | 5–8× smaller | 2× slower than Snappy | 3× slower | Cold storage, S3 cost reduction
Zstd | 4–7× smaller | Near Snappy | Near Gzip | Best balance of size and speed
LZ4 | 2–4× smaller | Fastest (faster than Snappy) | Fastest | Intermediate pipeline files
None | 2–4× smaller (encoding only) | Fastest reads | Fastest writes | Benchmarking, very fast SSDs

# Snappy (default, recommended)
df.to_parquet('data_snappy.parquet', compression='snappy')

# Gzip — better compression, slower reads
df.to_parquet('data_gzip.parquet', compression='gzip')

# Zstd — tune compression level 1-22 (default 3)
import pyarrow as pa
import pyarrow.parquet as pq
table = pa.Table.from_pandas(df)
pq.write_table(table, 'data_zstd.parquet',
               compression='zstd',
               compression_level=9)   # higher = smaller, slower
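
PyArrow's write_table also accepts a dict mapping column names to codecs, so a highly compressible column can use a heavier codec than the rest — a sketch continuing from the table above, with placeholder column names:

# Per-column codecs — 'payload' compresses harder, 'country' stays fast to read
pq.write_table(table, 'mixed_codecs.parquet',
               compression={'country': 'snappy', 'payload': 'zstd'})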

For AWS Athena and S3-hosted Parquet, Snappy is the documented recommendation because Athena charges per byte scanned — Snappy's smaller file size reduces cost, and its fast decompression means query CPU time stays low. For Google BigQuery external tables, Snappy or Gzip are both supported; Snappy is faster to query. Zstd support was added in PyArrow 1.0 and is available in all modern Spark/Databricks versions — use Zstd level 3 as a drop-in Snappy replacement when you need 20–30% better compression with no meaningful read speed penalty.
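
When in doubt, benchmark on your own data — a minimal sketch that writes the same synthetic table with each codec and compares file sizes (file names are placeholders):

import os
import pyarrow as pa
import pyarrow.parquet as pq

# A low-cardinality string column, where dictionary encoding + codec choice matters most
table = pa.table({'country': ['US'] * 900_000 + ['CA'] * 100_000})

for codec in ['snappy', 'gzip', 'zstd', 'lz4', 'none']:
    path = f'bench_{codec}.parquet'
    pq.write_table(table, path, compression=codec)
    print(f"{codec:>7}: {os.path.getsize(path):,} bytes")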

Read Parquet back to JSON in Python

After converting JSON to Parquet, you will often need to read it back — for API responses, debugging, or passing to a downstream JSON consumer. pandas and PyArrow both support this in 2–3 lines. The key parameter is orient in df.to_json(), which controls the output shape.

import pandas as pd

# Read Parquet → DataFrame
df = pd.read_parquet('data.parquet')

# Write to JSON array of objects (most common)
df.to_json('output.json', orient='records', indent=2)

# Write to JSON Lines (one object per line)
df.to_json('output.jsonl', orient='records', lines=True)

# Read specific columns only (faster — columnar skip)
df_cols = pd.read_parquet('data.parquet', columns=['user_id', 'name', 'score'])
print(df_cols.to_json(orient='records'))

With PyArrow directly — useful when you want to avoid pandas overhead for large files:

import pyarrow.parquet as pq
import json

# Read full table
table = pq.read_table('data.parquet')

# Convert to list of dicts and serialize
records = table.to_pylist()               # list of dicts
with open('output.json', 'w') as f:
    # default=str converts datetime values that json.dump cannot serialize natively
    json.dump(records, f, indent=2, default=str)

# Stream row groups to avoid loading all into memory
parquet_file = pq.ParquetFile('large_data.parquet')
with open('output.jsonl', 'w') as out:
    for batch in parquet_file.iter_batches(batch_size=100_000):
        for row in batch.to_pylist():
            out.write(json.dumps(row, default=str) + '\n')

When reading back, pandas maps Parquet types to pandas dtypes: int64 stays int64, string becomes object, and timestamp[ms] becomes datetime64[ms] on pandas 2.0+ (earlier pandas versions upcast to datetime64[ns]). Timestamps are serialized as epoch milliseconds by to_json() by default — pass date_format='iso' to get ISO 8601 strings. See the JSON to DataFrame guide for more DataFrame ↔ JSON patterns.
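
For example, a one-liner sketch using the file from above:

import pandas as pd

# ISO 8601 timestamps instead of epoch milliseconds in the JSON output
df = pd.read_parquet('data.parquet')
df.to_json('output_iso.json', orient='records', date_format='iso', indent=2)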

Frequently asked questions

How do I convert a JSON file to Parquet with Python?

The fastest path for files that fit in RAM is 2 lines of pandas: import pandas as pd, then pd.read_json('data.json').to_parquet('data.parquet', compression='snappy'). pandas reads the JSON into a DataFrame, infers column types, and writes Parquet via PyArrow under the hood. Install with: pip install pandas pyarrow. For JSON Lines files (.jsonl — one object per line), use pd.read_json('data.jsonl', lines=True).to_parquet('data.parquet'). For files larger than available RAM, use PyArrow's streaming writer with batched reads — read the JSON in chunks of 100,000 rows, convert each chunk to a PyArrow Table with pa.Table.from_pylist(batch), and write incrementally with pq.ParquetWriter. This keeps memory usage constant regardless of input size. Always validate your JSON first using Jsonic's JSON formatter to ensure the file is well-formed before passing it to pandas.

What is the best compression for JSON to Parquet conversion?

Snappy is the default and best choice for most use cases: it gives 3–5× size reduction over raw JSON and decompresses faster than any other Parquet codec, which matters for query throughput in Spark, Athena, and DuckDB. Gzip compresses 5–8× smaller than JSON but reads 2× slower than Snappy — use it for cold storage or archival data that is queried rarely. Zstd is a strong middle ground: compression ratio close to Gzip with decompression speed close to Snappy; it is the best choice when both storage cost and query speed matter. Set Zstd level 3 (default) for Snappy-level speed with 20–30% better compression. LZ4 is the fastest to compress and decompress — use it for intermediate pipeline files that will be rewritten. Uncompressed Parquet is still 2–4× smaller than equivalent JSON due to columnar and dictionary encoding alone, even without a compression codec — so any Parquet is better than JSON for storage efficiency on analytical data.

How do I convert a large JSON file to Parquet without loading it all into memory?

Use PyArrow's ParquetWriter with batched reads. For a JSON Lines file, read it line by line using Python's built-in json module, accumulate rows into a list, and flush to Parquet every 100,000 rows. When the batch reaches the target size, call pa.Table.from_pylist(batch) to build a PyArrow Table, pass it to writer.write_table(table), then clear the batch. Open the writer on the first flush using the schema from the first batch. Close the writer after the final flush — this writes the Parquet file footer which is required for any query engine to read the file. Peak memory is roughly batch_size × avg_row_bytes: 100,000 rows at 500 bytes each uses about 50 MB. For regular JSON arrays (not JSONL), use the ijson library (pip install ijson) which yields array elements one at a time from an array without loading the full file: ijson.items(f, 'item') streams each top-level array element lazily. Batch sizes between 100,000 and 500,000 rows balance memory pressure and Parquet row group efficiency.

How do I handle nested JSON when converting to Parquet?

Parquet supports nested structures natively (structs and lists map to Parquet group and repeated fields), so PyArrow can write nested JSON as-is. However, most SQL query engines work best with flat schemas. Use pandas.json_normalize(records, sep='_') to flatten nested objects into columns with underscore-separated names — for example, {"address": {"city": "Austin"}} becomes a column named address_city. For nested arrays (a field containing a list of objects), use the record_path parameter to explode the array into one row per element and meta to preserve parent fields as columns on each row. Fields with mixed types (sometimes a string, sometimes a dict) must be normalized before Parquet conversion — cast them to string with df['field'] = df['field'].astype(str) or serialize dicts to JSON strings with df['field'].apply(json.dumps). See the flatten JSON Python guide for complete strategies covering deep nesting, null handling, and large-scale performance.

How do I read a Parquet file back to JSON in Python?

With pandas, reading back is 2 lines: df = pd.read_parquet('data.parquet'), then df.to_json('output.json', orient='records', indent=2) writes a JSON array of objects. The orient parameter controls shape: 'records' gives [{"col": "val"}, ...] (most common), 'split' gives {"columns": [...], "index": [...], "data": [...]}. For JSON Lines output, pass lines=True: df.to_json('output.jsonl', orient='records', lines=True). For selective column reads (much faster on wide tables), use pd.read_parquet('data.parquet', columns=['col1', 'col2']) — Parquet's columnar storage means unselected columns are never read from disk, giving 10–100× faster reads for narrow queries vs JSON which must parse every field on every row. Timestamps are serialized as epoch milliseconds by default — pass date_format='iso' to get ISO 8601 strings. See the JSON to DataFrame guide for more DataFrame-to-JSON output options.

When should I use Parquet instead of JSON?

Use Parquet when: (1) you run analytical queries that scan specific columns across many rows — Parquet reads only the queried columns, giving 10–100× faster scans than JSON which must parse every field on every row; (2) you need to store data long-term at scale — Parquet is 5–10× smaller than equivalent JSON; (3) you use a data processing framework (Spark, Dask, DuckDB, Athena, BigQuery) — all support Parquet natively and much faster than JSON; (4) your schema is stable and well-typed — Parquet enforces column types, preventing silent type mismatches. Keep JSON when: data is small (under a few MB) and human readability matters; you need to stream or append records one at a time without reading the whole file (use JSONL instead); your downstream consumers are web APIs or browsers expecting JSON; the schema changes frequently per record. A common production pattern is to accept JSON as input, convert to Parquet for storage and analytics, and convert back to JSON only for API responses — using Parquet as the source of truth for the data layer.

Definitions

Columnar storage
A file layout where all values for a single column are stored contiguously on disk, rather than row by row. Columnar storage enables query engines to read only the columns needed for a query, skipping irrelevant data entirely — the key reason Parquet is 10–100× faster than JSON for analytical column scans.
Parquet
An open-source columnar binary file format for Hadoop, Spark, and cloud data warehouses, defined by the Apache Parquet project. Parquet stores data in row groups (horizontal partitions), column chunks (vertical slices), and pages (encoded data blocks), with min/max statistics per row group for predicate pushdown.
Snappy compression
A fast lossless compression algorithm developed by Google, optimized for speed rather than maximum compression ratio. Snappy gives 3–5× size reduction on typical Parquet data and decompresses faster than Gzip or Zstd, making it the default codec for hot analytical data in most Parquet workflows.
Dictionary encoding
A Parquet encoding where unique values in a column are stored once in a dictionary, and each row stores only an integer index into that dictionary. Highly effective for low-cardinality string columns (country codes, status fields) — reduces storage from bytes-per-value to bits-per-value for repeated strings.
PyArrow
The Python binding for Apache Arrow, a cross-language columnar in-memory data format. PyArrow is the engine behind pandas Parquet I/O (via engine='pyarrow') and provides its own Parquet reader/writer API for fine-grained control over schema, compression, row groups, and streaming writes.
Schema inference
The automatic detection of column data types from sample data. pandas and PyArrow infer types when no explicit schema is provided — convenient for clean data but unreliable for mixed-type columns or sparse JSON where null-heavy columns may be inferred as float64 instead of int64.

Ready to convert JSON to Parquet?

Start by validating and inspecting your JSON structure with Jsonic's formatter, then run the pandas or PyArrow code above. For more Python JSON data patterns, see the JSON to DataFrame guide and the parse JSON in Python guide.

Open JSON Formatter