JSON Lines (NDJSON) in Python: Read, Write, and Stream .jsonl Files

JSON Lines (also called NDJSON — Newline Delimited JSON) stores one JSON object per line in a plain text file, enabling streaming processing without loading the entire file into memory. Python's built-in json module reads JSON Lines with a simple loop: for line in file: obj = json.loads(line.strip()). Writing is equally direct: file.write(json.dumps(obj) + "\n"). The jsonlines package on PyPI adds a cleaner API: jsonlines.open("file.jsonl") returns an iterable reader that yields Python dicts. JSON Lines is a standard format for machine learning datasets, log aggregation (Elasticsearch, Fluentd), and LLM training data — the Pile and many Hugging Face datasets ship as .jsonl. A 1 GB JSON Lines file can be processed in constant memory by reading one line at a time. Python's pandas.read_json("file.jsonl", lines=True) loads JSON Lines into a DataFrame with a single call. This guide covers reading, writing, streaming, the jsonlines package, error handling, and JSON to pandas DataFrame integration.

Need to validate or inspect individual JSON Lines records? Jsonic's formatter handles one record at a time instantly.

Open JSON Formatter

Read JSON Lines with the Built-in json Module

Open with open("file.jsonl", encoding="utf-8") and iterate line by line — this is the fastest way to read an NDJSON/JSON Lines format file without installing any extra package. strip() removes the trailing \n. Skip empty lines with if not line.strip(): continue. For large files, this processes in O(1) memory — Python's file iterator reads one line at a time, so only the current line is held in memory. A 10 GB file consumes roughly the same peak memory as a 10-byte file.

import json

def read_jsonl(path: str) -> list[dict]:
    """Read all records from a JSON Lines file.

    Returns a list of dicts. Skips empty lines.
    Logs and skips malformed lines instead of crashing.
    """
    records = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            line = line.strip()
            if not line:          # skip blank lines
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as e:
                print(f"Line {i}: skipping malformed JSON — {e}")
    return records

# Usage
records = read_jsonl("data.jsonl")
print(f"Loaded {len(records)} records")   # e.g. Loaded 50000 records

3 things to note: (1) encoding="utf-8" is explicit — always specify it to avoid OS-dependent encoding surprises on Windows; (2) enumerate(f, start=1) gives 1-based line numbers that match the file's line count as reported by wc -l; (3) catching json.JSONDecodeError per line lets the loop continue past bad records, which is the correct behavior for production pipelines where 1 in 10,000 lines may be corrupt. To learn more about how to parse JSON in Python generally, see that dedicated guide.

Write JSON Lines with json.dumps

Write JSON Lines records with f.write(json.dumps(record, ensure_ascii=False) + "\n") — one line per record, newline-terminated. Open in "w" to create or overwrite, or "a" to append to an existing file without reading it first. No library beyond the built-in json module is required. A 100,000-record dataset typically writes to disk in a couple of seconds on modern hardware.

import json

def write_jsonl(records: list[dict], path: str, append: bool = False) -> None:
    """Write a list of dicts to a JSON Lines file.

    append=True adds to the end of an existing file.
    ensure_ascii=False preserves Unicode (Chinese, Arabic, emoji, etc.)
    separators=(",", ":") produces compact JSON — no extra spaces.
    """
    mode = "a" if append else "w"
    with open(path, mode, encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False, separators=(",", ":")) + "\n")

# Create a new file
records = [
    {"id": 1, "name": "Alice", "city": "上海"},
    {"id": 2, "name": "Bob",   "city": "New York"},
]
write_jsonl(records, "people.jsonl")

# Append 1 more record later — without touching the existing data
new_record = {"id": 3, "name": "Charlie", "city": "London"}
write_jsonl([new_record], "people.jsonl", append=True)

ensure_ascii=False is critical for multilingual data — without it, Chinese characters like 上海 are written as \u4e0a\u6d77, doubling file size and making the output unreadable in a text editor. separators=(",", ":") removes the default spaces after commas and colons, reducing file size by roughly 15% for typical record shapes. See the json.dumps() in Python guide for a full breakdown of every parameter including sort_keys, default for custom serializers, and indent.
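
A quick way to see what these two parameters change, using a throwaway record defined inline:

import json

record = {"id": 1, "city": "上海"}
print(json.dumps(record))                              # {"id": 1, "city": "\u4e0a\u6d77"}
print(json.dumps(record, ensure_ascii=False))          # {"id": 1, "city": "上海"}
print(json.dumps(record, ensure_ascii=False, separators=(",", ":")))   # {"id":1,"city":"上海"}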

The jsonlines Package

pip install jsonlines adds a cleaner API than raw json for production code. The jsonlines.open() context manager handles encoding, newlines, and buffer flushing automatically — 3 things you must manage manually with the built-in module. For scripts and one-off data processing, plain json is sufficient. For production pipelines that read and write JSON Lines files daily, jsonlines noticeably reduces boilerplate.

# pip install jsonlines
import jsonlines

# ── Reading ───────────────────────────────────────────────────────────
with jsonlines.open("data.jsonl") as reader:
    for obj in reader:          # yields one dict per line
        process(obj)

# Read all into a list
with jsonlines.open("data.jsonl") as reader:
    records = list(reader)      # 50,000 records in ~0.3s for a 100 MB file

# Read a single record (useful for peeking at the schema)
with jsonlines.open("data.jsonl") as reader:
    first = reader.read()       # reads exactly 1 record, leaves cursor at line 2

# ── Writing ───────────────────────────────────────────────────────────
records = [{"id": 1, "val": "a"}, {"id": 2, "val": "b"}]

# Write all at once
with jsonlines.open("output.jsonl", mode="w") as writer:
    writer.write_all(records)

# Write records one at a time (e.g. from a generator)
with jsonlines.open("output.jsonl", mode="w") as writer:
    for record in generate_records():
        writer.write(record)    # one record per call; pass flush=True to jsonlines.open() to flush after each write

# Append mode
with jsonlines.open("output.jsonl", mode="a") as writer:
    writer.write({"id": 3, "val": "c"})

Compare the two approaches: jsonlines is cleaner for production code because the context manager guarantees the file is closed and flushed even if an exception occurs mid-write. Plain json is fine for scripts where you control the full lifecycle. The jsonlines package raises jsonlines.InvalidLineError for malformed lines; like json.JSONDecodeError it subclasses ValueError, so a handler that catches ValueError covers both. Both approaches iterate in O(1) memory — neither loads the entire file at once.
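
If you would rather skip malformed lines than handle the exception yourself, the package's reader also exposes an iter() method with a skip_invalid flag (verify against the version you have installed); a minimal sketch:

import jsonlines

# Skip malformed lines instead of raising InvalidLineError.
# skip_invalid is a flag on Reader.iter(); assumes a recent jsonlines release.
with jsonlines.open("data.jsonl") as reader:
    for obj in reader.iter(skip_invalid=True):
        process(obj)    # process() is whatever per-record handler you use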

Streaming Large JSON Lines Files

For files over 1 GB, process line-by-line to avoid out-of-memory errors — a 10 GB JSON Lines file read line-by-line uses under 10 MB of RAM regardless of file size. The generator pattern is the idiomatic Python approach: yield one parsed object at a time so the caller controls how many records are held in memory at once. Use itertools.islice to take the first N records without reading the entire file, and multiprocessing.Pool for parallel processing of large batches. A single core can typically process hundreds of thousands of records per second for simple transformations.

import json
import itertools
from typing import Generator

# ── Generator: yields 1 record at a time, O(1) memory ─────────────────
def iter_jsonl(path: str) -> Generator[dict, None, None]:
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Take first 1,000 records without reading the whole file
first_1000 = list(itertools.islice(iter_jsonl("data.jsonl"), 1000))
print(f"First record: {first_1000[0]}")

# Count total records in a 10 GB file — reads line by line
total = sum(1 for _ in iter_jsonl("large.jsonl"))
print(f"Total records: {total}")

# ── Parallel processing with multiprocessing ──────────────────────────
import multiprocessing

def process_batch(batch: list[dict]) -> list[dict]:
    """Transform a batch of records — runs in a worker process."""
    return [{"id": r["id"], "value": r["value"] * 2} for r in batch]

def chunked(iterable, size: int):
    it = iter(iterable)
    while chunk := list(itertools.islice(it, size)):
        yield chunk

if __name__ == "__main__":   # guard is required when workers are spawned (the default on Windows/macOS)
    with multiprocessing.Pool(processes=4) as pool:
        # imap consumes the chunk generator lazily; map() would materialize every chunk first
        results = list(pool.imap(process_batch, chunked(iter_jsonl("large.jsonl"), 10_000)))

# wc -l counts records in a JSONL file (each line = 1 record)
# $ wc -l data.jsonl
# 50000 data.jsonl

itertools.islice(iter_jsonl("data.jsonl"), 1000) reads only the first 1,000 lines; the file is closed as soon as the abandoned generator is garbage-collected, which CPython does promptly. This makes it useful for schema inspection or sampling. The walrus operator := in chunked() requires Python 3.8+. The if __name__ == "__main__": guard matters on Windows and macOS, where worker processes are started by re-importing the module. For parallel processing, chunk sizes of around 10,000 records typically give good throughput by amortizing process-pool overhead across enough work per task. Always profile your specific workload — CPU-bound transformations benefit from multiprocessing, while I/O-bound tasks (writing to a database per record) benefit from asyncio or thread pools instead; see the sketch below.
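
A minimal sketch of the thread-pool variant for I/O-bound work, assuming the iter_jsonl() and chunked() helpers defined above and a hypothetical save_to_db() function standing in for your real per-record I/O:

from concurrent.futures import ThreadPoolExecutor

def save_to_db(record: dict) -> None:
    """Hypothetical I/O-bound task, e.g. an HTTP request or a database insert."""
    ...

# Submit one bounded chunk at a time so memory stays flat even for huge files.
# Uses iter_jsonl() and chunked() from the streaming example above.
with ThreadPoolExecutor(max_workers=16) as executor:
    for batch in chunked(iter_jsonl("large.jsonl"), 1_000):
        list(executor.map(save_to_db, batch))   # blocks until the whole batch is done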

JSON Lines with pandas

pd.read_json("file.jsonl", lines=True) loads all records into a DataFrame in a single call — the fastest path from a .jsonl file to tabular analysis. Lines with differing keys are unioned into columns, with NaN filled in for missing values; nested objects are kept as dict-valued columns, so flatten them with pd.json_normalize(). For files over 500 MB, use chunksize to iterate over DataFrames of fixed row counts instead of loading everything into memory at once.

import pandas as pd

# ── Load entire file into a DataFrame ─────────────────────────────────
df = pd.read_json("data.jsonl", lines=True)
print(df.shape)          # (50000, 8)  — 50k rows, 8 columns
print(df.dtypes)
print(df.head(3))

# ── Write a DataFrame back to JSON Lines ──────────────────────────────
df.to_json(
    "output.jsonl",
    orient="records",    # one JSON object per line
    lines=True,          # newline-delimited (required for JSONL)
    force_ascii=False,   # preserve Unicode
)

# ── Chunked reading for large files ───────────────────────────────────
# Returns a JsonReader iterator — each chunk is a DataFrame
chunk_iter = pd.read_json("large.jsonl", lines=True, chunksize=10_000)

results = []
for i, chunk in enumerate(chunk_iter):
    # chunk has at most 10,000 rows
    filtered = chunk[chunk["score"] > 0.5]
    results.append(filtered)
    print(f"Chunk {i}: {len(chunk)} rows, {len(filtered)} kept")

final_df = pd.concat(results, ignore_index=True)
print(f"Kept {len(final_df)} total records")

# ── Mixed schemas: use json_normalize ─────────────────────────────────
import json

with open("mixed.jsonl", encoding="utf-8") as f:
    raw = [json.loads(line) for line in f if line.strip()]
df_normalized = pd.json_normalize(raw, sep="_")   # flattens nested keys
print(df_normalized.columns.tolist())

chunksize=10_000 returns a JsonReader iterator — each iteration yields a DataFrame with at most 10,000 rows, keeping peak memory under control regardless of file size. See the full guide on JSON to pandas DataFrame for advanced normalization patterns including nested arrays, missing keys, and type coercion. pd.json_normalize() is essential when nested JSON objects (e.g. {"user": {"id": 1, "name": "Alice"}}) need to be flattened into columns like user_id and user_name.
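
A small illustration of what that flattening produces, using made-up records shaped like the nested example above:

import pandas as pd

nested = [
    {"user": {"id": 1, "name": "Alice"}, "score": 0.9},
    {"user": {"id": 2, "name": "Bob"},   "score": 0.4},
]
flat = pd.json_normalize(nested, sep="_")
print(flat.columns.tolist())   # includes 'score', 'user_id', 'user_name'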

Key Terms

JSON Lines (.jsonl)
A text file format where each line contains exactly one valid JSON value (typically an object), terminated by a newline character, enabling streaming access to large datasets without loading the entire file.
NDJSON (Newline Delimited JSON)
An alternative name for the JSON Lines format, governed by the ndjson.org specification; it is identical to .jsonl and uses the MIME type application/x-ndjson.
json.loads()
A function in Python's built-in json module that parses a JSON string and returns the equivalent Python object (dict, list, str, int, float, bool, or None); raises json.JSONDecodeError if the input is not valid JSON.
json.dumps()
A function in Python's built-in json module that serializes a Python object to a JSON-formatted string; accepts parameters including ensure_ascii, separators, indent, and default for custom serialization of non-standard types.
jsonlines package
A third-party Python library (installable via pip install jsonlines) that provides jsonlines.open() as a context manager for reading and writing JSON Lines files, handling encoding, newlines, and flushing automatically.
Generator (Python)
A function that uses yield to produce values one at a time, enabling O(1) memory processing of arbitrarily large sequences by only holding one value in memory at a time — the idiomatic pattern for streaming large JSON Lines files.
pandas chunksize
A parameter of pd.read_json() that causes the function to return a JsonReader iterator instead of a DataFrame; each iteration yields a DataFrame with at most chunksize rows, enabling memory-efficient processing of files too large to fit in RAM.

Frequently asked questions

How do I read a JSON Lines file in Python?

Use the built-in json module with a line-by-line loop: with open("file.jsonl") as f: records = [json.loads(line) for line in f if line.strip()]. For cleaner syntax, install the jsonlines package with pip install jsonlines and use with jsonlines.open("file.jsonl") as reader: records = list(reader). Both approaches process in streaming fashion — no need to load the whole file. The built-in approach requires no dependencies and is best for scripts; the jsonlines package is cleaner for production pipelines. To learn more about parsing JSON generally, see how to parse JSON in Python.

What is the difference between JSON Lines and regular JSON files?

A JSON file stores a single JSON value (usually an array or object). A JSON Lines file stores multiple independent JSON objects, one per line. JSON Lines is better for streaming: you can read, append, and process records one at a time without parsing the whole file. Regular JSON arrays require reading the complete file before parsing — a 5 GB JSON array requires 5 GB of RAM to load, while a 5 GB JSON Lines file can be processed in under 50 MB of RAM using a generator. JSON Lines also supports concatenation with a simple shell command: cat a.jsonl b.jsonl > combined.jsonl. See the NDJSON/JSON Lines format guide for a complete spec breakdown.

How do I append records to a JSON Lines file?

Open with mode="a" (append): with open("file.jsonl", "a") as f: f.write(json.dumps(record) + "\n"). This appends without reading the existing file — the OS seeks to the end of the file and writes the new line. JSON Lines files are safe to append because each line is independent. Contrast with JSON arrays, which require reading and rewriting the entire file to add an element. With the jsonlines package, use jsonlines.open("file.jsonl", mode="a") for the same behavior with automatic flushing. Appending is safe even if the previous write was interrupted — the worst case is one malformed last line that you can skip with error handling.

How do I handle malformed lines in a JSON Lines file?

Wrap json.loads() in a try/except: try: obj = json.loads(line) except json.JSONDecodeError as e: print(f"Skipping line {i}: {e}"); continue. Track line numbers with enumerate(f) starting from 1. Log errors to a separate file for review rather than printing to stdout in production, as sketched below. The jsonlines package raises jsonlines.InvalidLineError for the same purpose; like json.JSONDecodeError it is a ValueError subclass, so catch ValueError if your pipeline mixes the two approaches. Common causes of malformed lines: interrupted writes (partial records), concatenating files with trailing non-JSON content, and CSV-to-JSONL conversion scripts that miss escaping.
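
A minimal sketch of that logging pattern; the function name and errors.log path are chosen here for illustration:

import json

def read_jsonl_logging_errors(path: str, error_path: str = "errors.log") -> list[dict]:
    """Parse a JSON Lines file, writing malformed lines to a side file for review."""
    records = []
    with open(path, encoding="utf-8") as f, open(error_path, "w", encoding="utf-8") as errs:
        for i, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as e:
                errs.write(f"line {i}: {e}: {line}\n")   # keep the raw line for inspection
    return records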

Can I use pandas to read a JSON Lines file?

Yes. pd.read_json("file.jsonl", lines=True) reads the entire file into a DataFrame in one call; records with differing keys become columns with NaN for the missing values. For large files, use chunksize=10000: for chunk in pd.read_json("large.jsonl", lines=True, chunksize=10000): process(chunk). Write back with df.to_json("output.jsonl", orient="records", lines=True). For nested or messy schemas, read manually and use pd.json_normalize(). See the full JSON to pandas DataFrame guide for advanced patterns.

What file extension should I use for JSON Lines files?

.jsonl is the most common extension, used by Hugging Face datasets, most ML tools, and the majority of data engineering pipelines. .ndjson (Newline Delimited JSON) is equally valid and preferred in some ecosystems like Elasticsearch and Fluentd. .jl is sometimes used in the scrapy web scraping framework. All 3 refer to exactly the same format — one JSON object per line, newline-terminated. The official MIME type is application/x-ndjson. When in doubt, use .jsonl — it is the most widely recognized extension and is understood by pandas, Hugging Face, and most cloud data services without configuration.

Ready to work with JSON Lines in Python?

Use Jsonic's JSON Formatter to validate individual records from your .jsonl file before processing. Paste a single line to check it is valid JSON and inspect its structure.

Open JSON Formatter