Stream JSON Large Files: Node.js and Python Without Memory Overflow
Processing JSON files larger than your available RAM requires streaming — reading the file in chunks rather than loading the entire document into memory at once. Files above 50 MB frequently cause heap-out-of-memory errors in Node.js and Python; the stream-json library (Node.js) parses JSON in O(1) memory regardless of file size by emitting token events instead of building a parse tree. This guide covers streaming JSON in Node.js (stream-json, native readline), Python (ijson, json.JSONDecoder), performance benchmarks, error handling mid-stream, and when JSONL is a better alternative.
Need to inspect or validate a large JSON file before streaming? Jsonic's formatter handles large inputs directly in your browser.
Open JSON Formatter →
Why large JSON files crash your process — and what streaming fixes
When you call JSON.parse(fs.readFileSync('data.json', 'utf8')) in Node.js, the runtime reads the entire file into a string buffer (1× file size), then the V8 JSON parser builds a full object graph in the heap (typically 3–10× file size). A 200 MB JSON file of flat objects may use 600 MB–2 GB of heap during parsing. Node.js's default heap limit is approximately 1.5 GB on 64-bit systems; Python has no hard limit but competes with the OS for RAM. The result is an out-of-memory crash with no partial result to recover from.
Streaming solves this by processing the file as a sequence of tokens — the parser never holds more than a configurable buffer (typically 64 KB) in memory at once. Objects are assembled one at a time and handed to your callback, which can write them to a database, transform them, or discard them before the next object arrives. Peak memory stays at O(1) relative to file size — roughly the size of a single record plus the parser buffer, regardless of whether the file is 50 MB or 50 GB.
The cost is throughput: SAX-style JSON parsers emit an event for every token (string start, string character, number, object start, array start, etc.), so a 200 MB file may emit several hundred million events. Benchmark numbers on a modern laptop show stream-json at ~80 MB/s versus JSON.parse at ~280 MB/s — a 3.5× gap. For files that would crash without streaming, this tradeoff is irrelevant: a slow parse is infinitely faster than a crash.
Streaming JSON arrays in Node.js with stream-json
The stream-json npm package is the most complete SAX-style JSON streaming library for Node.js. It parses JSON in O(1) memory using a finite-state machine tokenizer, and provides high-level streamers that reassemble individual array elements or object values from the token stream. Install it once:
npm install stream-json
For a top-level JSON array — the most common large-file pattern — use StreamArray. Each data event yields a single { index, value } pair where value is a fully parsed JavaScript object:
import { createReadStream } from 'fs'
import { withParser } from 'stream-json/streamers/StreamArray.js'
const pipeline = createReadStream('large-data.json').pipe(withParser())
pipeline.on('data', ({ index, value }) => {
  // value is a single parsed object — process it here
  console.log(index, value.id)
})
pipeline.on('error', (err) => {
  console.error('Parse error:', err.message)
})
pipeline.on('end', () => {
  console.log('Done streaming')
})
For a JSON object where one key holds a large array (e.g., { "users": [ ... ] }), use Pick to navigate to the array first, then pipe through StreamArray:
import { createReadStream } from 'fs'
import { parser } from 'stream-json'
import { pick } from 'stream-json/filters/Pick.js'
import { streamArray } from 'stream-json/streamers/StreamArray.js'
import { chain } from 'stream-chain'
const pipeline = chain([
  createReadStream('large-data.json'),
  parser(),
  pick({ filter: 'users' }), // navigate to the "users" key
  streamArray(), // emit one user object at a time
])
pipeline.on('data', ({ value }) => processUser(value))
pipeline.on('error', (err) => console.error(err))
Both approaches keep heap usage under 5 MB for files of any size, as long as the individual record objects are not themselves enormous (over several MB each). For a 2 GB file with 10 million records, peak memory stays under 20 MB in benchmarks.
Streaming JSONL in Node.js with readline — 5× faster than stream-json
If your large file is in JSONL format (one JSON object per line), Node.js's built-in readline module is 5–10× faster than stream-json and requires zero dependencies. readline splits the byte stream on newlines and calls JSON.parse() on each line independently — each parse is tiny and fast because individual records are typically 1–5 KB.
import { createReadStream } from 'fs'
import { createInterface } from 'readline'
async function processJsonl(filePath) {
  const fileStream = createReadStream(filePath)
  const rl = createInterface({ input: fileStream, crlfDelay: Infinity })
  let lineNumber = 0
  for await (const line of rl) {
    lineNumber++
    if (!line.trim()) continue // skip blank lines
    try {
      const obj = JSON.parse(line)
      // process obj here
    } catch (err) {
      console.error(`Line ${lineNumber}: parse error — ${err.message}`)
      // continue to next line — errors are isolated per line
    }
  }
}
await processJsonl('data.jsonl')
On an M2 MacBook Pro, this pattern processes a 1 GB JSONL file with 5 million records in approximately 8 seconds (~125 MB/s), compared to ~35 seconds for the same data as a single JSON array streamed via stream-json. The readline approach also gives precise line-number error reporting for free, making it straightforward to debug malformed records. For reading standard JSON files in JavaScript, see the dedicated guide.
Streaming JSON in Python with ijson
ijson is the standard SAX-style JSON streaming library for Python. It provides three abstraction levels: ijson.items() for assembling objects at a given path (most convenient), ijson.parse() for raw token events (most flexible), and ijson.kvitems() for object key-value pairs. Install with pip:
pip install ijson
To iterate over a top-level JSON array, use ijson.items(f, 'item') — the prefix 'item' is ijson's convention for top-level array elements:
import ijson
def process_large_json(filepath):
    with open(filepath, 'rb') as f:  # open in binary mode for ijson
        for obj in ijson.items(f, 'item'):
            # obj is a fully parsed Python dict — process it here
            print(obj['id'])
process_large_json('large-data.json')
For a nested array (e.g., {"users": [...]}), change the prefix to match the key path. Dot notation navigates nested objects; item after the path targets array elements:
import ijson
with open('data.json', 'rb') as f:
    for user in ijson.items(f, 'users.item'):
        process_user(user)  # streams users array without loading the root object
ijson achieves ~100–200 MB/s using its YAJL C extension backend (installed automatically on most platforms). The pure-Python fallback runs at ~20–40 MB/s. To verify which backend is active: import ijson; print(ijson.backend). For more on parsing JSON in Python generally, including json.loads() and json.load(), see the full guide.
Python JSONL streaming and incremental json.JSONDecoder
For JSONL files in Python, the fastest approach uses a plain file iterator — Python's file objects are line-buffered by default, so iterating over a file object reads one line at a time without loading the whole file:
import json
def process_jsonl(filepath):
    with open(filepath, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            try:
                obj = json.loads(line)
                # process obj here
            except json.JSONDecodeError as e:
                print(f"Line {line_num}: {e}")
                continue  # isolated error — skip and continue
process_jsonl('data.jsonl')
For streaming a standard JSON file without an external library, the standard library's json.JSONDecoder.raw_decode() can parse objects from a growing string buffer. This is more complex to implement but works without dependencies:
import json
def stream_json_objects(filepath, buffer_size=65536):
    decoder = json.JSONDecoder()
    buffer = ''
    with open(filepath, 'r', encoding='utf-8') as f:
        while True:
            chunk = f.read(buffer_size)
            if not chunk:
                break
            buffer += chunk
            # skip leading whitespace / array brackets
            buffer = buffer.lstrip(' \t\r\n[,')
            while buffer:
                try:
                    obj, idx = decoder.raw_decode(buffer)
                    yield obj
                    buffer = buffer[idx:].lstrip(' \t\r\n,]')
                except json.JSONDecodeError:
                    break  # need more data in buffer
The raw_decode() approach works but is fragile for deeply nested objects that span multiple buffer reads. For production use, prefer ijson. The JSONL approach with plain iteration is preferred for any pipeline you control — see the JSONL format guide for format details and conversion tools.
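A minimal driver for the generator above might look like the following sketch; large-data.json is a placeholder path and the loop body stands in for real per-record work:
count = 0
for obj in stream_json_objects('large-data.json'):
    count += 1  # replace with real per-record processing (DB insert, transform, etc.)
print(f"processed {count} records")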
Performance benchmarks: streaming vs in-memory vs JSONL
The table below shows measured throughput for a 500 MB JSON array of flat objects (1 million records, ~500 bytes each) on an M2 MacBook Pro with 16 GB RAM. "Memory" is peak RSS during processing. All numbers are approximate — your hardware and record structure will differ.
| Method | Throughput | Peak memory | Works on 10 GB file? |
|---|---|---|---|
| Node.js JSON.parse() | ~280 MB/s | ~2.5 GB | No (OOM crash) |
| Node.js stream-json StreamArray | ~80 MB/s | ~15 MB | Yes |
| Node.js readline (JSONL) | ~130 MB/s | ~12 MB | Yes |
| Python json.load() | ~300–400 MB/s | ~3 GB | No (OOM crash) |
| Python ijson (YAJL backend) | ~100–200 MB/s | ~18 MB | Yes |
| Python for line in file (JSONL) | ~220 MB/s | ~10 MB | Yes |
Key takeaways: JSONL line-by-line is 1.5–2× faster than SAX streaming JSON for the same data; in-memory parsing is fastest but does not scale; peak memory for all streaming approaches stays under 20 MB. If you can choose your data format, convert to JSONL once using jq — the jq filter examples guide shows how to convert a JSON array to JSONL with jq -c '.[]'. After conversion, processing speed nearly doubles while memory stays flat.
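If jq is not available, the same one-time conversion can be done in Python with a single streaming pass through ijson. The sketch below assumes a top-level JSON array; the file names are placeholders:
import json
import ijson

# Streaming conversion: JSON array -> JSONL, constant memory regardless of file size.
with open('large-array.json', 'rb') as src, open('large-array.jsonl', 'w', encoding='utf-8') as dst:
    for record in ijson.items(src, 'item'):
        # ijson may yield decimal.Decimal for floats; default=float keeps them as JSON numbers
        dst.write(json.dumps(record, default=float) + '\n')
After conversion, the line-by-line patterns shown earlier apply directly.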
Error handling mid-stream: catching and recovering from parse failures
A corrupt byte in a 2 GB JSON file should not abort the entire job. Streaming parsers expose error location information that in-memory parsers cannot — because they fail immediately at the corrupt position rather than after fully loading the file.
In Node.js with stream-json, the error event carries an offset property indicating the byte position of the failure. Always attach the handler before piping:
import { createReadStream } from 'fs'
import { withParser } from 'stream-json/streamers/StreamArray.js'
const pipeline = createReadStream('data.json').pipe(withParser())
pipeline.on('error', (err) => {
  // err.offset is the byte position in the file where parsing failed
  const offsetMB = (err.offset / 1024 / 1024).toFixed(2)
  console.error(`JSON parse error at byte ${err.offset} (~${offsetMB} MB): ${err.message}`)
  // Extract the region: dd if=data.json bs=1 skip=$((offset-100)) count=200 | xxd
})
In Python with ijson, wrap the iteration in a try/except ijson.JSONError block. The exception includes line and column information:
import ijson
with open('data.json', 'rb') as f:
    try:
        for obj in ijson.items(f, 'item'):
            process(obj)
    except ijson.JSONError as e:
        print(f"Parse error: {e}")
        # the exception message includes the line and column where parsing failed
For JSONL files, errors are naturally isolated: a bad line causes a JSON.parse() / json.loads() exception for that line only. Log the line number and content, then continue to the next line. This isolation is one of JSONL's most practical advantages over monolithic JSON for large pipelines. You can inspect the corrupt line with Jsonic's JSON formatter to see the exact syntax error.
When JSONL is a better alternative to streaming JSON
JSONL (also called NDJSON — Newline Delimited JSON) stores one JSON value per line. It is 5–10× faster to process than streaming a JSON array, supports append without rewriting the file, and allows parallel processing by splitting the file at line boundaries. Choose JSONL over streaming JSON when you meet any of the following criteria:
- You control the data format — new data pipelines should use JSONL by default
- You need append semantics — writing new records is a single file.write(json.dumps(obj) + '\n')
- You need parallel processing — split the file with split -l 100000 and process chunks concurrently (see the sketch after this list)
- You need grep/awk on records — each line is independently greppable: grep '"status":"error"' events.jsonl
- Throughput matters — JSONL is 5–10× faster than SAX streaming JSON for the same data
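As a rough illustration of the parallel-processing point, the sketch below assumes the file was already split with split -l 100000 events.jsonl chunk_, then counts error records across the chunks with a small worker pool. The chunk naming, pool size, and status field are illustrative, not part of any library API:
import json
from glob import glob
from multiprocessing import Pool

def count_errors(chunk_path):
    # Each worker reads its chunk independently; every JSONL line is a complete record.
    errors = 0
    with open(chunk_path, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip() and json.loads(line).get('status') == 'error':
                errors += 1
    return errors

if __name__ == '__main__':
    chunks = sorted(glob('chunk_*'))  # files produced by: split -l 100000 events.jsonl chunk_
    with Pool(processes=4) as pool:   # pool size is an arbitrary choice
        print(sum(pool.map(count_errors, chunks)))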
Convert an existing JSON array to JSONL in one command using jq (see jq filter examples):
# Convert JSON array file to JSONL
jq -c '.[]' large-array.json > large-array.jsonl
# Stream and filter in one pass — no intermediate file needed
jq -c '.[] | select(.status == "active")' large-array.json
Use streaming JSON (not JSONL) when: the source file is provided by a third party in standard JSON format; the file contains a single deeply nested object rather than a top-level array; or you need to handle arbitrary JSON structure that cannot be represented as line-delimited records. For a full comparison of NDJSON and JSON formats, see the JSONL format guide.
Key terms: SAX parser, DOM parser, backpressure, NDJSON, event-based parsing
Understanding these 5 terms helps you choose the right streaming strategy and interpret library documentation correctly.
- SAX parser
- Simple API for XML — a parsing model originally from XML that has been adapted for JSON. A SAX parser reads the input sequentially and fires callbacks (events) for each token: object start, key, string value, number, array start, etc. It never builds a complete document in memory.
stream-json and ijson are SAX-style JSON parsers.
- DOM parser
- Document Object Model parser — builds a complete in-memory tree representation of the entire document before returning any result.
JSON.parse() and json.load() are DOM parsers. Fast and simple, but they require peak memory proportional to document size.
- Backpressure
- A flow-control mechanism in Node.js streams. When a downstream consumer (e.g., a database write) is slower than the upstream producer (e.g., a file read at 500 MB/s), backpressure signals the producer to pause, preventing memory from filling with unprocessed data. Node.js pipe() handles backpressure automatically;
stream-json respects it. Without backpressure, streaming would still exhaust memory — just more slowly.
- NDJSON
- Newline Delimited JSON — a synonym for JSONL. Each line is a complete, valid JSON value followed by a newline character (U+000A). NDJSON and JSONL are interchangeable terms; the format is described in the JSON Lines specification (jsonlines.org) and used in log streaming systems like Docker and Kubernetes.
- Event-based parsing
- A parsing model where the parser emits named events (or calls registered callbacks) as it encounters each structural element — object open, key, value, object close, array open, array close. Your code registers listeners for the events it cares about and ignores the rest. This is the underlying model for both SAX parsers and Node.js EventEmitter-based APIs like
stream-json.
Frequently asked questions
What is the maximum JSON file size I can parse without streaming?
In practice, files above 50 MB frequently cause heap-out-of-memory errors in Node.js (default ~1.5 GB heap) and Python. A 200 MB JSON file may consume 600 MB–2 GB of heap once parsed because both JSON.parse() and json.load() build a full parse tree in memory — typically 3–10× the raw file size. Files below 20 MB are always safe; files between 20–50 MB should be profiled; files above 500 MB require streaming unconditionally. Node.js's --max-old-space-size=4096 flag can increase heap as a temporary workaround, but streaming is the only reliable solution for files of unknown or unbounded size. Use Jsonic's JSON formatter to inspect and validate files before building a streaming pipeline.
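One way to do that profiling in Python is with the standard library's tracemalloc module. This is a rough sketch; data.json is a placeholder path:
import json
import tracemalloc

def peak_parse_memory_mb(path):
    # Measures how much Python heap json.load() needs for this specific file,
    # which is the number that decides whether streaming is required.
    tracemalloc.start()
    with open(path, 'r', encoding='utf-8') as f:
        data = json.load(f)  # keep the result in scope so the peak reflects the full object graph
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / (1024 * 1024)

print(f"peak parse memory: {peak_parse_memory_mb('data.json'):.1f} MB")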
How do I stream a JSON array in Node.js without loading the whole file into memory?
Use the stream-json library's StreamArray class: createReadStream(filePath).pipe(withParser()) emits one { index, value } event per array element. Each value is a fully parsed JavaScript object; memory stays at O(1) regardless of array length. For a 2 GB file with 10 million records, peak heap is under 20 MB. Always attach an error event handler before piping to catch malformed JSON. For JSONL files, Node.js's built-in readline module is 5–10× faster with no dependencies. For more on reading JSON files in JavaScript, see the full guide.
What is the difference between streaming JSON and processing JSONL line by line?
Streaming JSON uses a SAX-style tokenizer that handles any valid JSON structure incrementally but carries the overhead of a full state-machine running on every byte. Processing JSONL uses a simple newline splitter and calls JSON.parse() on each line independently — 5–10× faster because each line parse is a tiny, independent operation. The tradeoff is format: streaming JSON works on any valid JSON file; JSONL requires one JSON value per line. For new pipelines, use JSONL. Convert existing JSON arrays with jq -c '.[]' file.json > file.jsonl. See the JSONL format guide for full specification details.
How do I handle parse errors mid-stream when reading a large JSON file?
In Node.js with stream-json, attach pipeline.on('error', handler) — the error object includes a byte offset property showing where parsing failed in the file. In Python with ijson, catch ijson.JSONError which includes line and column information. For JSONL files, errors are isolated per line — catch JSON.parse() exceptions individually, log the line number, and continue to the next line. Never silently swallow stream errors. Log the byte offset and extract the corrupt region with dd or a hex editor, then use Jsonic's formatter to identify the exact syntax issue.
Which Python library is best for streaming large JSON files?
ijson is the best general-purpose library. Use ijson.items(f, 'item') for top-level arrays, ijson.items(f, 'key.item') for nested arrays. It achieves 100–200 MB/s with its YAJL C extension backend. For JSONL files, plain Python for line in open(file) with json.loads(line) is faster (~220 MB/s) and requires no dependencies. The standard library's json.JSONDecoder.raw_decode() can stream without external libraries but requires manual buffer management — only use it if you cannot install ijson. For more on parsing JSON in Python generally, see the full guide.
Is streaming JSON always slower than loading the entire file at once?
Yes — SAX streaming is ~3.5× slower than JSON.parse() (Node.js) and 1.5–3× slower than json.load() (Python) when memory is not the constraint. The penalty comes from per-token event overhead versus a single optimized C parse pass. However, for files that would crash an in-memory parser, the comparison is moot — a slow parse beats a crash. For the 20–500 MB range, measure your process's actual memory headroom before choosing. If you can use JSONL, the readline approach recovers most of the speed penalty while keeping memory flat. For high-frequency pipelines where both speed and memory matter, jq handles streaming JSON transformations at the command line with an optimized C implementation.
Ready to work with large JSON files?
Use Jsonic's JSON Formatter to validate and inspect JSON files before streaming, or explore the JSONL format guide to convert your data to a faster, append-friendly format.
Open JSON Formatter →