Stream JSON Large Files: Node.js and Python Without Memory Overflow
Processing JSON files larger than your available RAM requires streaming — reading the file in chunks rather than loading the entire document into memory at once. Files above 50 MB frequently cause heap-out-of-memory errors in Node.js and Python; the stream-json library (Node.js) parses JSON in O(1) memory regardless of file size by emitting token events instead of building a parse tree. This guide covers streaming JSON in Node.js (stream-json, native readline), Python (ijson, json.JSONDecoder), performance benchmarks, error handling mid-stream, and when JSONL is a better alternative.
Need to inspect or validate a large JSON file before streaming? Jsonic's formatter handles large inputs directly in your browser.
Open JSON Formatter →
Why large JSON files crash your process — and what streaming fixes
When you call JSON.parse(fs.readFileSync('data.json', 'utf8')) in Node.js, the runtime reads the entire file into a string buffer (1× file size), then the V8 JSON parser builds a full object graph in the heap (typically 3–10× file size). A 200 MB JSON file of flat objects may use 600 MB–2 GB of heap during parsing. Node.js's default heap limit is approximately 1.5 GB on 64-bit systems; Python has no hard limit but competes with the OS for RAM. The result is an out-of-memory crash with no partial result to recover from.
Streaming solves this by processing the file as a sequence of tokens — the parser never holds more than a configurable buffer (typically 64 KB) in memory at once. Objects are assembled one at a time and handed to your callback, which can write them to a database, transform them, or discard them before the next object arrives. Peak memory stays at O(1) relative to file size — roughly the size of a single record plus the parser buffer, regardless of whether the file is 50 MB or 50 GB.
The cost is throughput: SAX-style JSON parsers emit an event for every token (string start, string character, number, object start, array start, etc.), so a 200 MB file may emit several hundred million events. Benchmark numbers on a modern laptop show stream-json at ~80 MB/s versus JSON.parse at ~280 MB/s — a 3.5× gap. For files that would crash without streaming, this tradeoff is irrelevant: a slow parse is infinitely faster than a crash.
Streaming JSON arrays in Node.js with stream-json
The stream-json npm package is the most complete SAX-style JSON streaming library for Node.js. It parses JSON in O(1) memory using a finite-state machine tokenizer, and provides high-level streamers that reassemble individual array elements or object values from the token stream. Install it once:
npm install stream-json
For a top-level JSON array — the most common large-file pattern — use StreamArray. Each data event yields a single { index, value } pair where value is a fully parsed JavaScript object:
import { createReadStream } from 'fs'
import { withParser } from 'stream-json/streamers/StreamArray.js'
const pipeline = createReadStream('large-data.json').pipe(withParser())
pipeline.on('data', ({ index, value }) => {
  // value is a single parsed object — process it here
  console.log(index, value.id)
})
pipeline.on('error', (err) => {
  console.error('Parse error:', err.message)
})
pipeline.on('end', () => {
  console.log('Done streaming')
})
For a JSON object where one key holds a large array (e.g., { "users": [ ... ] }), use Pick to navigate to the array first, then pipe through StreamArray:
import { createReadStream } from 'fs'
import { parser } from 'stream-json'
import { pick } from 'stream-json/filters/Pick.js'
import { streamArray } from 'stream-json/streamers/StreamArray.js'
import { chain } from 'stream-chain'
const pipeline = chain([
  createReadStream('large-data.json'),
  parser(),
  pick({ filter: 'users' }), // navigate to the "users" key
  streamArray(), // emit one user object at a time
])
pipeline.on('data', ({ value }) => processUser(value))
pipeline.on('error', (err) => console.error(err))
Both approaches keep heap usage under 5 MB for files of any size, as long as the individual record objects are not themselves enormous (over several MB each). For a 2 GB file with 10 million records, peak memory stays under 20 MB in benchmarks.
Streaming JSONL in Node.js with readline — 5× faster than stream-json
If your large file is in JSONL format (one JSON object per line), Node.js's built-in readline module is 5–10× faster than stream-json and requires zero dependencies. readline splits the byte stream on newlines and calls JSON.parse() on each line independently — each parse is tiny and fast because individual records are typically 1–5 KB.
import { createReadStream } from 'fs'
import { createInterface } from 'readline'
async function processJsonl(filePath) {
  const fileStream = createReadStream(filePath)
  const rl = createInterface({ input: fileStream, crlfDelay: Infinity })
  let lineNumber = 0
  for await (const line of rl) {
    lineNumber++
    if (!line.trim()) continue // skip blank lines
    try {
      const obj = JSON.parse(line)
      // process obj here
    } catch (err) {
      console.error(`Line ${lineNumber}: parse error — ${err.message}`)
      // continue to next line — errors are isolated per line
    }
  }
}
await processJsonl('data.jsonl')
On an M2 MacBook Pro, this pattern processes a 1 GB JSONL file with 5 million records in approximately 8 seconds (~125 MB/s), compared to ~35 seconds for the same data as a single JSON array streamed via stream-json. The readline approach also gives precise line-number error reporting for free, making it straightforward to debug malformed records. For reading standard JSON files in JavaScript, see the dedicated guide.
Streaming JSON in Python with ijson
ijson is the standard SAX-style JSON streaming library for Python. It provides three abstraction levels: ijson.items() for assembling objects at a given path (most convenient), ijson.parse() for raw token events (most flexible), and ijson.kvitems() for object key-value pairs. Install with pip:
pip install ijson
To iterate over a top-level JSON array, use ijson.items(f, 'item') — the prefix 'item' is ijson's convention for top-level array elements:
import ijson
def process_large_json(filepath):
    with open(filepath, 'rb') as f:  # open in binary mode for ijson
        for obj in ijson.items(f, 'item'):
            # obj is a fully parsed Python dict — process it here
            print(obj['id'])
process_large_json('large-data.json')
For a nested array (e.g., {"users": [...]}), change the prefix to match the key path. Dot notation navigates nested objects; item after the path targets array elements:
import ijson
with open('data.json', 'rb') as f:
    for user in ijson.items(f, 'users.item'):
        process_user(user)  # streams users array without loading the root object
ijson achieves ~100–200 MB/s using its YAJL C extension backend (installed automatically on most platforms). The pure-Python fallback runs at ~20–40 MB/s. To verify which backend is active: import ijson; print(ijson.backend). For more on parsing JSON in Python generally, including json.loads() and json.load(), see the full guide.
Python JSONL streaming and incremental json.JSONDecoder
For JSONL files in Python, the fastest approach uses a plain file iterator — Python's file objects are line-buffered by default, so iterating over a file object reads one line at a time without loading the whole file:
import json
def process_jsonl(filepath):
    with open(filepath, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            try:
                obj = json.loads(line)
                # process obj here
            except json.JSONDecodeError as e:
                print(f"Line {line_num}: {e}")
                continue  # isolated error — skip and continue
process_jsonl('data.jsonl')
For streaming a standard JSON file without an external library, the standard library's json.JSONDecoder.raw_decode() can parse objects from a growing string buffer. This is more complex to implement but works without dependencies:
import json
def stream_json_objects(filepath, buffer_size=65536):
    decoder = json.JSONDecoder()
    buffer = ''
    with open(filepath, 'r', encoding='utf-8') as f:
        while True:
            chunk = f.read(buffer_size)
            if not chunk:
                break
            buffer += chunk
            # skip leading whitespace / array brackets
            buffer = buffer.lstrip(' \t\r\n[,')
            while buffer:
                try:
                    obj, idx = decoder.raw_decode(buffer)
                    yield obj
                    buffer = buffer[idx:].lstrip(' \t\r\n,]')
                except json.JSONDecodeError:
                    break  # need more data in buffer
The raw_decode() approach works but is fragile for deeply nested objects that span multiple buffer reads. For production use, prefer ijson. The JSONL approach with plain iteration is preferred for any pipeline you control — see the JSONL format guide for format details and conversion tools.
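A minimal driver for the generator above might look like the following sketch; large-data.json is a placeholder path and the loop body stands in for real per-record work:
count = 0
for obj in stream_json_objects('large-data.json'):
    count += 1  # replace with real per-record processing (DB insert, transform, etc.)
print(f"processed {count} records")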
Performance benchmarks: streaming vs in-memory vs JSONL
The table below shows measured throughput for a 500 MB JSON array of flat objects (1 million records, ~500 bytes each) on an M2 MacBook Pro with 16 GB RAM. "Memory" is peak RSS during processing. All numbers are approximate — your hardware and record structure will differ.
| Method | Throughput | Peak memory | Works on 10 GB file? |
|---|---|---|---|
| Node.js JSON.parse() | ~280 MB/s | ~2.5 GB | No (OOM crash) |
| Node.js stream-json StreamArray | ~80 MB/s | ~15 MB | Yes |
| Node.js readline (JSONL) | ~130 MB/s | ~12 MB | Yes |
| Python json.load() | ~300–400 MB/s | ~3 GB | No (OOM crash) |
| Python ijson (YAJL backend) | ~100–200 MB/s | ~18 MB | Yes |
| Python for line in file (JSONL) | ~220 MB/s | ~10 MB | Yes |
Key takeaways: JSONL line-by-line is 1.5–2× faster than SAX streaming JSON for the same data; in-memory parsing is fastest but does not scale; peak memory for all streaming approaches stays under 20 MB. If you can choose your data format, convert to JSONL once using jq — the jq filter examples guide shows how to convert a JSON array to JSONL with jq -c '.[]'. After conversion, processing speed nearly doubles while memory stays flat.
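If jq is not available, the same one-time conversion can be done in Python with a single streaming pass through ijson. The sketch below assumes a top-level JSON array; the file names are placeholders:
import json
import ijson

# Streaming conversion: JSON array -> JSONL, constant memory regardless of file size.
with open('large-array.json', 'rb') as src, open('large-array.jsonl', 'w', encoding='utf-8') as dst:
    for record in ijson.items(src, 'item'):
        # ijson may yield decimal.Decimal for floats; default=float keeps them as JSON numbers
        dst.write(json.dumps(record, default=float) + '\n')
After conversion, the line-by-line patterns shown earlier apply directly.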
Error handling mid-stream: catching and recovering from parse failures
A corrupt byte in a 2 GB JSON file should not abort the entire job. Streaming parsers expose error location information that in-memory parsers cannot — because they fail immediately at the corrupt position rather than after fully loading the file.
In Node.js with stream-json, the error event carries an offset property indicating the byte position of the failure. Always attach the handler before piping:
import { createReadStream } from 'fs'
import { withParser } from 'stream-json/streamers/StreamArray.js'
const pipeline = createReadStream('data.json').pipe(withParser())
pipeline.on('error', (err) => {
  // err.offset is the byte position in the file where parsing failed
  const offsetMB = (err.offset / 1024 / 1024).toFixed(2)
  console.error(`JSON parse error at byte ${err.offset} (~${offsetMB} MB): ${err.message}`)
  // Extract the region: dd if=data.json bs=1 skip=$((offset-100)) count=200 | xxd
})
In Python with ijson, wrap the iteration in a try/except ijson.JSONError block. The exception includes line and column information:
import ijson
with open('data.json', 'rb') as f:
    try:
        for obj in ijson.items(f, 'item'):
            process(obj)
    except ijson.JSONError as e:
        print(f"Parse error: {e}")
        # the exception message includes the line and column where parsing failed
For JSONL files, errors are naturally isolated: a bad line causes a JSON.parse() / json.loads() exception for that line only. Log the line number and content, then continue to the next line. This isolation is one of JSONL's most practical advantages over monolithic JSON for large pipelines. You can inspect the corrupt line with Jsonic's JSON formatter to see the exact syntax error.
When JSONL is a better alternative to streaming JSON
JSONL (also called NDJSON — Newline Delimited JSON) stores one JSON value per line. It is 5–10× faster to process than streaming a JSON array, supports append without rewriting the file, and allows parallel processing by splitting the file at line boundaries. Choose JSONL over streaming JSON when you meet any of the following criteria:
- You control the data format — new data pipelines should use JSONL by default
- You need append semantics — writing new records is a single file.write(json.dumps(obj) + '\n')
- You need parallel processing — split the file with split -l 100000 and process chunks concurrently (see the sketch after this list)
- You need grep/awk on records — each line is independently greppable: grep '"status":"error"' events.jsonl
- Throughput matters — JSONL is 5–10× faster than SAX streaming JSON for the same data
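As a rough illustration of the parallel-processing point, the sketch below assumes the file was already split with split -l 100000 events.jsonl chunk_, then counts error records across the chunks with a small worker pool. The chunk naming, pool size, and status field are illustrative, not part of any library API:
import json
from glob import glob
from multiprocessing import Pool

def count_errors(chunk_path):
    # Each worker reads its chunk independently; every JSONL line is a complete record.
    errors = 0
    with open(chunk_path, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip() and json.loads(line).get('status') == 'error':
                errors += 1
    return errors

if __name__ == '__main__':
    chunks = sorted(glob('chunk_*'))  # files produced by: split -l 100000 events.jsonl chunk_
    with Pool(processes=4) as pool:   # pool size is an arbitrary choice
        print(sum(pool.map(count_errors, chunks)))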
Convert an existing JSON array to JSONL in one command using jq (see jq filter examples):
# Convert JSON array file to JSONL
jq -c '.[]' large-array.json > large-array.jsonl
# Stream and filter in one pass — no intermediate file needed
jq -c '.[] | select(.status == "active")' large-array.json
Use streaming JSON (not JSONL) when: the source file is provided by a third party in standard JSON format; the file contains a single deeply nested object rather than a top-level array; or you need to handle arbitrary JSON structure that cannot be represented as line-delimited records. For a full comparison of NDJSON and JSON formats, see the JSONL format guide.
Key terms: SAX parser, DOM parser, backpressure, NDJSON, event-based parsing
Understanding these 5 terms helps you choose the right streaming strategy and interpret library documentation correctly.
- SAX parser
- Simple API for XML — a parsing model originally from XML that has been adapted for JSON. A SAX parser reads the input sequentially and fires callbacks (events) for each token: object start, key, string value, number, array start, etc. It never builds a complete document in memory.
stream-json and ijson are SAX-style JSON parsers.
- DOM parser
- Document Object Model parser — builds a complete in-memory tree representation of the entire document before returning any result.
JSON.parse() and json.load() are DOM parsers. Fast and simple, but they require peak memory proportional to document size.
- Backpressure
- A flow-control mechanism in Node.js streams. When a downstream consumer (e.g., a database write) is slower than the upstream producer (e.g., a file read at 500 MB/s), backpressure signals the producer to pause, preventing memory from filling with unprocessed data. Node.js pipe() handles backpressure automatically;
stream-json respects it. Without backpressure, streaming would still exhaust memory — just more slowly.
- NDJSON
- Newline Delimited JSON — a synonym for JSONL. Each line is a complete, valid JSON value followed by a newline character (U+000A). NDJSON and JSONL are interchangeable terms; the format is described in the JSON Lines specification (jsonlines.org) and used in log streaming systems like Docker and Kubernetes.
- Event-based parsing
- A parsing model where the parser emits named events (or calls registered callbacks) as it encounters each structural element — object open, key, value, object close, array open, array close. Your code registers listeners for the events it cares about and ignores the rest. This is the underlying model for both SAX parsers and Node.js EventEmitter-based APIs like
stream-json.
Frequently asked questions
What is the maximum JSON file size I can parse without streaming?
In practice, files above 50 MB frequently cause heap-out-of-memory errors in Node.js (default ~1.5 GB heap) and Python. A 200 MB JSON file may consume 600 MB–2 GB of heap once parsed because both JSON.parse() and json.load() build a full parse tree in memory — typically 3–10× the raw file size. Files below 20 MB are always safe; files between 20–50 MB should be profiled; files above 500 MB require streaming unconditionally. Node.js's --max-old-space-size=4096 flag can increase heap as a temporary workaround, but streaming is the only reliable solution for files of unknown or unbounded size. Use Jsonic's JSON formatter to inspect and validate files before building a streaming pipeline.
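One way to do that profiling in Python is with the standard library's tracemalloc module. This is a rough sketch; data.json is a placeholder path:
import json
import tracemalloc

def peak_parse_memory_mb(path):
    # Measures how much Python heap json.load() needs for this specific file,
    # which is the number that decides whether streaming is required.
    tracemalloc.start()
    with open(path, 'r', encoding='utf-8') as f:
        data = json.load(f)  # keep the result in scope so the peak reflects the full object graph
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / (1024 * 1024)

print(f"peak parse memory: {peak_parse_memory_mb('data.json'):.1f} MB")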
How do I stream a JSON array in Node.js without loading the whole file into memory?
Use the stream-json library's StreamArray class: createReadStream(filePath).pipe(withParser()) emits one { index, value } event per array element. Each value is a fully parsed JavaScript object; memory stays at O(1) regardless of array length. For a 2 GB file with 10 million records, peak heap is under 20 MB. Always attach an error event handler before piping to catch malformed JSON. For JSONL files, Node.js's built-in readline module is 5–10× faster with no dependencies. For more on reading JSON files in JavaScript, see the full guide.
What is the difference between streaming JSON and processing JSONL line by line?
Streaming JSON uses a SAX-style tokenizer that handles any valid JSON structure incrementally but carries the overhead of a full state-machine running on every byte. Processing JSONL uses a simple newline splitter and calls JSON.parse() on each line independently — 5–10× faster because each line parse is a tiny, independent operation. The tradeoff is format: streaming JSON works on any valid JSON file; JSONL requires one JSON value per line. For new pipelines, use JSONL. Convert existing JSON arrays with jq -c '.[]' file.json > file.jsonl. See the JSONL format guide for full specification details.
How do I handle parse errors mid-stream when reading a large JSON file?
In Node.js with stream-json, attach pipeline.on('error', handler) — the error object includes a byte offset property showing where parsing failed in the file. In Python with ijson, catch ijson.JSONError which includes line and column information. For JSONL files, errors are isolated per line — catch JSON.parse() exceptions individually, log the line number, and continue to the next line. Never silently swallow stream errors. Log the byte offset and extract the corrupt region with dd or a hex editor, then use Jsonic's formatter to identify the exact syntax issue.
Which Python library is best for streaming large JSON files?
ijson is the best general-purpose library. Use ijson.items(f, 'item') for top-level arrays, ijson.items(f, 'key.item') for nested arrays. It achieves 100–200 MB/s with its YAJL C extension backend. For JSONL files, plain Python for line in open(file) with json.loads(line) is faster (~220 MB/s) and requires no dependencies. The standard library's json.JSONDecoder.raw_decode() can stream without external libraries but requires manual buffer management — only use it if you cannot install ijson. For more on parsing JSON in Python generally, see the full guide.
Is streaming JSON always slower than loading the entire file at once?
Yes — SAX streaming is ~3.5× slower than JSON.parse() (Node.js) and 1.5–3× slower than json.load() (Python) when memory is not the constraint. The penalty comes from per-token event overhead versus a single optimized C parse pass. However, for files that would crash an in-memory parser, the comparison is moot — a slow parse beats a crash. For the 20–500 MB range, measure your process's actual memory headroom before choosing. If you can use JSONL, the readline approach recovers most of the speed penalty while keeping memory flat. For high-frequency pipelines where both speed and memory matter, jq handles streaming JSON transformations at the command line with an optimized C implementation.
Ready to work with large JSON files?
Use Jsonic's JSON Formatter to validate and inspect JSON files before streaming, or explore the JSONL format guide to convert your data to a faster, append-friendly format.
Open JSON Formatter →