JSON UTF-8 Encoding: \uXXXX Escapes, BOM, charset Header, and Round-Trip Safety

Last updated:

JSON's encoding rules are short but strict: RFC 8259 §8.1 mandates UTF-8 for any JSON text exchanged between systems, the \uXXXX escape syntax is optional (raw UTF-8 bytes are equally valid), surrogate pairs encode codepoints above U+FFFF, the byte order mark is forbidden in output and ignorable on input, and application/json takes no charsetparameter because the encoding is fixed by the spec. Most encoding bugs in real systems come from defaults written when JSON still tolerated UTF-16 (Python's ensure_ascii=True, PowerShell's BOM-prefixed output, the still-common charset=utf-8 parameter on Content-Type) — knowing the current rules makes it easy to spot and fix them. This page covers the spec, the escape syntax, library defaults across Python, Node.js, and jq, and a round-trip checklist for ensuring no data loss between producer and consumer.

Got a JSON file that won't parse because of encoding weirdness — BOM bytes, mojibake, or unescaped control characters? Drop it into Jsonic's JSON Validator to see the exact byte position and a human-readable error message.

Validate JSON encoding

RFC 8259: JSON is UTF-8 by default (no other encoding allowed on the wire)

RFC 8259 §8.1, the current JSON spec, states the rule in one sentence: JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8. The earlier RFC 4627 (2006) allowed UTF-8, UTF-16, and UTF-32 with a BOM-based detection scheme. RFC 7159 (2014) deprecated that scheme, and RFC 8259 (2017) made UTF-8 the only legal interchange encoding.

The "closed ecosystem" carve-out is narrow. It means: if your application writes JSON to a file and reads it back, you can use any encoding you like. The moment that file crosses to another system — a network, a tool, a different team — UTF-8 is required. In practice every modern JSON pipeline is open enough that this distinction never matters, and treating UTF-8 as a hard rule everywhere saves you from the one-day-it-broke surprise.

ECMA-404, the parallel JSON standard, agrees but states the rule more loosely: implementations are free to accept additional encodings if they want, but interoperable JSON is UTF-8. Modern parsers all default to UTF-8 and treat anything else as an error or a transcoding step.

# Identify file encoding before parsing
file --mime data.json
# data.json: application/json; charset=utf-8

# Transcode UTF-16 to UTF-8 if a legacy producer ships UTF-16
iconv -f UTF-16 -t UTF-8 legacy.json > clean.json

If a system hands you JSON in something other than UTF-8 (occasionally a Windows tool or an older Java service), transcode at the boundary. Do not try to push the non-UTF-8 bytes into a JSON parser and hope it copes.

The \uXXXX escape: when JSON parsers emit it vs raw UTF-8 bytes

JSON has two equivalent ways to represent a non-ASCII character: as raw UTF-8 bytes inside the string (the modern default), or as a \uXXXX escape — a backslash, lowercase u, and exactly four hexadecimal digits. The two forms are interchangeable on input. Every conforming parser produces the same in-memory string from either, and round-trips through different emitters may swap forms without changing meaning.

// These three JSON strings are byte-identical when parsed,
// even though the bytes on disk differ.

// 1. Raw UTF-8 (preferred — smaller, readable):
{ "name": "中文" }
// Bytes: 22 e4 b8 ad e6 96 87 22  (Chinese chars as 3-byte UTF-8 each)

// 2. \uXXXX escape (ASCII-safe):
{ "name": "\u4e2d\u6587" }
// Bytes: 22 5c 75 34 65 32 64 5c 75 36 35 38 37 22  (12 ASCII bytes)

// 3. Mixed — also valid:
{ "name": "\u4e2d文" }

When parsers emit \uXXXX:mostly when the library defaults to ASCII-safe output. Python's json.dumps escapes everything non-ASCII by default. jq --ascii-output does the same. The historical motivation was that ASCII-safe JSON survives misbehaving terminals, email systems, and tools that corrupt non-ASCII bytes — concerns that mostly disappeared by the late 2010s.

When parsers emit raw UTF-8: the modern default. Node.js JSON.stringify emits raw UTF-8. Go encoding/json emits raw UTF-8. Rust serde_json emits raw UTF-8. Python with ensure_ascii=False emits raw UTF-8. The output is smaller (one to four UTF-8 bytes per character vs six bytes per \uXXXX) and human-readable.

Both forms are fully spec-compliant. The choice is a stylistic and operational one — there is no parser that accepts one and rejects the other. For more on the broader escape rules (not just Unicode), see our JSON string escaping guide.

Surrogate pairs for codepoints above U+FFFF (😀 for 😀)

The \uXXXX escape carries exactly 16 bits, which covers the Basic Multilingual Plane (BMP) from U+0000 to U+FFFF. Codepoints above the BMP — emoji, rare CJK characters, mathematical alphanumerics — need more. JSON borrows UTF-16's surrogate-pair scheme: emit two consecutive \uXXXX escapes where the first falls in the high surrogate range \uD800–\uDBFF and the second in the low surrogate range \uDC00–\uDFFF. Parsers combine the two into the single original codepoint.

// 😀 grinning face — U+1F600

// As raw UTF-8 (4 bytes):
{ "reaction": "😀" }
// Bytes: f0 9f 98 80

// As surrogate-pair \uXXXX escapes (12 ASCII bytes):
{ "reaction": "\uD83D\uDE00" }

// Algorithm:
//   codepoint = 0x1F600
//   shifted   = codepoint - 0x10000      = 0x0F600
//   high      = (shifted >> 10) + 0xD800 = 0x03D + 0xD800 = 0xD83D
//   low       = (shifted & 0x3FF) + 0xDC00 = 0x200 + 0xDC00 = 0xDE00
//   escape    = "\uD83D\uDE00"

The surrogate range itself (U+D800 through U+DFFF) is reserved by the Unicode standard for this paired encoding. A lone surrogate (one half of a pair without its partner) is not a valid Unicode codepoint. Most JSON parsers accept lone surrogates anyway and produce a string with the unpaired codepoint — JavaScript itself permits this internally — but some strict parsers reject them. Avoid producing lone surrogates unless you specifically need them.

Common failure mode: a system reads the JSON, decodes one surrogate, then forgets to read the second before processing. The result is a string with two unpaired surrogates instead of one astral character. The fix is always at the parser — every modern library handles pairs correctly by default. See also our JSON emoji + Unicode guide for language-by-language emoji handling.

BOM (Byte Order Mark): why JSON parsers should REJECT it

The byte order mark is U+FEFF, a zero-width no-break space that, when placed at the very start of a file, encodes as the three bytes EF BB BF in UTF-8 and signals the file's encoding to text editors. RFC 8259 §8.1 forbids it in JSON output: Implementations MUST NOT add a byte order mark (U+FEFF) to the output of a JSON text. On input, ECMA-404 says parsers MAY ignore one, but the modern consensus has hardened toward rejection.

// Detect and strip BOM in Python
def load_json_strip_bom(path):
    with open(path, 'rb') as f:
        raw = f.read()
    if raw.startswith(b'\xef\xbb\xbf'):
        raw = raw[3:]
    return json.loads(raw.decode('utf-8'))

// Detect and strip BOM in Node.js
import { readFileSync } from 'node:fs'

function loadJsonStripBom(path) {
  let text = readFileSync(path, 'utf-8')
  if (text.charCodeAt(0) === 0xFEFF) {
    text = text.slice(1)
  }
  return JSON.parse(text)
}

Where BOM comes from: PowerShell's Out-File and Set-Contentdefault to UTF-8-with-BOM on Windows PowerShell 5.x. Some older Microsoft tools (Excel CSV export, Notepad before Windows 10 1903) write BOM automatically. A few Java libraries do too. The fix is at the producer — change the encoding setting to UTF-8-no-BOM — or strip the prefix at the consumer if you can't control the producer.

Parser behavior survey: Node.js JSON.parse throws on BOM. Python json.loads throws (the BOM appears as U+FEFF at the start and json fails to find a valid token there). Go encoding/json rejects. jq rejects. The browser fetch().json() rejects. Treat BOM as broken input and fix it upstream.

Content-Type: application/json — charset parameter is forbidden

The IANA media type registration for application/json explicitly lists no required or optional parameters. RFC 8259 §11 confirms this — the only legal Content-Type header for JSON is application/json with nothing after it. A charset parameter is technically a spec violation because the encoding is already fixed by the JSON format itself.

// CORRECT — spec-conformant
Content-Type: application/json

// COMMON BUT TECHNICALLY WRONG — every client tolerates it,
// but it's not what the spec says to send
Content-Type: application/json; charset=utf-8

// Express.js — res.json() ships the correct header by default
res.json({ ok: true })
// → Content-Type: application/json; charset=utf-8  ← Express adds charset

// To strip the charset parameter:
res.set('Content-Type', 'application/json').send(JSON.stringify({ ok: true }))

Why does it not matter much in practice? Every browser, every HTTP client (axios, fetch, requests, curl, OkHttp), and every API gateway parses both forms identically. The charset=utf-8 parameter is redundant — the body bytes are UTF-8 either way — but it does not break anything. Express.js, Spring Boot, and several other server frameworks add it by default. Most teams leave it alone.

Where it does matter: API contract validation tools (some OpenAPI linters flag the non-standard parameter), strict CORS preflight checks, and anything that byte-compares the header string. If you need the exact-spec value, strip the parameter at the framework layer. For more on the broader response-header story around JSON APIs, see our Content-Type handling guide.

Round-trip safety: ensuring no data loss across encode/decode cycles

A round-trip test takes a JSON value, serializes it, parses the result, and compares the parsed value to the original. Round-trip safety is the property that every encode/decode cycle produces an identical in-memory representation. Encoding bugs almost always show up as a round-trip failure: a string with an emoji becomes two surrogate halves, a BOM gets added or stripped, key ordering changes, a number loses precision.

// Python round-trip test
import json

def assert_roundtrip(value):
    serialized = json.dumps(value, ensure_ascii=False, sort_keys=True)
    parsed = json.loads(serialized)
    reserialized = json.dumps(parsed, ensure_ascii=False, sort_keys=True)
    assert serialized == reserialized, f"Round-trip mismatch:\n{serialized}\n{reserialized}"

assert_roundtrip({"emoji": "😀", "chinese": "中文", "arabic": "مرحبا"})
assert_roundtrip({"surrogate_pair": "\uD83D\uDE00"})  # same as 😀 above
assert_roundtrip({"control_chars": "tab\there"})

// Node.js round-trip test
function assertRoundtrip(value) {
  const serialized = JSON.stringify(value)
  const parsed = JSON.parse(serialized)
  const reserialized = JSON.stringify(parsed)
  if (serialized !== reserialized) {
    throw new Error(`Round-trip mismatch:\n${serialized}\n${reserialized}`)
  }
}

Sources of round-trip failure (in rough order of frequency):

  • Mixing ensure_ascii=True on one side and =False on the other — the bytes differ even though the parsed values agree, which trips byte-comparison tests.
  • BOM addition or stripping at the file layer.
  • Number precision loss for integers above 2^53 (see our BigInt guide).
  • Key reordering — strict comparison against a serialized fixture breaks if the emitter sorts keys differently from the producer.
  • Normalization-form mismatch — NFC vs NFD on accented characters, where producer and consumer disagree on whether é is one codepoint or two.
  • Whitespace differences — pretty-printed vs minified output of the same value.

For systems that need byte-identical output (signature verification, content hashing, deduplication), the answer is JSON canonicalization — a deterministic encoding spec. See our JSON canonicalization guide for the JCS spec and its tradeoffs.

Library behavior: ensure_ascii (Python), encodeForUTF8 (Java), -ascii (jq)

Encoding bugs cluster around language defaults. The same data emitted from Python, Node.js, and Java can produce three different byte sequences for the same logical JSON — all valid, all parsing back to the same in-memory value, but byte-different and therefore broken under signature verification or strict caching.

Language / ToolDefault behaviorHow to emit raw UTF-8
Python json.dumpsensure_ascii=True — escapes non-ASCIIjson.dumps(x, ensure_ascii=False)
Node.js JSON.stringifyRaw UTF-8 — no escaping(default)
Go encoding/jsonRaw UTF-8, but escapes <, >, & for HTML safetyenc.SetEscapeHTML(false)
Rust serde_jsonRaw UTF-8 — no escaping(default)
Java JacksonRaw UTF-8 — no escaping(default; FORCE_ESCAPE_NON_ASCII off)
jqRaw UTF-8 — no escaping(default; --ascii-output for legacy)
PHP json_encodeEscapes non-ASCII by defaultjson_encode(x, JSON_UNESCAPED_UNICODE)
# Python — gotcha
>>> import json
>>> json.dumps({"name": "张伟"})
'{"name": "\u5f20\u4f1f"}'                          # ASCII-only by default

>>> json.dumps({"name": "张伟"}, ensure_ascii=False)
'{"name": "张伟"}'                                    # raw UTF-8

# Node.js — does the right thing by default
> JSON.stringify({ name: "张伟" })
'{"name":"张伟"}'

# jq — defaults vs --ascii-output
$ echo '{"name":"张伟"}' | jq .
{ "name": "张伟" }                                    # raw UTF-8
$ echo '{"name":"张伟"}' | jq --ascii-output .
{ "name": "\u5f20\u4f1f" }                          # escaped

# PHP
echo json_encode(["name" => "张伟"]);                  // "name":"\u5f20\u4f1f"
echo json_encode(["name" => "张伟"], JSON_UNESCAPED_UNICODE);  // "name":"张伟"

The pattern: languages whose defaults date from the late 2000s tend to ASCII-escape by default (Python, PHP). Languages whose defaults were set after UTF-8 became universal emit raw UTF-8 (Node.js, Go, Rust, Jackson). The right choice for new code is raw UTF-8 — smaller, readable, faster to write and parse.

Common bugs: double-encoded JSON, mojibake, BOM-prefixed files

Three encoding bugs come up repeatedly in production. All three are storage and transport problems, not JSON-spec problems — the JSON itself is fine, the bytes around it are wrong.

Bug 1: Double-encoded UTF-8. The producer writes correct UTF-8 bytes. A downstream tool interprets those bytes as Latin-1 and re-encodes them as UTF-8 again. The result: each non-ASCII character now takes four bytes instead of two, and the visible text shows mojibake like é where é should appear. Detect with file or by inspecting bytes — if you see the UTF-8 sequence for U+00C3followed by another high-byte sequence where one character should be, that's double encoding.

# Inspect bytes
$ xxd data.json | head -1
00000000: 7b 22 6e 22 3a 22 c3 83 c2 a9 22 7d        {"n":"......"}
# Should be:  c3 a9  (é as UTF-8)
# Instead:    c3 83 c2 a9  (double-encoded é)

# Fix in Python
broken = open('data.json', encoding='utf-8').read()
fixed = broken.encode('latin-1').decode('utf-8')
open('fixed.json', 'w', encoding='utf-8').write(fixed)

Bug 2: Mojibake. The bytes are correct UTF-8, but a viewer or downstream system interprets them as Latin-1, Windows-1252, or Shift-JIS. The user sees garbled text but the file itself is fine. The fix is at the consumer: set the encoding explicitly.

Bug 3: BOM-prefixed files. Covered above — a producer (often PowerShell or an older Microsoft tool) prepends EF BB BF, and the parser rejects the file. Strip the prefix at the consumer or fix the producer's encoding setting.

All three are diagnosable in under a minute with xxd, file --mime, or any hex editor. The mistake teams make is debugging in their JSON parser instead of looking at the raw bytes — the JSON layer always tells you "invalid token at position N", which is true but unhelpful when the real problem is two layers down at the byte stream.

Key terms

UTF-8
A variable-width Unicode encoding: ASCII characters take 1 byte, Latin/European characters 2 bytes, most CJK and other BMP characters 3 bytes, astral-plane characters (emoji) 4 bytes. The mandatory encoding for JSON interchange per RFC 8259 §8.1.
\uXXXX escape
JSON's syntax for representing a Unicode codepoint as four hexadecimal digits preceded by \u. Encodes any BMP codepoint (U+0000 to U+FFFF). For codepoints above U+FFFF, two \uXXXX escapes form a surrogate pair.
BMP (Basic Multilingual Plane)
The first 65,536 Unicode codepoints, U+0000 to U+FFFF. Contains most living-language characters. Codepoints above the BMP — emoji, rare CJK, mathematical alphanumerics — are called "astral" or "supplementary" and require surrogate pairs in JSON's escape syntax.
surrogate pair
Two consecutive \uXXXX escapes that together encode a single codepoint above U+FFFF. The high surrogate lies in \uD800–\uDBFF and the low surrogate in \uDC00–\uDFFF. JSON parsers combine them into the original codepoint on read.
BOM (Byte Order Mark)
The Unicode character U+FEFF (encoded as EF BB BF in UTF-8) sometimes placed at the start of a text file to signal its encoding. RFC 8259 forbids BOM in JSON output; most parsers reject BOM-prefixed input.
mojibake
Garbled text produced when a string of bytes encoded in one encoding is displayed under a different encoding — for example, UTF-8 bytes shown as Windows-1252 produce sequences like é where é should appear.
ensure_ascii
A keyword argument to Python's json.dumps. When True (the default), non-ASCII characters are emitted as \uXXXX escapes. When False, they pass through as raw UTF-8 bytes. The same idea appears in PHP as JSON_UNESCAPED_UNICODE and in jq as --ascii-output.

Frequently asked questions

Is JSON always UTF-8?

On the wire, yes. RFC 8259 §8.1 states that JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8. The earlier RFC 4627 allowed UTF-16 and UTF-32 with a BOM-based detection scheme, but RFC 7159 deprecated that and RFC 8259 made UTF-8 the only legal interchange encoding. Inside a closed system (an application talking to its own files) you can technically store JSON in any encoding you like, but the moment it crosses a network or a tool boundary it must be UTF-8. Parsers are not required to support anything else, and most modern ones (Node.js JSON.parse, Python json.loads, jq, browser fetch responses) assume UTF-8 by default and will produce garbage or errors on UTF-16/UTF-32 input. If a system gives you JSON in a different encoding, transcode to UTF-8 before parsing.

Can JSON contain raw emoji and Unicode characters?

Yes. RFC 8259 explicitly allows any Unicode codepoint in a string except the control characters U+0000 through U+001F (which must be escaped) and the two characters quotation mark and reverse solidus (which must be escaped or \uXXXX-escaped). Emoji, Chinese, Arabic, Devanagari, mathematical symbols — all of these are legal as raw UTF-8 bytes inside a JSON string. The string &quot;message&quot;: &quot;Hello 你好 😀&quot; is a perfectly valid JSON value. The \uXXXX escape syntax exists as an alternative encoding, not a requirement. Some libraries default to escaping non-ASCII characters anyway (Python json.dumps does this, jq has --ascii-output) to produce JSON that survives ASCII-only pipelines, but the resulting file means the same thing as the raw-UTF-8 version. Parsers must accept both forms and produce the same in-memory string.

What's the \uXXXX escape syntax in JSON?

\uXXXX is JSON&apos;s escape syntax for any Unicode codepoint in the Basic Multilingual Plane (U+0000 through U+FFFF). The four Xs are exactly four hexadecimal digits — uppercase or lowercase both work. So \u0041 is the letter A, \u4e2d is the Chinese character 中, and \u00e9 is é. For codepoints above U+FFFF (emoji, rare CJK characters, mathematical alphanumerics), JSON requires a surrogate pair: two \uXXXX escapes where the first falls in the range \uD800–\uDBFF (high surrogate) and the second in \uDC00–\uDFFF (low surrogate). The emoji 😀 (U+1F600) becomes \uD83D\uDE00. This is the same UTF-16 encoding scheme JavaScript uses internally, which is why every JSON parser handles it natively. A lone unpaired surrogate is technically invalid JSON, though many parsers accept it.

Why does Python json.dumps escape my Chinese characters by default?

Because json.dumps has ensure_ascii=True as the default. With that setting, every non-ASCII character is replaced with its \uXXXX escape — so {&quot;name&quot;: &quot;张伟&quot;} becomes {&quot;name&quot;: &quot;\u5f20\u4f1f&quot;}. The result is still valid JSON and parses back to the same string, but it is unreadable in a text editor and inflates the byte count roughly six-to-one for CJK text. Pass ensure_ascii=False to emit the characters as raw UTF-8 bytes: json.dumps(data, ensure_ascii=False). The historical rationale was that ASCII-safe JSON survives email, terminals, and tools that mishandle non-ASCII bytes, but in 2026 every reasonable JSON pipeline handles UTF-8 fine. Most teams set ensure_ascii=False as a project default. Node.js JSON.stringify has the opposite default — it emits raw UTF-8 — so the same data round-tripped through both languages looks different unless you align the settings.

Should I include charset=utf-8 in the Content-Type header?

No. The IANA registration for application/json explicitly states that the media type has no required or optional parameters — charset is forbidden. RFC 8259 §11 confirms this: the only legal Content-Type for JSON is application/json with no parameters. The reason is that JSON&apos;s encoding is already fixed by the format itself (UTF-8, per §8.1), so a charset parameter is redundant at best and contradictory at worst. Some servers ship Content-Type: application/json; charset=utf-8 anyway, and clients tolerate it — browsers, fetch, axios, requests all parse the body fine — but it is technically a violation of the spec. The strict-correct value is just application/json. The only place you may legitimately need a charset parameter is when serving JSON under a different media type (text/plain for debug, or a custom +json type), where the parent type does allow charset.

Is BOM allowed in JSON files?

No. RFC 8259 §8.1 is explicit: implementations MUST NOT add a byte order mark (U+FEFF) to the output of JSON text, and §1.5 of the ECMA-404 standard says JSON parsers MAY ignore one but are not required to. In practice, the consensus has hardened toward rejection: Node.js JSON.parse throws on a BOM-prefixed string, Python json.loads throws (the BOM appears as U+FEFF at the start of the input and json fails to parse it as a token), Go encoding/json rejects it, and jq fails with &quot;parse error&quot;. The PowerShell ConvertTo-Json on Windows is the most common source of BOM-prefixed JSON files because PowerShell&apos;s Out-File defaults to UTF-8-with-BOM. Strip the first three bytes (EF BB BF) before parsing, or save without BOM. Modern editors (VS Code, Sublime, vim) default to no-BOM UTF-8 for JSON files.

How do surrogate pairs work in JSON?

Surrogate pairs are JSON&apos;s way of encoding codepoints above U+FFFF using only \uXXXX escapes, which by themselves can express only 16 bits. The pair scheme borrows from UTF-16: split the codepoint above U+FFFF into a high surrogate (U+D800–U+DBFF) and a low surrogate (U+DC00–U+DFFF), then emit both as consecutive \uXXXX escapes. For 😀 (U+1F600), subtract 0x10000 to get 0xF600, split into two 10-bit halves (0x3D and 0x200), add 0xD800 and 0xDC00 respectively to get 0xD83D and 0xDE00, and emit \uD83D\uDE00. Every JSON parser combines the two back into the single codepoint when reading. Raw UTF-8 bytes work too — most modern emitters prefer them since they read better and produce smaller output — but the surrogate-pair escape form remains valid and is what ASCII-only emitters (Python ensure_ascii=True, jq --ascii-output) produce.

Why does my UTF-8 JSON file display as garbage in some viewers?

Three common causes. First, the viewer is interpreting the bytes as the wrong encoding — typically Windows-1252 or Latin-1 — so multi-byte UTF-8 sequences appear as sequences of accented characters (the famous mojibake like é instead of é). Set the editor to UTF-8 or save and reopen. Second, the file has been double-encoded: the producer wrote UTF-8 bytes but a downstream tool then re-encoded those bytes as if they were Latin-1, producing nonsense. Inspect the raw bytes with xxd or a hex editor — if you see the bytes c3 83 c2 a9 where é should be (two two-byte sequences instead of one), that&apos;s double-encoded UTF-8. Third, the file has a BOM and the viewer treats the BOM as visible characters. Strip the leading EF BB BF. All three are storage and transport bugs, not JSON-spec bugs — the JSON itself is fine, the bytes around it are not.

Further reading and primary sources