Emoji and Unicode in JSON: Storage, Length Counting, ZWJ Sequences, and Skin Tones
Emoji in JSON are not a special data type: they are just Unicode codepoints inside ordinary string values. The trouble starts downstream. JavaScript's string.length counts UTF-16 code units and returns 8 for a family emoji, MySQL's legacy utf8 charset cannot store anything above U+FFFF, JSON Schema validators disagree on whether maxLength means codepoints or code units, and regex character classes written without the u flag mishandle surrogate pairs. This guide walks through the four layers an emoji passes through (JSON text, in-memory string, database column, validation rule) and shows what breaks at each layer and what to write instead. Examples target JavaScript and SQL because that is where most bug reports are filed, but the rules (UTF-8 transport, grapheme-aware counting, 4-byte-safe storage, u-flag regex) apply to every language.
Got a JSON file with emoji that won't parse, or escapes that look mangled? Paste it into Jsonic's JSON Validator — it flags broken surrogate pairs, bad \u escapes, and encoding issues with exact line numbers.
Why emoji break string.length: surrogate pairs in JSON
JSON string values are sequences of Unicode codepoints, but JavaScript stores strings as UTF-16 — a 16-bit encoding that can only represent codepoints up to U+FFFF in a single code unit. Anything above that range (the supplementary planes where every modern emoji lives) is encoded as a surrogate pair: a high surrogate in the range U+D800–U+DBFF followed by a low surrogate in U+DC00–U+DFFF. The grinning face emoji U+1F600 becomes the pair \uD83D\uDE00 in UTF-16.
The .length property counts code units, not codepoints. So "😀".length returns 2, not 1 — and most people see this for the first time when a maxLength check on a user comment field rejects an emoji that looks single-character to the user. The same string has 4 UTF-8 bytes on the wire, 1 codepoint, and 1 grapheme.
// Same emoji, four different counts
// (semicolons matter here: without them, a line starting with "[" continues the previous statement)
const e = "😀"; // U+1F600 GRINNING FACE
e.length; // 2: UTF-16 code units
[...e].length; // 1: codepoints (the string iterator yields one codepoint at a time)
new TextEncoder().encode(e).length; // 4: UTF-8 bytes
[...new Intl.Segmenter().segment(e)].length; // 1: graphemes
JSON itself does not treat surrogates as a special case: in escaped form they are just two adjacent \uXXXX escapes. The trap is that an unpaired surrogate (a high without a low, or vice versa) is not valid Unicode text. Strict parsers and UTF-8 encoders may reject or replace it, although JavaScript's JSON.parse accepts it and returns an ill-formed string. The usual cause is truncating a user-provided string to a fixed code-unit budget in the middle of a pair.
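To make the surrogate arithmetic concrete, here is a minimal sketch that derives the pair for a codepoint by hand and scans a string for unpaired halves. The helper names toSurrogatePair and hasLoneSurrogate are hypothetical, chosen for illustration; newer runtimes also offer String.prototype.isWellFormed for the second job.

```javascript
// Sketch only: toSurrogatePair and hasLoneSurrogate are illustrative names.

// Split a supplementary-plane codepoint into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("only U+10000..U+10FFFF are encoded as a pair");
  }
  const offset = codePoint - 0x10000;     // 20-bit value
  const high = 0xd800 + (offset >> 10);   // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits
  return [high, low];
}

// Return true if the string contains a high or low surrogate without its partner.
function hasLoneSurrogate(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      const next = str.charCodeAt(i + 1); // NaN past the end of the string
      if (!(next >= 0xdc00 && next <= 0xdfff)) return true;
      i++; // skip the low half of a valid pair
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      return true; // low surrogate with no preceding high
    }
  }
  return false;
}

const [high, low] = toSurrogatePair(0x1f600);
console.log(high.toString(16), low.toString(16));       // d83d de00
console.log(hasLoneSurrogate("\u{1F600}"));             // false
console.log(hasLoneSurrogate("\u{1F600}".slice(0, 1))); // true: cut mid-pair
```

Running a check like this before serializing is one way to catch strings that were truncated by code-unit count somewhere upstream.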
Storing emoji in JSON: native UTF-8 vs \uXXXX escapes
The JSON spec (RFC 8259) allows any Unicode codepoint in string content in one of two forms: as native UTF-8 bytes when the file is UTF-8 encoded (the default and recommended), or as backslash-u escapes. Emoji above U+FFFF use a pair of escapes, one per surrogate. Both forms parse to the same string.
// Form 1: native UTF-8 — readable, shorter, recommended
{
"reaction": "🔥",
"user": "Alice 👋",
"family": "👨👩👧"
}
// Form 2: \uXXXX escapes — pure ASCII output, identical semantics
{
"reaction": "\uD83D\uDD25",
"user": "Alice \uD83D\uDC4B",
"family": "\uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC67"
}
When to choose each form. Native UTF-8 wins on readability, file size, and diffability: you can grep for the emoji, and a code review shows the emoji itself. Use the escape form when output must be pure ASCII: log pipelines that cannot handle high bytes, transports that downgrade to ASCII for safety, or JSON embedded in a context that has its own escape layer (HTML attributes, shell strings, CSV cells) where you want to skip an encoding step.
Encoder defaults differ, so check yours. Python's json.dumps escapes all non-ASCII output by default (ensure_ascii=True); pass ensure_ascii=False to emit native UTF-8. Node's built-in JSON.stringify emits native characters and has no ASCII-only option, so escape non-ASCII yourself if you need it. Other libraries usually expose an equivalent switch under their own name.
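Since JSON.stringify has no ASCII-only mode, one way to get the escaped form in Node is to post-process its output. The sketch below assumes the input is JSON-serializable (JSON.stringify returns a string, not undefined); stringifyAscii is a hypothetical helper name.

```javascript
// Sketch only: stringifyAscii is an illustrative name, not a built-in.
// The regex deliberately has no u flag, so it matches one UTF-16 code unit at a
// time and each half of a surrogate pair gets its own \uXXXX escape.
function stringifyAscii(value) {
  return JSON.stringify(value).replace(/[\u0080-\uffff]/g, (unit) =>
    "\\u" + unit.charCodeAt(0).toString(16).padStart(4, "0")
  );
}

const ascii = stringifyAscii({ reaction: "\u{1F525}" });
console.log(ascii);                      // {"reaction":"\ud83d\udd25"}
console.log(JSON.parse(ascii).reaction); // the original fire emoji
```

Non-ASCII characters only ever appear inside string literals in JSON.stringify output, which is why a plain text replacement is safe here.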
See our JSON UTF-8 encoding guide for the encoding layer in depth, and JSON string escaping for the full table of valid escape sequences.
ZWJ sequences: family emoji are 5+ codepoints joined by U+200D
A zero-width joiner sequence is a chain of emoji codepoints glued together by U+200D (ZERO WIDTH JOINER) so renderers present them as a single combined glyph. The codepoint U+200D has zero width and renders invisibly when no joining rule applies — but between two emoji that the font knows how to combine, it triggers a merged glyph: man + ZWJ + woman + ZWJ + girl becomes 👨👩👧.
// Family of three: man, woman, girl
// 5 codepoints, 8 UTF-16 code units, 18 UTF-8 bytes, 1 grapheme
"👨" U+1F468 MAN
"" U+200D ZERO WIDTH JOINER
"👩" U+1F469 WOMAN
"" U+200D ZERO WIDTH JOINER
"👧" U+1F467 GIRL
// As JSON escapes:
"\uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC67"
ZWJ sequences power three categories: family emoji (combinations of adults and children), profession emoji (person + ZWJ + object, for example woman + ZWJ + laptop → 👩‍💻 woman technologist), and identity emoji (rainbow flag, transgender flag, pirate flag). The Unicode emoji data files (UTS #51) include emoji-zwj-sequences.txt, which lists every sequence font vendors are expected to support.
Graceful degradation: a renderer that does not recognize a ZWJ sequence still produces a sensible result — it just shows the individual emoji side by side with the ZWJs invisible. So a custom or rare sequence appears as separate glyphs but never as a missing-character box. This is by design and means new sequences can roll out without breaking old systems.
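The four counts quoted for the family emoji can be reproduced with a small helper. This is a sketch that assumes a runtime with Intl.Segmenter and TextEncoder; measure is a hypothetical name, and the string is spelled with escapes so the invisible joiners cannot be lost in copy and paste.

```javascript
// Sketch only: measure is an illustrative helper name.
function measure(str) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  return {
    codeUnits: str.length,                           // UTF-16 code units
    codePoints: [...str].length,                     // string iterator walks codepoints
    utf8Bytes: new TextEncoder().encode(str).length, // bytes on the wire
    graphemes: [...segmenter.segment(str)].length,   // user-perceived characters
  };
}

// man + ZWJ + woman + ZWJ + girl
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";
console.log(measure(family));
// { codeUnits: 8, codePoints: 5, utf8Bytes: 18, graphemes: 1 }

// List the codepoints that make up the sequence.
for (const ch of family) {
  console.log("U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));
}
// U+1F468, U+200D, U+1F469, U+200D, U+1F467
```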
Skin tone modifiers: Fitzpatrick scale + base emoji
Skin tone modifiers are five codepoints from the Fitzpatrick dermatology scale — U+1F3FB through U+1F3FF (types 1-2 through 6) — that you place immediately after a person, hand, or body-part emoji to tint it. In JSON they are just two adjacent codepoints in the string: a thumbs-up with medium skin tone is U+1F44D (👍) followed by U+1F3FD, written natively as "👍🏽".
// Skin tone modifier table (Fitzpatrick scale)
U+1F3FB 🏻 Light (type 1-2)
U+1F3FC 🏼 Medium-light (type 3)
U+1F3FD 🏽 Medium (type 4)
U+1F3FE 🏾 Medium-dark (type 5)
U+1F3FF 🏿 Dark (type 6)
// Base + modifier in JSON
{ "wave": "👋🏽" }
// = U+1F44B WAVING HAND + U+1F3FD MEDIUM SKIN TONE
// = 2 codepoints, 4 UTF-16 code units, 8 UTF-8 bytes, 1 grapheme
// Escaped form
{ "wave": "\uD83D\uDC4B\uD83C\uDFFD" }
Modifiers combine with ZWJ sequences to multiply codepoint counts. A family of two adults and a child, each with its own tone, is a long string: base + tone + ZWJ + base + tone + ZWJ + base + tone. That is eight codepoints, fourteen UTF-16 code units (six supplementary-plane codepoints at two units each, plus two ZWJs at one unit each), 30 UTF-8 bytes, and one grapheme. Counting bugs at this layer are a common source of emoji tickets in consumer apps that allow free-form text.
A few base emoji do not accept skin tone modifiers — most notably the generic person silhouettes that already represent any tone (the yellow default). When a modifier follows an emoji that does not support it, the modifier renders as its own little color swatch glyph next to the unmodified emoji. This is rare and almost always a data-entry mistake.
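Whether a base emoji accepts a modifier can be checked with the Emoji_Modifier_Base Unicode property, which JavaScript regexes expose under the u flag. The sketch below uses hypothetical helper names (acceptsSkinTone, withSkinTone, stripSkinTones) and handles single-codepoint bases only, not ZWJ sequences.

```javascript
// Sketch only: helper names are illustrative.
const SKIN_TONES = /[\u{1F3FB}-\u{1F3FF}]/gu; // the five Fitzpatrick modifiers

// True if the single codepoint in `base` is designed to take a modifier.
function acceptsSkinTone(base) {
  return /^\p{Emoji_Modifier_Base}$/u.test(base);
}

// Append the modifier only when the base supports it.
function withSkinTone(base, tone) {
  return acceptsSkinTone(base) ? base + tone : base;
}

// Remove every modifier, e.g. to normalize text before comparing or searching.
function stripSkinTones(str) {
  return str.replace(SKIN_TONES, "");
}

const MEDIUM = "\u{1F3FD}";
console.log(acceptsSkinTone("\u{1F44D}"));                        // true  (thumbs up)
console.log(acceptsSkinTone("\u{1F525}"));                        // false (fire)
console.log(withSkinTone("\u{1F44B}", MEDIUM).length);            // 4 code units
console.log(stripSkinTones("\u{1F44B}\u{1F3FD}") === "\u{1F44B}"); // true
```

Guarding writes with a check like acceptsSkinTone is one way to avoid the stray color-swatch glyph described above.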
Counting graphemes: Intl.Segmenter, grapheme-splitter, segments-iterator
For any user-facing length check — a character budget, a validation message, a cursor position — count graphemes, not codepoints or code units. JavaScript's built-in answer is Intl.Segmenter with granularity "grapheme", which implements the Unicode UAX #29 grapheme cluster rules. It collapses ZWJ sequences, skin tone pairs, regional indicator flags, and combining marks all into single segments.
// Modern, built-in (Node 16+, Chrome 87+, Safari 14.1+, Firefox 125+)
function countGraphemes(str) {
const seg = new Intl.Segmenter("en", { granularity: "grapheme" })
return [...seg.segment(str)].length
}
countGraphemes("a") // 1
countGraphemes("😀") // 1
countGraphemes("👍🏽") // 1 (base + skin tone)
countGraphemes("👨👩👧") // 1 (ZWJ family)
countGraphemes("🇯🇵") // 1 (regional indicator pair = flag)
countGraphemes("Hello 👋🏽!") // 8
For older runtimes the grapheme-splitter library has been the de-facto polyfill for years. It is a single-file pure-JS implementation of the UAX #29 rules, and for common emoji sequences its results match Intl.Segmenter. Its rules are fixed at the Unicode version it was built against, so test newer emoji before relying on it.
// grapheme-splitter polyfill — for runtimes without Intl.Segmenter
import GraphemeSplitter from 'grapheme-splitter'
const splitter = new GraphemeSplitter()
splitter.countGraphemes("👨👩👧") // 1
splitter.splitGraphemes("Hi 👋🏽") // ["H", "i", " ", "👋🏽"]
What not to use: str.length (UTF-16 code units), Array.from(str).length (codepoints, which still splits a family emoji into 5), and any regex like /./gu for counting (it matches per codepoint, not per grapheme). For server-side JSON validation in non-JS environments, ICU's BreakIterator or Python's third-party regex module with \X give equivalent grapheme behavior.
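Counting is half the job; a character budget usually also needs truncation that never cuts inside an emoji. A minimal sketch, assuming Intl.Segmenter is available (truncateGraphemes is a hypothetical name):

```javascript
// Sketch only: truncateGraphemes is an illustrative helper name.
// Keeps at most maxGraphemes user-perceived characters, never splitting one.
function truncateGraphemes(str, maxGraphemes) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  let out = "";
  let count = 0;
  for (const { segment } of segmenter.segment(str)) {
    if (count === maxGraphemes) break;
    out += segment;
    count++;
  }
  return out;
}

const text = "Hi \u{1F44B}\u{1F3FD}!"; // "Hi " + waving hand with medium tone + "!"
console.log(truncateGraphemes(text, 4) === "Hi \u{1F44B}\u{1F3FD}"); // true
console.log(text.slice(0, 4) === "Hi \uD83D");                       // true: slice cut the pair
```

The second line shows why slice and substring are unsafe for budgets: they count code units and leave a lone high surrogate behind.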
Database storage: utf8mb4 vs utf8 (MySQL), CHAR vs VARCHAR sizing
MySQL has two UTF-8 charsets. utf8 (an alias for utf8mb3) is a non-standard 3-byte-maximum UTF-8 that cannot encode anything above U+FFFF, so nearly every emoji is impossible to store. utf8mb4 is the real, 4-byte-safe UTF-8 and has shipped since MySQL 5.5.3 (March 2010). Always use utf8mb4. There is no scenario in 2026 where picking plain utf8 is the right call.
-- MySQL: create an emoji-safe column
CREATE TABLE messages (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
body TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci
);
-- Convert an existing legacy table
ALTER TABLE messages
CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
-- Set the connection too (or set in my.cnf / driver config)
SET NAMES utf8mb4;
The utf8mb4_0900_ai_ci collation requires MySQL 8.0; on older servers use utf8mb4_unicode_ci instead.
Column sizing. Under utf8mb4, VARCHAR(n) holds up to n characters (codepoints), not n bytes, but row and index limits are in bytes and each character can cost up to 4 bytes. A VARCHAR(255) under utf8mb4 can take up to 1020 bytes of row space plus a length prefix, which matters when packing many columns into one row. Plan emoji-bearing columns as TEXT or VARCHAR(255) at most, and before migrating older wide columns such as VARCHAR(1000) to utf8mb4, check them against the 65,535-byte row-size limit and the index key-length limit of your row format.
Other databases: PostgreSQL TEXT, VARCHAR, JSON, and JSONB all accept full UTF-8 by default — see our PostgreSQL JSON guide. SQLite is UTF-8 everywhere. SQL Server needs NVARCHAR (not VARCHAR) and N-prefixed string literals: N'Hello 👋'. JSON in MongoDB covers BSON storage where strings are always UTF-8.
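On the application side, it can help to know before an insert whether a value needs 4-byte storage and how many bytes it will occupy. The following is a rough sketch with hypothetical helper names (needsUtf8mb4, utf8ByteLength, worstCaseBytes); it does not talk to a database.

```javascript
// Sketch only: helper names are illustrative.

// True if the string contains any codepoint above U+FFFF, which a
// 3-byte-capped charset (MySQL utf8/utf8mb3) cannot store.
function needsUtf8mb4(str) {
  return /[\u{10000}-\u{10FFFF}]/u.test(str);
}

// Actual UTF-8 size of a value.
function utf8ByteLength(str) {
  return new TextEncoder().encode(str).length;
}

// Upper bound on bytes for a VARCHAR(n) column under utf8mb4 (4 bytes per character).
function worstCaseBytes(varcharLength) {
  return varcharLength * 4;
}

console.log(needsUtf8mb4("caf\u00E9 \u263A"));       // false: both are in the BMP
console.log(needsUtf8mb4("\u{1F525}"));              // true
console.log(utf8ByteLength("\u{1F44B}\u{1F3FD}"));   // 8
console.log(worstCaseBytes(255));                    // 1020
```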
Validating emoji-only inputs (regex pitfalls and the modern way)
The right way to match emoji codepoints with regex is Unicode property escapes under the u (or modern v) flag: /\p{Emoji}/u matches any codepoint that has the Emoji property, and /\p{Emoji_Presentation}/u is stricter — it matches only codepoints that render as emoji by default (excluding digits and letters that have emoji variants).
// Modern: Unicode property escapes with the u flag
const hasEmoji = /\p{Emoji}/u.test("Hello 👋") // true
const onlyEmoji = /^\p{Emoji_Presentation}+$/u.test("🔥") // true
// ES2024 v flag with set operations
const emojiNotAscii = /[\p{Emoji}--\p{ASCII}]/v.test("1") // false (1 is ASCII; property names are case-sensitive)
// The classic trap: DON'T do this
// /[😀-🙏]/ // No u flag: the engine sees surrogate code units, and the range
// throws SyntaxError "Range out of order in character class"
const bad = /[😀🙏]/ // No u flag, no range: compiles, but matches lone surrogate halves
const good = /[\u{1F600}-\u{1F64F}]/u // With u flag: correct codepoint range
The pitfall that catches everyone is writing a character class across emoji without the u flag. Without it, the engine sees each high-plane emoji as two adjacent UTF-16 code units. A range such as [😀-🙏] then runs from a low surrogate to a high surrogate and throws a SyntaxError, which is at least loud. A class without a range, such as [😀🙏], compiles and runs but matches individual surrogate halves, so it also matches unrelated emoji that share a surrogate and never consumes a whole character. Always include u.
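Whole-grapheme matching (matching a ZWJ family as one unit) is hard with u-flag regex alone, because \p{Emoji} and its relatives match one codepoint at a time. Under the v flag, the string property \p{RGI_Emoji} matches whole sequences that Unicode recommends for general interchange, including ZWJ families and skin-tone pairs, but it does not cover custom sequences and needs a recent engine. The approach that works everywhere is to segment the string with Intl.Segmenter first, then run regex tests against each grapheme segment. For JSON Schema validation of emoji-only fields, this means the schema cannot be relied on to express the rule fully; enforce it in application code after the schema-level check.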
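The segment-then-test approach might look like the sketch below. It is deliberately loose: a grapheme counts as emoji if it contains at least one Extended_Pictographic codepoint or is a pair of regional indicators (a flag). Keycap sequences and a few text-style symbols would need extra rules, and isEmojiOnly is a hypothetical name.

```javascript
// Sketch only: isEmojiOnly is an illustrative helper with intentionally simple rules.
function isEmojiOnly(str) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  const segments = [...segmenter.segment(str)];
  if (segments.length === 0) return false; // the empty string is not "emoji only"
  return segments.every(
    ({ segment }) =>
      /\p{Extended_Pictographic}/u.test(segment) ||   // pictographs, incl. ZWJ and tone sequences
      /^\p{Regional_Indicator}{2}$/u.test(segment)    // flags such as the JP pair
  );
}

console.log(isEmojiOnly("\u{1F525}"));                                // true
console.log(isEmojiOnly("\u{1F468}\u200D\u{1F469}\u200D\u{1F467}"));  // true: one ZWJ grapheme
console.log(isEmojiOnly("\u{1F1EF}\u{1F1F5}"));                       // true: flag
console.log(isEmojiOnly("a\u{1F525}"));                               // false
```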
Common bugs: ::text → varchar truncation, emoji indexing, JSON Schema length
Three recurring bug shapes in emoji-bearing JSON pipelines are worth memorizing because they account for most of the production tickets in this area.
1. PostgreSQL cast truncation. An explicit cast to VARCHAR(n) silently truncates at n characters (codepoints). PostgreSQL never splits a codepoint, but it knows nothing about graphemes: if a JSONB field extracted as text and cast with ::VARCHAR(280) is cut inside a ZWJ sequence or between a base emoji and its skin tone modifier, the column ends up holding a partial emoji, such as a man followed by a dangling ZWJ. Size truncation targets generously, and when you must truncate, do it at a grapheme boundary in the application layer rather than relying on the cast or on LEFT(value, n).
2. Index keys with emoji. JSON object keys and JSON Pointer paths technically allow any Unicode, but many tooling layers (URL encoding, shell variables, environment variables, query string keys) do not survive 4-byte UTF-8 in identifier-like positions. Keep emoji out of keys by policy: use ASCII or normalized identifiers as keys and put the emoji in the value.
3. JSON Schema length validators. The spec (draft 2019-09 and later) defines minLength/maxLength as counting characters as RFC 8259 defines them, that is, Unicode codepoints. Validators that simply compare str.length count UTF-16 code units instead. Ajv counts codepoints by default in current releases; some legacy stacks do not. The result is the same emoji passing validation in one service and failing in another, which is hard to triage. Pin validator versions and write a test that asserts a family emoji passes maxLength: 5 (5 codepoints) but fails maxLength: 4. For grapheme-perceived limits, enforce in application code; JSON Schema has no grapheme-aware length keyword. See JSON Schema string formats for related length and format rules, and JSON i18n for the broader locale and language story.
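The difference between the two counting rules in item 3 is easy to pin down in a test. This sketch does not call any real validator; checkMaxLength is a hypothetical stand-in that mimics a codepoint-counting validator and a legacy code-unit-counting one.

```javascript
// Sketch only: checkMaxLength imitates two validator behaviors; it is not a validator API.
function checkMaxLength(str, maxLength, counting = "codepoints") {
  const length =
    counting === "codeUnits"
      ? str.length        // legacy behavior: UTF-16 code units
      : [...str].length;  // spec behavior: Unicode codepoints
  return length <= maxLength;
}

const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}"; // 5 codepoints, 8 code units

console.log(checkMaxLength(family, 5));              // true:  5 codepoints fit
console.log(checkMaxLength(family, 4));              // false: one codepoint over
console.log(checkMaxLength(family, 5, "codeUnits")); // false: legacy count is 8
console.log(checkMaxLength(family, 8, "codeUnits")); // true
```

Replacing the stand-in with your actual validator and keeping the same four assertions gives a regression test that fails loudly if a dependency upgrade changes the counting rule.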
Key terms
- codepoint
- A single Unicode scalar value, written U+XXXX or U+XXXXX. The grinning face emoji is U+1F600. JSON Schema length validators count codepoints.
- surrogate pair
- The UTF-16 encoding of a codepoint above U+FFFF: a high surrogate (U+D800–U+DBFF) followed by a low surrogate (U+DC00–U+DFFF). Each emoji above U+FFFF takes one surrogate pair (two UTF-16 code units, two \uXXXX escapes in JSON).
- grapheme cluster
- The user-perceived character: one or more codepoints that render as a single visible unit. A ZWJ family emoji is one grapheme made of 5+ codepoints. Defined by Unicode UAX #29; computed in JavaScript via Intl.Segmenter.
- ZWJ sequence
- A chain of emoji codepoints joined by U+200D (ZERO WIDTH JOINER) that fonts render as a single combined glyph: family emoji, profession emoji, identity flags.
- Fitzpatrick skin tone modifier
- One of five codepoints U+1F3FB through U+1F3FF placed after a person, hand, or body-part emoji to tint it. Forms one grapheme with the base emoji.
- utf8mb4
- MySQL's real 4-byte-safe UTF-8 charset, shipped since 5.5.3 (March 2010). The legacy utf8 (alias for utf8mb3) is 3-byte-capped and cannot store any emoji above U+FFFF.
- Intl.Segmenter
- The built-in JavaScript API for splitting strings into graphemes, words, or sentences per Unicode rules. Available in Node 16+, Chrome 87+, Safari 14.1+, Firefox 125+.
Frequently asked questions
How do I store emoji in a JSON field?
Write the file as UTF-8 without a BOM and paste the emoji directly into the string value — JSON natively supports any Unicode codepoint in string content. {"reaction": "🔥"} is valid JSON. The other legal form is a backslash-u escape, which spells the emoji as one or two surrogate code units: {"reaction": "\uD83D\uDD25"}. Both round-trip identically through any conforming parser. Choose native UTF-8 when humans will read the file (it is shorter and recognizable), and choose escapes when you need pure ASCII output for log pipelines, legacy systems, or transports that strip high bytes. On the storage side, make sure the database column accepts 4-byte UTF-8: utf8mb4 on MySQL, the default UTF-8 on PostgreSQL and SQLite, NVARCHAR on SQL Server. If you store the raw JSON text in a TEXT column, the column charset alone decides whether the emoji survives.
Why does '👨👩👧'.length return 8 in JavaScript?
JavaScript strings are sequences of 16-bit UTF-16 code units, not codepoints or graphemes — and .length counts code units. The family emoji 👨👩👧 is a zero-width joiner sequence built from three base emoji glued together with U+200D: man (U+1F468), ZWJ (U+200D), woman (U+1F469), ZWJ (U+200D), girl (U+1F467). Each of the three person emoji lives outside the Basic Multilingual Plane, so each one takes two UTF-16 code units (a surrogate pair) — that is 6 code units for the people. The two ZWJs are each one code unit, adding 2. Total: 8. The human-perceived count is 1, the codepoint count is 5, the UTF-16 code unit count is 8, and the UTF-8 byte count is 18. Use Intl.Segmenter with granularity "grapheme" when you want the human answer.
How do I count emoji as one character?
Use Intl.Segmenter — it is built into V8 and JavaScriptCore and has been available in Node 16+, Chrome 87+, Safari 14.1+, and Firefox 125+. Create a segmenter with granularity "grapheme" and call segment(): const seg = new Intl.Segmenter("en", { granularity: "grapheme" }); const count = [...seg.segment("👨👩👧")].length; // 1. Grapheme segmentation follows the Unicode UAX #29 rules — ZWJ sequences, regional indicator pairs (flags), Fitzpatrick skin tone modifiers, combining marks, and Hangul syllables all collapse to one segment. If you need the same logic in older runtimes, the grapheme-splitter library has been the de-facto polyfill for years and gives matching results. Avoid Array.from(string).length — that counts codepoints, not graphemes, and still splits a family emoji into 5 pieces.
Why are emoji corrupted when stored in MySQL?
Almost always because the column, table, or connection is using MySQL's old utf8 charset instead of utf8mb4. MySQL's legacy utf8 charset is a non-standard 3-byte-maximum UTF-8. It cannot encode anything above U+FFFF, which excludes nearly every emoji (they live in the supplementary planes). When the driver pushes a 4-byte UTF-8 sequence into a utf8 column, the server either rejects the insert with an "Incorrect string value" error (strict SQL mode) or stores the value truncated at the first unsupported character and raises only a warning. The fix is to use utf8mb4 everywhere: ALTER TABLE messages CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci; and set character_set_client, character_set_connection, and character_set_results to utf8mb4 on the connection. utf8mb4 has shipped in MySQL since 5.5.3 (March 2010), so there is no reason to pick plain utf8 today.
What is a ZWJ sequence?
A zero-width joiner sequence is a chain of emoji codepoints glued together by U+200D (ZERO WIDTH JOINER) so that font rendering presents them as a single combined glyph. The classic examples are family emoji (man + ZWJ + woman + ZWJ + girl renders as 👨‍👩‍👧), profession emoji (woman + ZWJ + laptop renders as a woman technologist), and identity emoji (rainbow flag, transgender flag). Each ZWJ-joined element can itself be several codepoints: a person with a skin tone modifier is base + modifier, and the ZWJ chain continues from there. The Unicode emoji data files (UTS #51) include emoji-zwj-sequences.txt, which lists every sequence that font vendors are expected to support. Sequences that no font knows about still render: they appear as the individual emoji laid out side by side, with the ZWJs invisible and no combined glyph.
How do skin tone modifiers work in JSON?
Skin tone modifiers are five Fitzpatrick-scale codepoints — U+1F3FB (light), U+1F3FC (medium-light), U+1F3FD (medium), U+1F3FE (medium-dark), U+1F3FF (dark) — that you place immediately after a person, hand, or body-part emoji to tint it. In JSON the string just contains both codepoints in order: "👍🏽" is the thumbs-up emoji (U+1F44D) followed by the medium tone modifier (U+1F3FD), and a renderer that supports the combination shows them as one glyph. As an escape sequence the same string is "\uD83D\uDC4D\uD83C\uDFFD" (two surrogate pairs, one per codepoint). Combined with ZWJ sequences, skin tones can appear multiple times — a family of two adults of different tones is a long sequence. For storage and counting, treat the modifier as part of the preceding grapheme — Intl.Segmenter does this correctly.
Can I use regex to match emoji in JSON?
Use Unicode property escapes with the u (or v) flag: /\p{Emoji}/u matches any single codepoint with the Emoji property. \p{Emoji_Presentation} is stricter: it matches only codepoints that render as emoji by default (not the textual digits or letters that have emoji variants). The v flag (ES2024) adds set operations and is the modern recommendation: /[\p{Emoji}--\p{ASCII}]/v matches emoji codepoints that are not plain ASCII. Property names are case-sensitive, so write \p{ASCII}, not \p{Ascii}. The trap that catches everyone: without the u flag the engine treats each high-plane emoji as two surrogate code units, so a range like /[😀-🙏]/ throws a "Range out of order in character class" SyntaxError, and a plain class like /[😀🙏]/ compiles but matches lone surrogate halves. Always include the u flag. For ZWJ sequences and skin-tone-modified emoji, the v flag's \p{RGI_Emoji} matches the standard sequences as whole units; for anything else, segment the string with Intl.Segmenter first and test segments individually.
How do JSON Schema string length validators count emoji?
JSON Schema defines minLength and maxLength as counting Unicode codepoints, per draft 2019-09 and later: not UTF-16 code units, not UTF-8 bytes, not graphemes. A family emoji 👨‍👩‍👧 has 5 codepoints, so it consumes 5 against a maxLength budget. Validators that measure str.length instead count UTF-16 code units and give 8 for the same emoji. The behavior matters when limits are tight: a "tweet-like" maxLength: 280 with a spec-conforming validator allows 280 codepoints (potentially many more emoji than a 280-code-unit limit would). Always check which behavior your validator uses; Ajv counts codepoints by default in current releases. If you need grapheme-based limits (the user-perceived count), you must enforce them in application code with Intl.Segmenter; JSON Schema itself has no grapheme-aware length keyword.
Further reading and primary sources
- RFC 8259 — The JSON Data Interchange Format — Section 8.1 requires UTF-8 on the wire; section 7 covers string content and escape syntax
- Unicode UAX #29 — Text Segmentation — The authoritative grapheme cluster boundary rules that Intl.Segmenter implements
- Unicode CLDR emoji-zwj-sequences — Every ZWJ sequence font vendors are expected to render as a combined glyph
- MDN — Intl.Segmenter — The built-in grapheme, word, and sentence segmenter for JavaScript
- MySQL Reference — The utf8mb4 Character Set — Why utf8mb4 is the only safe choice for emoji and full UTF-8 in MySQL