JSON to Snowflake: Loading with COPY INTO, VARIANT Type, and Querying
Last updated:
Snowflake handles JSON natively through the VARIANT type — a single column can hold an arbitrary JSON value up to 16 MB compressed, and the engine decodes path expressions like v:user.email::STRING directly at query time. The loading pipeline is three pieces: a stage (a pointer to files in S3, Azure, GCS, or Snowflake-managed storage), a file format (instructions for parsing JSON), and a COPY INTO statement that reads from the stage and writes to a table. For ingesting an array of objects, the single most important flag is STRIP_OUTER_ARRAY = TRUE — without it you get one giant row per file instead of one row per object. Once loaded, querying is dot-and-colon notation plus optional LATERAL FLATTEN for arrays. This guide walks the whole pipeline, the cost and performance tradeoffs versus typed columns, and the validation patterns that catch malformed payloads early.
Before you ship JSON to Snowflake, sanity-check the payload. Paste it into Jsonic's JSON Validator — it catches trailing commas, unquoted keys, and bracket mismatches with exact line numbers, so COPY INTO doesn't silently skip rows on ON_ERROR = CONTINUE.
Snowflake VARIANT type: semi-structured JSON storage
VARIANTis Snowflake's built-in semi-structured type. A single VARIANT column can hold any JSON value — object, array, string, number, boolean, or null — up to 16 MB compressed per row. Internally Snowflake encodes the payload in a columnar binary format derived from the JSON structure, similar in spirit to Apache Arrow: repeated field names become column references, scalars are stored next to scalars of the same type, and compression operates on those physical columns. The practical effect is that reading v:user.email from a billion-row table scans only the bytes that hold the email field, not the whole document.
CREATE TABLE events (
raw VARIANT,
loaded_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);
-- Specialized variants for stricter contracts:
CREATE TABLE users (
profile OBJECT, -- accepts only JSON objects
tags ARRAY -- accepts only JSON arrays
);OBJECT and ARRAY are specialized VARIANTs that constrain the top-level shape — an OBJECT column rejects a row whose value is a bare string or number at insert time. Storage and query cost are the same as VARIANT; the difference is type discipline. Most teams default to VARIANT for incoming JSON because source data is rarely as constrained as a strict OBJECT contract requires.
Documents larger than 16 MB compressed need to be split before insert — either shard the array at the source or unnest one level into multiple rows. See the JSON in databases guide for the cross-engine patterns.
Stage setup: internal stage vs external stage (S3/Azure/GCS)
A stage is a Snowflake-known location where files live before they become rows. There are two flavors. Internal stages are managed by Snowflake — you upload files with PUT from SnowSQL or a connector, and Snowflake stores them in its own cloud storage. External stages point at S3, Azure Blob Storage, or Google Cloud Storage and require credentials or a storage integration. External is the right choice when your data already lives in cloud storage (a typical ETL pipeline drops exports into a bucket); internal is simpler for ad-hoc loads and quick experiments.
-- External stage on S3 with an existing storage integration
CREATE STAGE my_stage
URL = 's3://my-bucket/exports/'
STORAGE_INTEGRATION = my_s3_int
FILE_FORMAT = my_json;
-- Internal stage (no URL needed)
CREATE STAGE internal_stage FILE_FORMAT = my_json;
-- Then upload from SnowSQL:
-- PUT file:///tmp/events.json @internal_stage;
-- Inspect what's in the stage
LIST @my_stage;Storage integrations are the recommended way to authenticate external stages — they replace static AWS keys with a Snowflake-managed IAM role that the bucket trusts. The integration is created once (CREATE STORAGE INTEGRATION), then any number of stages reference it. Static CREDENTIALS on the stage still work but mean rotating keys in two places.
The LIST @my_stage command shows the files Snowflake can see — run it before COPY INTO to confirm path patterns and detect missing uploads.
COPY INTO with FILE_FORMAT = (TYPE = JSON)
COPY INTO reads files from a stage, parses them with a file format, and writes rows to a target table. The file format is a reusable object that encodes the parsing rules — for JSON, that means at minimum TYPE = JSON, plus a handful of options that change how documents are split into rows.
-- File format: parse JSON, treat the outer array as a row source
CREATE OR REPLACE FILE FORMAT my_json
TYPE = JSON
STRIP_OUTER_ARRAY = TRUE
COMPRESSION = AUTO
ALLOW_DUPLICATE = FALSE
STRIP_NULL_VALUES = FALSE
IGNORE_UTF8_ERRORS = FALSE;
-- Load every file in the stage matching events_*.json
COPY INTO events (raw)
FROM @my_stage
FILE_FORMAT = (FORMAT_NAME = my_json)
PATTERN = '.*events_.*\.json'
ON_ERROR = CONTINUE
PURGE = FALSE;The flags worth knowing:
STRIP_OUTER_ARRAY = TRUE— the single most consequential flag for array-shaped exports (more on this in the next section)COMPRESSION = AUTO— auto-detects gzip, bzip2, brotli, zstd, deflate from file extensionSTRIP_NULL_VALUES = TRUE— dropsnullvalues from inserted JSON; useful when nulls inflate VARIANT size for no valueON_ERROR = CONTINUE— skip rows whose JSON fails to parse instead of aborting the whole load (pair withVALIDATION_MODEfor dry runs)PURGE = TRUE— delete source files after successful load (useful for queue-style buckets; risky for replays)
For loading a single column of raw JSON, the syntax is COPY INTO events (raw) — the parenthesized column list maps each parsed document to one column. To split fields into typed columns at load time, use a SELECT in the FROM clause instead.
Strip outer array vs one row per JSON object
Snowflake's default behavior is to treat every top-level JSON value in a staged file as one row. A file containing a single object becomes one row; a file containing five line-separated objects (JSONL) becomes five rows. A file containing a top-level array — the common shape from a JavaScript JSON.stringify(items) export — becomes one row, because the outer array is itself one JSON value. That is almost never what you want for analytics.
-- Source: events.json
[
{"id": 1, "type": "signup"},
{"id": 2, "type": "login"},
{"id": 3, "type": "purchase"}
]
-- Without STRIP_OUTER_ARRAY: 1 row, raw is the whole array
-- With STRIP_OUTER_ARRAY = TRUE: 3 rows, raw is each objectSTRIP_OUTER_ARRAY = TRUE tells the loader to walk the outer array and emit one row per element. Set it on the file format, not on the COPY INTO call directly, so the rule applies consistently across loads.
The alternative is to export the source as JSONL (newline-delimited JSON) with one object per line — Snowflake reads it as one row per line out of the box without any flag. JSONL is often the better choice for very large exports because Snowflake can split the file and load chunks in parallel; a single giant array file is harder to parallelize. See the JSONL format guide for the layout details.
To preview what each row will look like before committing to a load, query the stage directly:
SELECT $1 AS row_value
FROM @my_stage (FILE_FORMAT => my_json)
LIMIT 5;If each $1 shows a single object, you are ready to COPY INTO. If it shows the whole array, fix the file format first.
Querying VARIANT: dot notation, GET, GET_PATH, FLATTEN
Once the data is loaded, querying is path-based. The colon : introduces the first field, dots descend into nested objects, brackets index into arrays, and the :: operator casts the final value to a SQL type.
-- Source row in raw:
-- { "user": { "email": "a@b.com", "age": 30 },
-- "items": [ {"sku": "X1", "qty": 2}, {"sku": "X2", "qty": 1} ],
-- "ts": "2026-05-20T10:00:00Z" }
SELECT
raw:user.email::STRING AS email,
raw:user.age::NUMBER AS age,
raw:items[0].sku::STRING AS first_sku,
raw:ts::TIMESTAMP_NTZ AS event_ts,
ARRAY_SIZE(raw:items) AS item_count
FROM events
WHERE raw:user.age::NUMBER >= 18;GET and GET_PATH are function-form equivalents: GET(raw, 'user') is the same as raw:user, and GET_PATH(raw, 'user.email') is the same as raw:user.email. The function form is occasionally useful when the field name is dynamic (passed as a parameter), but the colon notation is shorter and idiomatic for static paths.
Missing paths return SQL NULLrather than raising an error, which makes VARIANT forgiving when fields are optional. If you need to distinguish between "the field is missing" and "the field is JSON null", use IS_NULL_VALUE(raw:user.email) for the JSON-null case versus the SQL IS NULL operator for missing-or-null.
For arrays, LATERAL FLATTEN turns a one-row-with-array result into a multi-row result:
SELECT
raw:user.email::STRING AS email,
f.value:sku::STRING AS sku,
f.value:qty::NUMBER AS qty,
f.index AS position
FROM events,
LATERAL FLATTEN(input => raw:items) f
WHERE raw:ts::DATE = '2026-05-20';f.value is the array element, f.index is its zero-based position, and f.this is the source array (useful for context). Add outer => TRUE to keep parent rows where the array is empty or null — equivalent to a LEFT JOIN against the unnested elements.
Schema-on-read: AUTOMATIC_CLUSTERING and Search Optimization
Schema-on-read means you load JSON as-is and let queries decide which fields to extract. For an exploratory or rarely-queried payload, that is fine — VARIANT scans are reasonably fast on their own. For tables where the same VARIANT path is filtered on millions of rows per day, two features dramatically improve performance: AUTOMATIC_CLUSTERING and the Search Optimization Service.
-- Cluster by a VARIANT path expression
ALTER TABLE events
CLUSTER BY (raw:event_type::STRING, raw:ts::DATE);
-- Enable Search Optimization for point lookups on a VARIANT path
ALTER TABLE events
ADD SEARCH OPTIMIZATION ON EQUALITY(raw:user.email);
-- Check whether the optimization is helping
SELECT SYSTEM$ESTIMATE_SEARCH_OPTIMIZATION_COSTS('events');AUTOMATIC_CLUSTERING physically reorganizes micro-partitions in the background so rows with similar cluster-key values land near each other. Queries that filter on the cluster key prune the scan to a small fraction of the table. The key can be a VARIANT path expression directly — no extracted column needed.
Search Optimization Service builds a per-table index for point lookups (WHERE raw:user.email = 'a@b.com') and string substring searches. It supports VARIANT paths via EQUALITY(raw:path), SUBSTRING(raw:path), and GEO targets. The trade is a constant background cost in credits to maintain the index — only enable it on fields that are queried frequently and filtered to small result sets.
For comparison with other warehouses, see the BigQuery comparison, Redshift comparison, and ClickHouse comparison — each picks a different balance between schema-on-read flexibility and typed-column performance.
JSON validation with TRY_PARSE_JSON, IS_OBJECT, IS_ARRAY
PARSE_JSON converts a string to a VARIANT and raises an error if the string is not valid JSON. TRY_PARSE_JSON is the safe variant — it returns SQL NULL on parse failure instead of failing the query. That makes it the right tool for validating incoming payloads inline.
-- Filter rows whose payload is not valid JSON
SELECT id, raw_string
FROM staging_events
WHERE TRY_PARSE_JSON(raw_string) IS NULL;
-- Type-check the parsed value
SELECT
id,
IS_OBJECT(TRY_PARSE_JSON(raw_string)) AS is_obj,
IS_ARRAY(TRY_PARSE_JSON(raw_string)) AS is_arr,
IS_NULL_VALUE(TRY_PARSE_JSON(raw_string)) AS is_json_null,
TYPEOF(TRY_PARSE_JSON(raw_string)) AS json_type
FROM staging_events;
-- Quarantine pattern: split good and bad rows in one INSERT
INSERT ALL
WHEN parsed IS NOT NULL THEN INTO events (raw) VALUES (parsed)
WHEN parsed IS NULL THEN INTO events_quarantine (raw_string) VALUES (raw_string)
SELECT raw_string, TRY_PARSE_JSON(raw_string) AS parsed
FROM staging_events;The companion functions cover the rest of the type-check space: IS_OBJECT, IS_ARRAY, IS_BOOLEAN, IS_INTEGER, IS_DECIMAL, IS_DOUBLE, IS_VARCHAR, and IS_NULL_VALUE (which detects JSON null distinct from a missing field). TYPEOF(v) returns the type as a string — useful for grouping or building diagnostics.
For at-rest validation against a JSON Schema, Snowflake does not ship a built-in validator — most teams either validate at the producer (before the file lands in the stage) or call out to a Python UDF that uses jsonschema. Pre-validating with Jsonic's JSON Validator catches the structural issues — quoting, brackets, trailing commas — before the file ever reaches Snowflake.
Cost & performance: VARIANT vs structured columns
Storage cost for VARIANT is close to typed columns in most workloads — Snowflake encodes VARIANT in a columnar binary format and applies the same compression as native types. A million rows of {"id": 1, "name": "..."} stored as VARIANT take roughly the same disk space as the same data in two typed columns. The gap shows up in query time: reading v:name::STRING is typically 1.5x to 3x slower than reading a native STRING column of the same value, because Snowflake decodes the VARIANT byte stream and walks the path expression per row.
| Approach | Storage | Read latency | Schema flexibility | Best for |
|---|---|---|---|---|
| Pure VARIANT | ~same as typed | 1.5x–3x typed | Full — any shape | Exploratory, rarely-queried, source-of-truth blobs |
| VARIANT + extracted columns | Slightly more | Typed for hot fields, VARIANT for the rest | Full | Production analytics — the common pattern |
| Fully typed (no VARIANT) | Lowest | Fastest | Brittle — schema changes need ALTER TABLE | Stable contracts (curated marts) |
| VARIANT + Search Optimization | +index overhead (credits) | Near-typed for indexed paths | Full | Point lookups on JSON fields at high QPS |
The standard production pattern is the second row: keep the full payload as VARIANT so nothing is lost, and extract the few hot fields into typed columns at load time using a transforming COPY INTO. That way hot queries are typed-column fast and exploratory queries still have the full payload to work with.
-- Transforming COPY INTO: typed columns + raw VARIANT in one pass
COPY INTO events (event_id, user_email, event_ts, raw)
FROM (
SELECT
$1:event_id::NUMBER,
$1:user.email::STRING,
$1:ts::TIMESTAMP_NTZ,
$1
FROM @my_stage (FILE_FORMAT => my_json)
)
ON_ERROR = CONTINUE;Compute cost on VARIANT is dominated by the warehouse size and the number of micro-partitions scanned, not by VARIANT decoding overhead per se — clustering on VARIANT paths plus Search Optimization for point lookups close most of the gap to typed columns at the cost of background credits.
Key terms
- VARIANT
- Snowflake's semi-structured type — a single column that can hold any JSON value up to 16 MB compressed per row, stored in a columnar binary format internally.
- stage
- A Snowflake-known location where files live before they become rows. Internal stages use Snowflake-managed storage; external stages point at S3, Azure Blob, or Google Cloud Storage.
- file format
- A reusable parsing configuration referenced by
COPY INTO. For JSON the key fields areTYPE = JSON,STRIP_OUTER_ARRAY,STRIP_NULL_VALUES, andCOMPRESSION. - STRIP_OUTER_ARRAY
- A file-format flag that tells the loader to walk a top-level JSON array and emit one row per element, instead of treating the whole array as one row.
- LATERAL FLATTEN
- A table function that expands a VARIANT array (or object keys) into a multi-row result joinable with the parent row. Produces
value,index,this, andseqcolumns. - Search Optimization Service
- An optional per-table index for point lookups and substring matches, including over VARIANT path expressions. Adds background credit cost in exchange for near-typed-column read latency on indexed paths.
- TRY_PARSE_JSON
- The safe form of
PARSE_JSON— returns SQLNULLon parse failure instead of raising an error, suitable for inline validation of staged or in-flight payloads.
Frequently asked questions
What's the difference between VARIANT, OBJECT, and ARRAY in Snowflake?
VARIANT is the general-purpose semi-structured type that can hold any JSON value — an object, an array, a string, a number, a boolean, or null. OBJECT is a specialized VARIANT that only accepts JSON objects (key/value maps), and ARRAY is a specialized VARIANT that only accepts JSON arrays. All three are stored the same way internally — Snowflake encodes them in a columnar binary format derived from the underlying values, so query performance is similar. The reason to pick OBJECT or ARRAY over VARIANT is type discipline: a column declared OBJECT will reject a row whose value is a bare string or number at insert time, catching schema drift earlier. In practice most teams use VARIANT for incoming JSON because the source data is rarely as constrained as a strict OBJECT contract would require, and reach for OBJECT or ARRAY only when modeling a known-shape field.
How do I load a JSON file containing an array of objects?
Set STRIP_OUTER_ARRAY = TRUE on the file format. By default Snowflake treats each JSON document in a staged file as one row, so a file containing a top-level array becomes a single row with an ARRAY-valued VARIANT. With STRIP_OUTER_ARRAY enabled the loader walks the outer array and emits one row per element instead, which is what you want for an export like [{"id":1},{"id":2},{"id":3}]. Create the format with CREATE FILE FORMAT my_json TYPE = JSON STRIP_OUTER_ARRAY = TRUE, then run COPY INTO target FROM @my_stage FILE_FORMAT = my_json. The alternative source layout is JSONL (newline-delimited JSON) — one object per line — which loads as one row per line without any strip flag. JSONL is often the better choice for very large exports because Snowflake can parallelize the load.
Why am I getting one row per file instead of one row per JSON object?
The file contains a top-level JSON array and STRIP_OUTER_ARRAY is not set on the file format. Snowflake reads the entire array as a single JSON value and writes it to one row. Two fixes: (1) recreate the file format with STRIP_OUTER_ARRAY = TRUE so the loader expands the outer array into one row per element, or (2) re-export the source as JSONL with one object per line and skip the strip flag entirely. To inspect what is happening, run SELECT $1 FROM @my_stage (FILE_FORMAT => my_json) LIMIT 5 — that shows you exactly what each row looks like before COPY INTO. If you see one giant array per row, the strip flag is the issue; if you see one object per row already, the problem is elsewhere (often a target table column type mismatch or an ON_ERROR setting silently skipping rows).
How do I query a nested field in a VARIANT column?
Use colon-and-dot path notation followed by a type cast. If a VARIANT column v holds {"user": {"email": "a@b.com", "age": 30}}, you can read the email with v:user.email::STRING and the age with v:user.age::NUMBER. The colon introduces the path expression and dots descend into nested objects. Array indexes go in brackets: v:items[0].name::STRING. The ::TYPE cast at the end is important — without it Snowflake returns a VARIANT, and joins, comparisons, and aggregations work better with proper SQL types. The GET and GET_PATH functions are equivalent but more verbose: GET_PATH(v, 'user.email')::STRING gives the same result. Missing paths return SQL NULL rather than raising an error, so v:does.not.exist::STRING is safe to reference — useful for tolerating optional fields in semi-structured data.
How does FLATTEN work on JSON arrays?
LATERAL FLATTEN expands an array (or the keys of an object) into a multi-row result set that you can join back to the parent row. The syntax is SELECT v:order_id, f.value:sku::STRING FROM orders, LATERAL FLATTEN(input => v:items) f. For each row in orders, FLATTEN produces one extra row per element in v:items; f.value holds the element itself, f.index holds its position, f.this holds the original array, and f.seq is a per-input sequence number. The OUTER keyword (LATERAL FLATTEN(input => v:items, outer => TRUE)) keeps parent rows where the array is empty or null instead of dropping them, which matters for LEFT-JOIN-like semantics. FLATTEN also walks object keys when you set the PATH option, and it can recurse with RECURSIVE => TRUE — useful for tree-shaped JSON like nested categories.
Can Snowflake infer a schema from JSON?
Yes — INFER_SCHEMA reads a staged file and returns the columns, types, and nullability it detects. SELECT * FROM TABLE(INFER_SCHEMA(LOCATION => '@my_stage', FILE_FORMAT => 'my_json')) gives a result set you can feed into CREATE TABLE ... USING TEMPLATE for an automatic schema. This is convenient for one-off loads or for getting a starting point, but the inferred schema reflects only the sample Snowflake actually read — fields that are missing or null in the sample will not appear, and rare types will be widened to the most permissive option. For production loads you typically keep one VARIANT column for the full payload plus a few extracted typed columns for the fields you query most, rather than auto-inferring every field as its own column. INFER_SCHEMA shines as a starting point, not a final design.
How much does it cost to store and query JSON vs typed columns?
Storage cost for VARIANT is similar to typed columns in most cases — Snowflake encodes VARIANT in a columnar binary format and compresses the same way it does for native types, so a million rows of {"id": 1, "name": "..."} stored as a VARIANT takes roughly the same disk space as the same data in two typed columns. Query cost is where the gap shows: reading v:name::STRING from VARIANT is typically 1.5x to 3x slower than reading a native STRING column of the same value, because Snowflake has to decode the VARIANT byte stream and walk the path expression for every row. For frequently queried fields, the standard pattern is to extract them at load time into typed columns alongside the raw VARIANT — your COPY INTO writes both — so hot queries read the typed columns and cold or exploratory queries still have the full payload available. AUTOMATIC_CLUSTERING and Search Optimization both work on VARIANT paths.
How do I extract a JSON value as a specific type (CAST or :: notation)?
Use the :: shorthand or the explicit CAST function — both work identically. v:user.age::NUMBER and CAST(v:user.age AS NUMBER) produce the same result. The supported target types include STRING (or VARCHAR), NUMBER (or its aliases INT, BIGINT, DECIMAL), FLOAT, BOOLEAN, DATE, TIMESTAMP_NTZ, TIMESTAMP_LTZ, and TIMESTAMP_TZ. If the JSON value cannot be coerced — for example, casting "not a number" to NUMBER — Snowflake raises an error and the query fails. Use TRY_CAST instead for safe casting that returns NULL on failure: TRY_CAST(v:user.age AS NUMBER). String values that look like numbers in JSON ("42" vs 42) need an explicit cast because VARIANT preserves the original type; the :: operator handles the conversion. For dates and timestamps, the JSON value must match the configured input format or you must supply one explicitly with TO_TIMESTAMP(v:created_at::STRING, 'YYYY-MM-DD HH24:MI:SS').
Further reading and primary sources
- Snowflake Docs — Loading JSON — End-to-end tutorial for stages, file formats, and COPY INTO with JSON
- Snowflake Docs — VARIANT, OBJECT, ARRAY — Reference for the semi-structured types and their storage characteristics
- Snowflake Docs — LATERAL FLATTEN — Full syntax for unnesting arrays and object keys into rows
- Snowflake Docs — Search Optimization Service — Indexed point lookups and substring matches on VARIANT paths
- Snowflake Docs — COPY INTO <table> — Authoritative reference for COPY INTO flags including STRIP_OUTER_ARRAY and ON_ERROR