JSON to Amazon Redshift: SUPER Type, COPY JSON, and Redshift Spectrum
Last updated:
Amazon Redshift handles JSON through three distinct paths: the SUPER data type for semi-structured columns inside the warehouse, the COPY ... FORMAT JSON command for bulk loading from S3, and Redshift Spectrum for querying JSON files in S3 as external tables without loading them at all. The query language across all three is PartiQL — a SQL extension that adds dot notation for object fields and bracket indexing for arrays. This guide walks through each path with working DDL and COPY commands, covers the jsonpath file format for non-trivial mappings, explains the 1 MB per-value SUPER ceiling and when to break out flat columns, and compares Redshift against Snowflake VARIANT and BigQuery JSON so you can pick the right warehouse for JSON-heavy workloads.
Before COPY chokes on a malformed file, sanity-check the JSON shape locally. Paste a sample record into Jsonic's JSON Validator — it flags trailing commas, unquoted keys, and UTF-8 BOMs that Redshift rejects with cryptic Invalid JSON object errors.
SUPER data type: PartiQL semi-structured storage
SUPERis Redshift's native column type for semi-structured data. A SUPER value can hold an object, array, scalar, or null — anything that round-trips through JSON — and is queried with PartiQL extensions to standard SQL. The per-value size limit is 1 MB of serialized data, which covers nearly all event payloads but rules out storing entire documents or large image metadata blobs in a single row.
-- Create a table with a SUPER column for semi-structured payloads
CREATE TABLE events (
event_id BIGINT NOT NULL,
user_id BIGINT NOT NULL,
event_type VARCHAR(64) NOT NULL,
occurred_at TIMESTAMPTZ NOT NULL,
payload SUPER
)
DISTKEY (user_id)
SORTKEY (occurred_at);Redshift stores SUPER values in a self-describing binary format derived from the input JSON. Reads do not re-parse text on every query — the format encodes field names and types once, then references them positionally. That keeps the storage compact but does not give SUPER the same automatic per-field pruning that Snowflake VARIANT or BigQuery JSON offer: a query touching payload.user.email still scans the surrounding bytes of each SUPER value, just without re-parsing the text.
The pragmatic rule: SUPER shines when the schema varies between rows or when you want a faithful staging area for raw events. For hot analytic queries that always touch the same fields, promote those fields to typed columns alongside SUPER and let the optimizer use sort keys and zone maps.
COPY JSON with jsonpath file vs auto
The COPY ... FORMAT JSON command has two modes. 'auto' tells Redshift to match top-level JSON keys to column names case-insensitively — the shortest path when the JSON shape mirrors your table. 's3://.../jsonpath.json' points to a separate JSONPath file that explicitly lists which JSON paths feed which columns, in column order. Use the jsonpath form whenever names diverge, fields are nested, or you only want a subset of keys.
-- Mode 1: COPY with FORMAT JSON 'auto' — keys must match column names
COPY events
FROM 's3://my-bucket/events/2026/05/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoad'
FORMAT JSON 'auto'
GZIP
TIMEFORMAT 'auto';
-- Mode 2: COPY with explicit jsonpath file
COPY events_flat (event_id, user_email, first_sku, occurred_at)
FROM 's3://my-bucket/events/2026/05/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoad'
FORMAT JSON 's3://my-bucket/config/events-jsonpath.json'
GZIP;The jsonpath file itself is a single JSON document with one entry per target column, in the same order as the column list in the COPY command:
{
"jsonpaths": [
"$.id",
"$.user.email",
"$.items[0].sku",
"$.occurred_at"
]
}Compression is detected automatically from the file extension — .gz, .bz2, and .zst all work without explicit flags, though the GZIP/BZIP2/ZSTD hints in the COPY statement make intent obvious to readers. Both NDJSON (one object per line) and JSON-array files are accepted; Redshift figures out which by sniffing the first non-whitespace character.
Loading JSON from S3 with IAM role
Redshift never touches S3 with your IAM user credentials directly. Instead, you create an IAM role that trusts redshift.amazonaws.com as a service principal, attach a policy granting at minimum s3:GetObject on the source bucket and s3:ListBucket on the bucket itself, then associate the role with your provisioned cluster or serverless workgroup. The COPY command references the role by ARN — Redshift assumes it at load time.
-- Trust policy on the role (allows Redshift to assume it)
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": { "Service": "redshift.amazonaws.com" },
"Action": "sts:AssumeRole"
}]
}
-- Permissions policy attached to the role
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": ["s3:GetObject"],
"Resource": "arn:aws:s3:::my-bucket/events/*"
},
{
"Effect": "Allow",
"Action": ["s3:ListBucket"],
"Resource": "arn:aws:s3:::my-bucket",
"Condition": {
"StringLike": { "s3:prefix": ["events/*"] }
}
}
]
}Keep individual JSON files between 1 MB and 1 GB after compression. Redshift loads in parallel across cluster slices — too-small files waste slice startup overhead, too-big files prevent parallelism. A common pattern is hourly NDJSON dumps gzip-compressed to roughly 100 MB per file under an S3 prefix partitioned by date. For wider patterns that span beyond Redshift itself, see our notes on AWS JSON patterns covering Kinesis Firehose, Glue, and Athena.
Querying SUPER: dot notation, [] indexing, UNPIVOT
PartiQL adds two operators to standard SQL for navigating SUPER values: dot notation for object fields and square brackets for array indexing. Both compose freely. A path that does not exist on a particular row returns NULL rather than erroring, so queries against ragged JSON do not need defensive IS NOT NULL checks until the final filter.
-- Given payload = {"user": {"email": "a@b.com"}, "items": [{"sku": "X"}, {"sku": "Y"}]}
-- Dot notation for object fields
SELECT
payload.user.email::VARCHAR AS email,
payload.items[0].sku::VARCHAR AS first_sku,
payload.items[1].sku::VARCHAR AS second_sku
FROM events
WHERE event_type = 'order_placed';
-- UNPIVOT a SUPER object: one row per key/value pair
SELECT event_id, attr_key, attr_value
FROM events e, UNPIVOT e.payload AS attr_value AT attr_key
WHERE attr_value IS NOT NULL;
-- Casting SUPER scalars for predicate pushdown
SELECT COUNT(*)
FROM events
WHERE payload.user.country::VARCHAR = 'DE';UNPIVOT turns a SUPER object into rows of (key, value) pairs — useful when keys themselves are data (think attribute bags or feature flags). The inverse UNNEST (covered in the next section) does the same for arrays.
Cast SUPER scalars to typed values with the ::TYPEsyntax whenever the result feeds a predicate, join, or aggregate that benefits from typed semantics. Without a cast, comparisons treat SUPER scalars dynamically, which works but can surprise the optimizer's row estimates.
PartiQL extensions: NEST and UNNEST
PartiQL extends SQL with first-class support for nested collections. UNNEST (spelled as a comma-separated FROM clause in Redshift) expands an array inside a SUPER value into one row per element. NEST (via OBJECT() and ARRAY() constructors) does the reverse — builds SUPER values out of relational data.
-- UNNEST an array column: one row per item
SELECT
e.event_id,
e.user_id,
i.sku::VARCHAR AS sku,
i.qty::INTEGER AS qty,
i.price::DECIMAL(10,2) AS price
FROM events AS e,
e.payload.items AS i
WHERE e.event_type = 'order_placed'
AND e.occurred_at >= DATE '2026-05-01';
-- NEST relational rows back into a SUPER array
SELECT
user_id,
ARRAY(
OBJECT(
'event_id', event_id,
'event_type', event_type,
'at', occurred_at
)
) AS recent_events
FROM events
WHERE occurred_at >= DATEADD(day, -7, CURRENT_DATE)
GROUP BY user_id;UNNEST in the FROM clause behaves like an implicit lateral join — each outer row produces zero or more inner rows depending on the array length. An empty array produces zero rows (the outer row is dropped); use a LEFT JOIN spelling with UNNEST if you want to preserve empty-array rows.
NEST is the building block for round-tripping data through JSON: aggregate relational rows, wrap them as SUPER objects, and either store them or hand them to a client that expects nested output. The JSON in databases guide compares this pattern across Postgres, MySQL, and warehouse engines.
Schema enforcement: typed columns vs SUPER trade-offs
The decision between SUPER and typed columns is rarely binary. Most production Redshift schemas land on a hybrid: SUPER for raw fidelity, typed columns for query speed on the fields that matter. Run a scheduled job that promotes high-traffic fields from SUPER into a flat columnar table, and keep the SUPER table as the system-of-record for backfills and audits.
| Concern | Typed columns | SUPER |
|---|---|---|
| Compression ratio | Best — per-column dictionaries | Lower — self-describing per-value format |
| Zone maps / sort key pruning | Yes | Limited — works on outer wrapper, not nested fields |
| Schema evolution | Requires ALTER TABLE | Free — new keys just appear |
| Per-value size limit | Column-dependent (e.g., VARCHAR(65535)) | 1 MB per row |
| Query syntax | Plain SQL | PartiQL extensions (dot, [], UNNEST) |
| Best for | Stable, hot, repeatedly queried fields | Variable shape, staging, audit fidelity |
A pattern that works well: load every event into events_raw (event_id, occurred_at, payload SUPER), then run a nightly INSERT INTO events_flat SELECT event_id, occurred_at, payload.user.email::VARCHAR, payload.items[0].sku::VARCHAR FROM events_raw to populate a flat table indexed on the fields product analytics actually queries. Two tables, one source of truth, predictable query speed.
Redshift Spectrum: querying JSON in S3 without loading
Redshift Spectrum lets you query JSON files in S3 directly, without running COPY. You register the data as an external table in an external schema (backed by AWS Glue Data Catalog), declare the columns and JSON SerDe, and queries against the external table fan out to a separate Spectrum compute pool that scans S3, with results streamed back through your cluster. Cost is per-byte-scanned (currently around $5 per TB) on top of the cluster's normal compute cost.
-- One-time setup: external schema pointing at a Glue database
CREATE EXTERNAL SCHEMA spectrum_events
FROM DATA CATALOG
DATABASE 'events_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrum'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
-- Register a JSON file set as an external table
CREATE EXTERNAL TABLE spectrum_events.raw_events (
event_id BIGINT,
user_id BIGINT,
event_type VARCHAR(64),
occurred_at TIMESTAMP,
payload SUPER
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'ignore.malformed.json' = 'true',
'dots.in.keys' = 'false'
)
LOCATION 's3://my-bucket/events-raw/'
TABLE PROPERTIES ('compression_type' = 'gzip');
-- Query just like a regular table
SELECT
COUNT(*) AS n,
payload.user.country::VARCHAR AS country
FROM spectrum_events.raw_events
WHERE occurred_at >= DATE '2026-05-01'
GROUP BY 2
ORDER BY n DESC;The Spectrum SerDe is org.openx.data.jsonserde.JsonSerDe — the same one Athena uses for JSON tables. The ignore.malformed.json property keeps one bad row from failing an entire scan; turn it off when correctness matters more than throughput.
Spectrum suits three workloads: exploratory queries against large JSON dumps where loading would be wasteful, federated queries that join S3 data with cluster tables, and audits where you want raw fidelity without paying for warehouse storage. For repeated queries, COPY into a SUPER or flat table once and query the loaded version.
Comparing Redshift JSON to BigQuery JSON and Snowflake VARIANT
All three warehouses support semi-structured JSON natively, but the design choices diverge in ways that matter for cost, query syntax, and storage limits.
| Feature | Redshift SUPER | Snowflake VARIANT | BigQuery JSON |
|---|---|---|---|
| Max value size | 1 MB | 16 MB (compressed) | 10 MB per row, 1 GB per column total |
| Per-field column pruning | Limited | Automatic (columnarization) | Automatic (sub-column extraction) |
| Field access syntax | Dot: payload.user.email | Colon: payload:user:email | Dot: payload.user.email |
| Cast operator | ::VARCHAR | ::STRING or :user::STRING | STRING(payload.user.email) |
| External JSON in object storage | Spectrum (S3 + Glue) | External Tables (S3/GCS/Azure) | External Tables (GCS) |
| JSON parsing function | JSON_PARSE | PARSE_JSON | PARSE_JSON / JSON_VALUE |
| Best fit | Existing Redshift estate, mixed structured + JSON | JSON-heavy with very large per-row payloads | Multi-tenant event analytics, ad-hoc nested queries |
For deeper coverage of the alternatives, see the BigQuery comparison, Snowflake comparison, and ClickHouse comparison guides. If your workload sits below 1 MB per record and you already pay for a Redshift cluster, SUPER is the path of least resistance. Above 1 MB, or for very wide nested schemas, VARIANT or BigQuery JSON give you more room and stronger automatic pruning.
JSON_PARSE and ad-hoc SUPER construction
JSON_PARSE converts a VARCHAR containing JSON text into a SUPER value. Use it when JSON arrives as a string column (from a CSV load, a webhook payload column, or a temporary staging table) and you want to query its structure without round-tripping through S3.
-- Convert a VARCHAR column to SUPER and query nested fields in one step
SELECT
event_id,
JSON_PARSE(payload_text).user.email::VARCHAR AS email
FROM staging_events
WHERE event_type = 'order_placed';
-- Persist the parsed value into a SUPER column
INSERT INTO events (event_id, occurred_at, payload)
SELECT event_id, occurred_at, JSON_PARSE(payload_text)
FROM staging_events;
-- Construct a SUPER inline (useful for tests and CTE seeds)
SELECT JSON_PARSE('{"user":{"email":"a@b.com"},"items":[{"sku":"X"}]}').items[0].sku::VARCHAR;JSON_PARSE raises an error on malformed input — wrap it in NULLIF or a CASE expression if your staging data may contain garbage you would rather skip than fail. The inverse, JSON_SERIALIZE, turns a SUPER value back into a VARCHAR for export or human-readable display.
Key terms
- SUPER
- Redshift's native column type for semi-structured data — JSON objects, arrays, scalars. Up to 1 MB per value. Queried with PartiQL extensions to SQL.
- PartiQL
- A SQL-superset query language that adds dot notation for object fields and bracket indexing for arrays. The lingua franca for SUPER columns, Spectrum external tables, and DynamoDB queries.
- jsonpath file
- A small JSON document referenced by
COPY ... FORMAT JSON 's3://.../jsonpath.json'. It maps each target column (in column order) to a JSONPath expression extracted from each source record. Required when JSON keys do not match column names or when fields need flattening. - COPY FORMAT JSON
- The Redshift bulk-load command for JSON files in S3. Accepts
'auto'for name-matched loads or an S3 URL pointing at a jsonpath file for explicit mapping. Handles gzip, bzip2, and zstd compression automatically. - Redshift Spectrum
- A query engine that scans data in S3 directly as external tables, without loading into the cluster. JSON files are registered via
CREATE EXTERNAL TABLEwith theorg.openx.data.jsonserde.JsonSerDeSerDe. Billed per byte scanned. - JSON_PARSE
- The Redshift function that converts a VARCHAR containing JSON text into a SUPER value. Inverse of
JSON_SERIALIZE. Useful for staging tables, ad-hoc parsing, and inline test data. - UNNEST
- The PartiQL pattern for expanding an array inside a SUPER value into one row per element. Written in Redshift as a comma-joined FROM clause:
FROM events e, e.payload.items i.
Frequently asked questions
What is the SUPER data type in Redshift?
SUPER is Amazon Redshift’s native column type for storing semi-structured data — JSON objects, arrays, and scalars — without flattening them into typed columns first. A SUPER value can hold up to 1 MB of serialized data per row and is queried with PartiQL, the SQL extension that supports dot notation for object fields and bracket indexing for arrays. Internally Redshift stores SUPER values in a self-describing binary format derived from the input JSON, so reads do not re-parse text on every query. You create a SUPER column the same way as any other column (CREATE TABLE events (id BIGINT, payload SUPER)), load values either with COPY FORMAT JSON or by calling JSON_PARSE on a VARCHAR, and query with payload.user.email or payload.items[0].sku. SUPER is the right choice when shape varies between rows or when downstream schemas are unstable.
How do I load a JSON file into Redshift?
The standard path is COPY from S3 with FORMAT JSON. Upload your JSON file (or files) to an S3 bucket, attach an IAM role to your Redshift cluster or serverless workgroup that has s3:GetObject on the bucket, then run COPY mytable FROM 's3://bucket/path/' IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoad' FORMAT JSON 'auto'. The 'auto' mode matches top-level JSON keys to column names case-insensitively — useful when the JSON shape mirrors your table. For mismatched names or nested extraction into typed columns, replace 'auto' with the S3 URL of a jsonpath file that lists JSONPath expressions one per column. Both newline-delimited JSON (NDJSON, one object per line) and JSON arrays are supported. COPY natively handles gzip, bzip2, and Zstandard-compressed files based on the file extension.
What's a jsonpath file and when do I need one?
A jsonpath file is a small JSON document that maps each target column to a JSONPath expression pulled out of the source record. Its shape is {"jsonpaths": ["$.id", "$.user.email", "$.items[0].sku"]} — one entry per column in your target table, in column order. You upload it to S3 and reference it in the COPY command as FORMAT JSON 's3://bucket/path/jsonpath.json'. Use a jsonpath file when JSON field names do not match column names (common with vendor exports), when you want to flatten nested fields into top-level typed columns at load time, when you need to pick a specific array element, or when the source has more keys than you want to store. If your JSON keys already match column names and you do not need flattening, FORMAT JSON 'auto' is simpler — no extra file to maintain.
Why is my JSON load failing with 'Invalid JSON object'?
Three causes cover most cases. First, the file is a JSON array of objects but COPY is reading it as NDJSON (or vice versa) — Redshift handles both, but the data must actually be valid JSON either way. Run the file through a strict validator before retrying. Second, character encoding: Redshift expects UTF-8 without a byte-order mark; UTF-16 or UTF-8-with-BOM files fail at the first character. Re-encode with iconv or a text editor that saves UTF-8 without BOM. Third, embedded literal newlines inside string values when the file is treated as NDJSON — every newline must terminate a record, so a newline inside a quoted string breaks the parser. Switch to single-line JSON per record or use a JSON array file. The STL_LOAD_ERRORS system view shows the exact offending line and column for any rejected row.
How do I query a nested field in a SUPER column?
PartiQL uses dot notation for object fields and square brackets for array indexing, with no JSON_EXTRACT call needed. For a SUPER column named payload holding {"user": {"email": "a@b.com"}, "items": [{"sku": "X"}, {"sku": "Y"}]}, SELECT payload.user.email, payload.items[0].sku FROM events returns the email and the first item SKU. Path navigation auto-unnests arrays in some contexts; for explicit unnesting use FROM events AS e, e.payload.items AS i, which expands each array element into its own row. Cast SUPER scalars to typed values with payload.user.email::VARCHAR when you need a fixed-width string or for predicate pushdown. Missing paths return NULL instead of erroring — use IS NOT NULL to filter out rows where a nested field is absent.
Can Redshift Spectrum query JSON without loading it?
Yes. Redshift Spectrum lets you query JSON files in S3 directly as external tables, with no COPY step. You register the data with CREATE EXTERNAL TABLE in an external schema (backed by AWS Glue Data Catalog or a Hive metastore), declaring the columns and pointing ROW FORMAT SERDE at org.openx.data.jsonserde.JsonSerDe with the file LOCATION set to the S3 prefix. Queries against the external table fan out to a separate Spectrum compute pool that scans S3 directly, with results streamed back through your Redshift cluster. Cost is per byte scanned from S3 (currently $5 per TB) plus the Redshift cluster’s usual cost. Spectrum suits exploratory analytics on large JSON dumps, federated queries that join S3 data with cluster tables, and one-off audits where loading would be wasteful.
Should I use SUPER or break JSON into flat columns?
Use flat typed columns when the schema is stable, when the data is hot for analytics, and when query patterns hit a small set of fields repeatedly — typed columns compress better, scan faster, and let the optimizer pick zone maps and sort keys effectively. Use SUPER when shape varies between rows (multi-tenant event payloads, vendor exports with optional fields), when downstream consumers care about full fidelity, or when you need a staging area before deciding which fields to promote. The pragmatic pattern is hybrid: COPY raw JSON into a SUPER column, run a scheduled INSERT INTO ... SELECT that pulls the high-traffic fields into a flat columnar table for fast queries, and keep the SUPER table around for backfills, audits, and rare ad-hoc questions. Best of both — predictable query performance plus full historical fidelity.
How does Redshift SUPER compare to Snowflake VARIANT?
Both are columnar warehouse JSON types, but with meaningful differences. Snowflake VARIANT can hold up to 16 MB per value (compressed); Redshift SUPER caps at 1 MB. Snowflake parses VARIANT into a fully columnar storage layout that auto-indexes every JSON field, so most queries skip irrelevant fields without explicit pruning. Redshift SUPER stores a self-describing binary blob and benefits less from automatic field-level pruning, so queries that touch large SUPER values can scan more bytes than the equivalent VARIANT query. On the query language side, VARIANT uses Snowflake-flavored SQL (colon syntax: payload:user:email) while SUPER uses PartiQL dot notation (payload.user.email). For workloads under the 1 MB ceiling, SUPER is plenty; above 1 MB or for very wide nested schemas, VARIANT scales further. The BigQuery comparison and ClickHouse JSON also offer alternative trade-offs.
Further reading and primary sources
- Amazon Redshift Docs — SUPER type — Authoritative reference for the SUPER data type, size limits, and storage format
- Amazon Redshift Docs — COPY from JSON — COPY FORMAT JSON modes, jsonpath file format, and supported compression codecs
- Amazon Redshift Docs — Querying semistructured data — PartiQL dot notation, bracket indexing, UNNEST, UNPIVOT, and casting SUPER scalars
- Amazon Redshift Docs — Spectrum external tables — CREATE EXTERNAL TABLE for JSON in S3 with the OpenX JsonSerDe
- PartiQL specification — The query language spec underlying SUPER queries in Redshift and several other AWS services