Elasticsearch JSON: Index, Query, and Map Documents

Q: What is the difference between a match query and a term query in Elasticsearch?

A match query analyzes the query string with the same analyzer that was applied to the field at index time. By default that is the standard analyzer, which tokenizes and lowercases the input but does not remove stop words unless you configure it to. This makes match ideal for full-text fields of type text. A term query is exact-match and bypasses the analyzer entirely, making it correct for keyword, number, boolean, and date fields. If you use a term query on a text field, you will almost always get zero results because Elasticsearch stored the analyzed (lowercased, tokenized) form, not the original string. For example, a term query for "Quick Brown Fox" against an analyzed text field returns nothing because the indexed tokens are "quick", "brown", and "fox". Conversely, a match query on a keyword field only matches the complete stored value, because keyword fields are not analyzed and the query string is therefore not tokenized either. The general rule: use match for human-readable text search, use term for structured values like IDs, status codes, and tags.

Q: How do I bulk index JSON documents in Elasticsearch?

Use the bulk API at POST /_bulk with an NDJSON body. Each document requires two lines: an action line and a document line, each terminated by a newline character (including the last line of the request). The action line specifies the operation type (index, create, update, or delete) and optional metadata such as _index and _id. For example: {"index":{"_index":"products"}} followed by {"name":"Widget","price":9.99}. Set the request Content-Type to application/x-ndjson; Elasticsearch also accepts application/json on this endpoint, but the body must still be newline-delimited rather than a single JSON array. Ideal batch size is 5 to 15 MB or 1,000 to 5,000 documents per request; larger batches increase memory pressure on the cluster. The response contains an errors boolean at the top level; if true, inspect the items array for per-document result and error fields. Failed items do not abort the batch, so other documents in the same request are still processed. See the NDJSON format guide for details on the line-delimited structure.

Q: What causes a mapping explosion in Elasticsearch?

A mapping explosion happens when dynamic mapping creates a new field entry for every unique JSON key in indexed documents, so the mapping grows with the data instead of with the schema. If your documents contain user-defined metadata or deeply nested objects with thousands of unique keys (such as application event properties, IoT sensor tags, or CRM custom fields), each new key triggers a mapping update that is propagated cluster-wide. Elasticsearch caps this with index.mapping.total_fields.limit (1,000 fields by default) and by default rejects documents that would exceed it; raising that limit to work around the rejections is what lets the mapping consume heap memory on every node and can eventually cause OutOfMemoryError crashes. The fix is to disable dynamic mapping at the index level with "dynamic": "strict" (rejects unknown fields with a 400 error) or "dynamic": false (silently ignores unknown fields without indexing them). Use explicit mappings to define only the fields you need to query. For semi-structured data where you need ad-hoc querying, consider using the flattened field type, which maps an entire JSON object as a single field and indexes its leaf values as keywords without creating individual field mappings for each key.

Q: How do I search nested JSON objects in Elasticsearch?

Use the nested field type in your mapping and the nested query in your search request. By default, Elasticsearch flattens arrays of objects into parallel arrays of values, losing the association between sibling fields within each object. For example, if you index [{"name":"Alice","role":"admin"},{"name":"Bob","role":"viewer"}] as a plain object array, a query for name:Alice AND role:viewer incorrectly matches because the values are flattened independently. The nested type stores each array element as a separate hidden document, preserving field correlations. In your mapping, set "type": "nested" on the field. In your query, use a nested query with a path pointing to the nested field and an inner query block. Note that nested queries are more expensive than regular queries because Elasticsearch must join the hidden inner documents back to the parent. For simple lookups, consider denormalizing the data instead.

Q: Can I use JSON arrays as Elasticsearch fields?

Yes — all Elasticsearch field types implicitly support arrays without any special mapping configuration. A text field declared for a single string value can also store an array of strings such as ["hello","world"]; Elasticsearch indexes each element individually and searches across all values. The same applies to numeric, keyword, date, and boolean fields. The only constraint is that all values in the array must be of the same data type as declared in the mapping — mixing strings and numbers in a single array field is not supported. For arrays of objects where you need to query correlated field values within each object (for example, finding documents where a specific tag has both a name and a value), you must use the nested field type. For arrays of objects where you only need to query individual field values independently, a regular object type is sufficient and more efficient.

Q: How do I paginate through more than 10,000 Elasticsearch results?

Use the search_after parameter with a sort clause. The default from+size pagination is limited by index.max_result_window, which defaults to 10,000 documents. Increasing this limit is not recommended because deep from+size pagination requires Elasticsearch to fetch and discard all documents up to the offset on every shard. Instead, add a deterministic sort to your query (for example a timestamp plus a unique tiebreaker field, since a timestamp alone can repeat and cause skipped or duplicated hits), execute the first page, then pass the last hit's sort values as the search_after array in the next request. Repeat until you receive fewer results than the page size. For a consistent snapshot of the index across pages (important for export or migration use cases), open a point-in-time (PIT) with POST /index/_pit?keep_alive=5m and include the pit.id in each search_after request, sent to /_search without an index name in the path. This ensures pages reflect the same index state even if documents are added or deleted between requests.

Elasticsearch stores and retrieves data as JSON documents. Every index, query, mapping, and cluster setting is expressed in JSON, sent and received over a REST HTTP API. A document indexed to POST /products/_doc is a plain JSON object; a search request to GET /products/_search is also a JSON body. The core JSON types (strings, numbers, booleans, arrays, nested objects) map directly to Elasticsearch data types. With the bulk API, a 3-node cluster can reach on the order of 100,000 indexing operations per second, although the real figure depends on hardware, document size, mappings, and refresh settings. This guide covers indexing documents, writing queries, defining explicit mappings, using the bulk API with NDJSON, and avoiding common pitfalls like dynamic mapping explosions. Use the JSON formatter to inspect and validate document bodies before sending them to the cluster.

Indexing JSON Documents

The fastest way to get a JSON document into Elasticsearch is a single HTTP POST. Elasticsearch generates a unique _id automatically, returns the assigned ID in the response, and makes the document searchable within 1 second by default (controlled by refresh_interval). Every index operation returns 5 key fields: _index, _id, _version, result, and _shards.

# Auto-generate _id
POST /logs/_doc
{
  "level":   "error",
  "message": "disk full",
  "ts":      1715000000
}

# Response
{
  "_index":   "logs",
  "_id":      "abc123XYZ",
  "_version": 1,
  "result":   "created",
  "_shards":  { "total": 2, "successful": 1, "failed": 0 }
}

To supply an explicit _id — useful when the document has a natural key like a product SKU or user UUID — use PUT with the ID in the path. PUT also acts as an upsert: if the document already exists, it is replaced and _version increments. To retrieve a document by ID, send a GET to /index/_doc/id. The response wraps the original JSON under _source. To remove a document, send DELETE to the same path — the response returns "result": "deleted".

# Explicit _id — upserts if already exists
PUT /products/_doc/sku-9001
{
  "name":     "Widget Pro",
  "price":    49.99,
  "in_stock": true
}

# Retrieve by _id
GET /products/_doc/sku-9001
# → { "_source": { "name": "Widget Pro", "price": 49.99, "in_stock": true } }

# Delete by _id
DELETE /products/_doc/sku-9001
# → { "result": "deleted" }
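The choice between POST (auto-generated _id) and PUT (explicit _id) can be sketched in a few lines of Python. This is a minimal illustration using only the standard library: the helper name build_index_request and the base URL http://localhost:9200 are assumptions made for the example, and the request is only built, not sent.

```python
import json
import urllib.parse
import urllib.request


def build_index_request(base_url, index, doc, doc_id=None):
    """Build (but do not send) the HTTP request that indexes one document."""
    if doc_id is None:
        # POST /<index>/_doc lets Elasticsearch generate the _id.
        method, path = "POST", f"/{index}/_doc"
    else:
        # PUT /<index>/_doc/<id> creates or replaces the document with that _id.
        method, path = "PUT", f"/{index}/_doc/{urllib.parse.quote(doc_id, safe='')}"
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )


# Hypothetical local cluster address; adjust for your environment.
req = build_index_request(
    "http://localhost:9200",
    "products",
    {"name": "Widget Pro", "price": 49.99, "in_stock": True},
    doc_id="sku-9001",
)
print(req.get_method(), req.full_url)
```

Passing the returned object to urllib.request.urlopen would send it; keeping construction separate from sending makes the method and path logic easy to check without a running cluster.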

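Elasticsearch auto-creates the index on first write if it does not exist. This is convenient for development, but in production set the cluster setting action.auto_create_index to false to prevent accidental index creation from typos. An auto-created index gets 1 primary shard by default (the default since Elasticsearch 7.0; earlier versions used 5). As a rough guide, a single-node cluster can sustain over 5,000 index operations per second for kilobyte-sized documents, but actual throughput depends on hardware, mappings, and refresh settings.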

Writing JSON Queries

Every Elasticsearch search is a JSON object sent in the request body of a GET or POST to /_search or /index/_search. The top-level query key holds the query clause; aggs holds aggregations; sort controls ordering; and from/size paginate results. Four query types cover the large majority of production use cases: the leaf queries match, term, and range, and the compound bool query that combines them.

# match — full-text search on analyzed text fields
GET /products/_search
{
  "query": {
    "match": {
      "description": {
        "query":                "wireless noise cancelling",
        "minimum_should_match": "75%"
      }
    }
  }
}

# term — exact match for keyword/number/boolean fields
GET /products/_search
{
  "query": { "term": { "status": "active" } }
}

# range — numeric or date range
GET /logs/_search
{
  "query": {
    "range": {
      "ts": { "gte": 1715000000, "lte": 1715086400 }
    }
  }
}

# bool — combine multiple clauses
GET /products/_search
{
  "query": {
    "bool": {
      "must":     [ { "match": { "name": "widget" } } ],
      "filter":   [ { "term":  { "in_stock": true } } ],
      "must_not": [ { "term":  { "status": "discontinued" } } ],
      "should":   [ { "term":  { "featured": true } } ]
    }
  },
  "aggs": {
    "avg_price": { "avg": { "field": "price" } }
  }
}

The filter context inside a bool query is critical for performance: clauses in filter do not compute a relevance score, so Elasticsearch can cache their results across queries. Move any clause that doesn't affect ranking (status checks, date ranges, boolean flags) into filter rather than must. On an index with 10 million documents, this alone can cut query latency by an order of magnitude (for example, from roughly 200 ms to under 20 ms), although the gain depends on how selective the filters are and how often they repeat. For structured field exploration, see JSONPath queries for client-side traversal of the _source objects returned by Elasticsearch.
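One way to keep this discipline in application code is to build the bool query through a small helper that forces callers to say which clauses score and which only filter. The sketch below is illustrative: build_bool_query and its parameter names are hypothetical, and the field names are taken from the examples above.

```python
import json


def build_bool_query(scoring=None, filters=None, excluded=None):
    """Assemble a bool query, keeping non-scoring clauses in filter context."""
    bool_body = {}
    if scoring:
        bool_body["must"] = list(scoring)       # contributes to _score
    if filters:
        bool_body["filter"] = list(filters)     # yes/no match only, no scoring
    if excluded:
        bool_body["must_not"] = list(excluded)  # also evaluated without scoring
    return {"query": {"bool": bool_body}}


body = build_bool_query(
    scoring=[{"match": {"name": "widget"}}],
    filters=[
        {"term": {"in_stock": True}},
        {"range": {"ts": {"gte": 1715000000, "lte": 1715086400}}},
    ],
    excluded=[{"term": {"status": "discontinued"}}],
)
print(json.dumps(body, indent=2))
```

Only the match clause ends up in must here; the stock flag and the time window are pure yes/no conditions, so they belong in filter.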

Aggregations in aggs run alongside the query and return computed metrics (averages, sums, histograms, cardinality) without a separate request. A single query can contain 50+ aggregations — useful for building faceted search UIs where counts, price ranges, and tag clouds all come back in one round trip.

Defining Explicit Mappings

Explicit mappings are the single most important production practice in Elasticsearch. Dynamic mapping is convenient for exploration but causes 3 production problems: it can pick the wrong field type (a string identifier such as "00123" is mapped as text with a keyword subfield rather than as a plain keyword), it allows mapping explosions, and it makes the schema implicit and hard to audit. Define mappings before writing any data using PUT /index with a mappings.properties block.

PUT /products
{
  "settings": {
    "number_of_shards":   3,
    "number_of_replicas": 1,
    "refresh_interval":   "1s"
  },
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "raw": { "type": "keyword" }
        }
      },
      "description": { "type": "text" },
      "status":      { "type": "keyword" },
      "price":       { "type": "float" },
      "in_stock":    { "type": "boolean" },
      "created_at": {
        "type":   "date",
        "format": "strict_date_optional_time"
      },
      "tags": {
        "type":   "nested",
        "properties": {
          "key":   { "type": "keyword" },
          "value": { "type": "keyword" }
        }
      }
    }
  }
}

Multi-fields (the .raw subfield above) let you index a string as both text for full-text search and keyword for exact-match, sorting, and aggregations — without storing the data twice at the document level. Reference the subfield as name.raw in queries and aggregations.

One critical constraint: Elasticsearch cannot change the data type of an existing field in a live index. If you need to change price from float to double, you must create a new index with the corrected mapping and use the reindex API to migrate data: POST /_reindex with source.index and dest.index. Plan your mapping carefully upfront and validate it against your JSON Schema before the first write.
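Before rolling out a mapping change, it can help to check mechanically whether any existing field would change type, since that is the case that forces a reindex. The following sketch compares two mappings.properties blocks; the function names and the index names products_v1 and products_v2 are hypothetical, chosen only for illustration.

```python
def changed_field_types(old_props, new_props, prefix=""):
    """Return {field_path: (old_type, new_type)} for fields whose type differs."""
    changes = {}
    for name, old_def in old_props.items():
        new_def = new_props.get(name)
        if new_def is None:
            continue  # field absent from the new mapping; not a type change
        path = prefix + name
        # A definition without "type" is an object with sub-properties.
        old_type = old_def.get("type", "object")
        new_type = new_def.get("type", "object")
        if old_type != new_type:
            changes[path] = (old_type, new_type)
        # Recurse into object and nested sub-properties.
        changes.update(changed_field_types(
            old_def.get("properties", {}),
            new_def.get("properties", {}),
            path + ".",
        ))
    return changes


def reindex_body(source_index, dest_index):
    """Request body for POST /_reindex."""
    return {"source": {"index": source_index}, "dest": {"index": dest_index}}


old = {"price": {"type": "float"}, "status": {"type": "keyword"}}
new = {"price": {"type": "double"}, "status": {"type": "keyword"}}

changes = changed_field_types(old, new)
print(changes)  # {'price': ('float', 'double')}
if changes:
    print(reindex_body("products_v1", "products_v2"))
```

Fields that only appear in the new mapping are ignored on purpose: adding a field to an existing index does not require a reindex, whereas changing a type does.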

Bulk Indexing with NDJSON

The bulk API at POST /_bulk is the correct tool for any ingestion of more than 1 document at a time. Sending 1,000 documents individually requires 1,000 HTTP round trips; sending them as a single bulk request requires 1. A well-tuned bulk pipeline can sustain over 100,000 documents per second on a 3-node cluster. Each request body is NDJSON format: alternating action lines and document lines, each terminated by a newline character.

POST /_bulk
{"index":{"_index":"products","_id":"sku-1"}}
{"name":"Widget","price":9.99,"in_stock":true}
{"index":{"_index":"products","_id":"sku-2"}}
{"name":"Gadget","price":24.99,"in_stock":false}

The 4 supported action types are: index (create or replace), create (fail if the document already exists, which is useful for deduplication), update (partial update using a doc subkey), and delete (no document line needed). Set the Content-Type header to application/x-ndjson; Elasticsearch also accepts application/json for this endpoint, while an unsupported media type such as text/plain is rejected. The body must end with a newline after the last line.
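Generating the alternating action and document lines by hand is error-prone, so a small serializer is usually worth having. The sketch below is one possible shape, using only the json module; the helper name to_bulk_ndjson and the (doc_id, doc) pair convention are assumptions of this example, not part of any Elasticsearch client.

```python
import json


def to_bulk_ndjson(index, items):
    """Serialize (doc_id, doc) pairs into a bulk body of index actions.

    doc_id may be None, in which case Elasticsearch generates the _id.
    """
    lines = []
    for doc_id, doc in items:
        action = {"index": {"_index": index}}
        if doc_id is not None:
            action["index"]["_id"] = doc_id
        # Compact separators keep every JSON object on exactly one line.
        lines.append(json.dumps(action, separators=(",", ":")))
        lines.append(json.dumps(doc, separators=(",", ":")))
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)


bulk_body = to_bulk_ndjson("products", [
    ("sku-1", {"name": "Widget", "price": 9.99, "in_stock": True}),
    ("sku-2", {"name": "Gadget", "price": 24.99, "in_stock": False}),
])
print(bulk_body, end="")
```

Because json.dumps escapes newline characters inside string values, a document can never spill over onto a second line and corrupt the action/document pairing.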

# Mixed action types in one bulk request
POST /_bulk
{"create":{"_index":"products","_id":"sku-3"}}
{"name":"Doohickey","price":4.99,"in_stock":true}
{"update":{"_index":"products","_id":"sku-1"}}
{"doc":{"in_stock":false}}
{"delete":{"_index":"products","_id":"sku-2"}}

The bulk response has a top-level errors boolean. If true, iterate the items array and check each item's error field. Failed items do not abort the batch — the other 999 documents in the same request are still processed. Sweet-spot batch size for most workloads is 5–15 MB uncompressed or 1,000–5,000 documents. Benchmark both dimensions: very large batches increase heap pressure and GC pauses on the coordinating node.
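Checking the errors flag and walking the items array can be sketched as follows. The response dictionary is a hand-written, abbreviated example in the shape the bulk API returns, and failed_bulk_items is a hypothetical helper name.

```python
def failed_bulk_items(response):
    """Return (position, action, status, error) for each failed bulk item."""
    if not response.get("errors"):
        return []  # fast path: nothing failed
    failures = []
    for position, item in enumerate(response.get("items", [])):
        # Each item is a single-key object such as {"index": {...}}.
        action, result = next(iter(item.items()))
        if "error" in result:
            failures.append((position, action, result.get("status"), result["error"]))
    return failures


# Abbreviated sample response written for this example.
sample_response = {
    "took": 30,
    "errors": True,
    "items": [
        {"index": {"_index": "products", "_id": "sku-1",
                   "status": 201, "result": "created"}},
        {"create": {"_index": "products", "_id": "sku-3", "status": 409,
                    "error": {"type": "version_conflict_engine_exception",
                              "reason": "document already exists"}}},
    ],
}

for position, action, status, error in failed_bulk_items(sample_response):
    print(position, action, status, error["type"])
```

The position in items matches the order of actions in the request, so it can be used to look up the original document for a retry or a dead-letter queue.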

Common Pitfalls

Five Elasticsearch JSON pitfalls account for the majority of production incidents. Understanding them before they hit saves hours of debugging and potential data loss. All 5 are preventable with proper mapping design and query discipline applied from the start of a project.

1. Mapping explosion from dynamic string fields. Dynamic mapping creates a field entry for every unique JSON key. Documents with thousands of unique keys (user metadata, CRM properties, IoT tags) keep growing the mapping until they hit index.mapping.total_fields.limit (1,000 by default); raising that limit to get past the resulting rejections can lead to OutOfMemoryError on all nodes. Fix: set "dynamic": "strict" or "dynamic": false at the index level and use explicit mappings. For semi-structured data, use the flattened type.

2. Confusing _source with stored fields. By default, Elasticsearch stores the original JSON document in _source and returns it in search hits. Disabling _source saves disk space but breaks the reindex API, update API, and highlighting. Disable it only for append-only write-heavy indices where you never need to retrieve or update the original document.

3. Using term on text fields. A term query for "Quick Brown Fox" on an analyzed text field returns 0 results because the indexed tokens are lowercase and split. Always use match for text fields and term for keyword, number, and date fields. Multi-fields (name.raw) solve the case where you need both.

4. Deep pagination with from+size beyond 10,000. Setting from: 9990, size: 10 forces Elasticsearch to fetch 10,000 documents per shard, sort them, and discard the first 9,990 — multiplied by the number of shards. Use search_after with a sort field for cursor-based pagination instead. Pass the last hit's sort values as the search_after array in each subsequent request.

5. Arrays of objects without nested type. Indexing [{ "name": "Alice", "role": "admin" }, { "name": "Bob", "role": "viewer" }] as a plain object array flattens fields into separate arrays, losing field correlations. A query for name:Alice AND role:viewer matches incorrectly. Map array-of-objects fields as nested and use the nested query type to preserve inner-document field associations.
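The false match described in pitfall 5 is easy to reproduce outside Elasticsearch. The toy model below mimics how a plain object array is flattened into per-field value lists and compares that with per-object matching; it is a simplified illustration of the behaviour, not Elasticsearch's implementation, and the field name users is hypothetical.

```python
from collections import defaultdict


def flatten_object_array(field, objects):
    """Mimic the object type: one independent value list per sub-field."""
    flat = defaultdict(list)
    for obj in objects:
        for key, value in obj.items():
            flat[f"{field}.{key}"].append(value)
    return dict(flat)


def matches_flattened(field, objects, conditions):
    """AND of conditions, each checked against its own flattened value list."""
    flat = flatten_object_array(field, objects)
    return all(value in flat.get(f"{field}.{key}", [])
               for key, value in conditions.items())


def matches_nested(objects, conditions):
    """AND of conditions that must all hold inside the same object."""
    return any(all(obj.get(key) == value for key, value in conditions.items())
               for obj in objects)


users = [{"name": "Alice", "role": "admin"}, {"name": "Bob", "role": "viewer"}]
wanted = {"name": "Alice", "role": "viewer"}

print(flatten_object_array("users", users))
print(matches_flattened("users", users, wanted))  # True: the false positive
print(matches_nested(users, wanted))              # False: the correct answer
```

The flattened form still contains both "Alice" and "viewer", just no longer tied to the same person, which is exactly the information the nested type preserves.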

Key Elasticsearch JSON Terms

Document: A JSON object stored in an Elasticsearch index, analogous to a row in a relational database; each document has a unique _id and a _source field containing the original JSON.
Mapping: A schema definition that tells Elasticsearch how to index each JSON field — including its data type (text, keyword, date, nested, etc.) and analysis settings.
Dynamic mapping: Elasticsearch's default behavior of automatically creating a field mapping entry the first time a new JSON key appears in an indexed document, which is convenient for development but dangerous in production due to the risk of mapping explosions.
NDJSON (Newline-Delimited JSON): A format used by the Elasticsearch bulk API where each line is a complete JSON object and lines are separated by newline characters; the bulk API expects alternating action lines and document lines in this format.
Analyzer: A pipeline applied to text fields during indexing and query time that tokenizes, lowercases, and optionally removes stop words or applies stemming; the match query uses the same analyzer as indexing, while term queries bypass it entirely.
Nested type: An Elasticsearch field type that stores each element of a JSON array-of-objects as a separate hidden document, preserving the correlation between sibling fields within each object and enabling accurate queries across them.
search_after: A cursor-based pagination mechanism where the sort values of the last result in a page are passed as the search_after parameter in the next request, enabling efficient traversal of result sets larger than 10,000 documents without the heap overhead of deep from+size pagination.

Frequently asked questions

What is the difference between a match query and a term query in Elasticsearch?

A match query analyzes the query string with the same analyzer that was applied to the field at index time. By default that is the standard analyzer, which tokenizes and lowercases the input but does not remove stop words unless configured to, making match ideal for full-text text fields. A term query is exact-match and bypasses the analyzer entirely, making it correct for keyword, number, boolean, and date fields. If you use a term query on a text field, you will almost always get zero results because Elasticsearch stored the analyzed (lowercased, tokenized) form, not the original string. For example, a term query for "Quick Brown Fox" against an analyzed text field returns nothing because the indexed tokens are "quick", "brown", and "fox". Use match for human-readable text search and term for structured values like IDs, status codes, and tags. Multi-fields (a name field with a name.raw keyword subfield) let you apply both query types to the same data. Use the JSON formatter to inspect query bodies before sending them to Elasticsearch.
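The difference can be made concrete with a toy model. The analyzer below only splits on non-word characters and lowercases, which is a deliberately rough stand-in for the standard analyzer (the real one follows Unicode text segmentation rules); all function names are hypothetical.

```python
import re


def toy_analyze(text):
    """Rough stand-in for the standard analyzer: lowercase, split into words."""
    return re.findall(r"\w+", text.lower())


# What ends up in the inverted index for a text field.
indexed_tokens = set(toy_analyze("The Quick Brown Fox"))


def term_matches(query_value):
    """term: the query value is compared to indexed tokens as-is."""
    return query_value in indexed_tokens


def match_matches(query_text):
    """match: the query text is analyzed first (default OR between tokens)."""
    return any(token in indexed_tokens for token in toy_analyze(query_text))


print(sorted(indexed_tokens))
print(term_matches("Quick Brown Fox"))   # False: no such single token exists
print(term_matches("quick"))             # True: matches one indexed token
print(match_matches("Quick Brown Fox"))  # True: analyzed to quick/brown/fox
```

The term query fails not because the text is missing but because the whole string "Quick Brown Fox" was never stored as one token.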

How do I bulk index JSON documents in Elasticsearch?

Use the bulk API at POST /_bulk with an NDJSON format body. Each document requires two lines: an action line and a document line. The action line is a JSON object such as {"index":{"_index":"products"}} and the document line is the full JSON document such as {"name":"Widget","price":9.99}. Set the Content-Type header to application/x-ndjson. Ideal batch size is 5–15 MB uncompressed or 1,000–5,000 documents — larger batches increase GC pressure on the coordinating node. The response contains a top-level errors boolean; if true, inspect the items array for per-document errors. Failed items do not abort the rest of the batch, so you must check each item individually on error. For very high throughput, run multiple parallel bulk threads (3–5) rather than a single large thread.

What causes a mapping explosion in Elasticsearch?

Dynamic mapping creates a new field entry for every unique JSON key in indexed documents. If your documents contain user-defined metadata or objects with thousands of unique keys (such as application event properties, IoT sensor tags, or CRM custom fields), the mapping keeps growing until it reaches index.mapping.total_fields.limit (1,000 by default), and raising that limit to get past the resulting rejections can eventually cause OutOfMemoryError crashes across all cluster nodes. Each new field mapping update must be propagated to every node, generating cluster state updates that can destabilize a busy cluster. The fix is to set "dynamic": "strict" (rejects unknown fields with a 400 error) or "dynamic": false (silently ignores unknown fields) at the index level and define only the fields you need to query in explicit mappings. For semi-structured data where you need ad-hoc querying of arbitrary keys, use the flattened field type, which stores an entire JSON object as a single field without generating per-key mapping entries.
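A quick way to spot explosion risk before indexing is to count the distinct field paths in a sample of documents. The sketch below does that with plain Python; the sample events and the function name field_paths are made up for illustration, and the count is only a lower bound because dynamic mapping can add subfields (such as a keyword subfield per string) that are not counted here.

```python
def field_paths(value, prefix=""):
    """Collect every dotted field path (objects and leaves) in a JSON value."""
    paths = set()
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= field_paths(child, path)
    elif isinstance(value, list):
        # Arrays do not add a path level; their elements share the parent path.
        for element in value:
            paths |= field_paths(element, prefix)
    return paths


# Hypothetical events whose property keys embed identifiers.
sample_docs = [
    {"event": "click", "props": {"btn_17": 1}},
    {"event": "view", "props": {"page_42": 1, "ref_9": "x"}},
]

all_paths = set()
for doc in sample_docs:
    all_paths |= field_paths(doc)

print(len(all_paths), sorted(all_paths))
```

If the number of distinct paths keeps climbing as more sample documents are added, the keys are carrying data rather than schema, which is the pattern the flattened type or a key/value nested layout is meant for.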

How do I search nested JSON objects in Elasticsearch?

Use the nested field type in your mapping and the nested query type in your search. By default, Elasticsearch flattens arrays of objects into parallel arrays of values, losing the association between sibling fields within each object. For example, if you index [{"name":"Alice","role":"admin"},{"name":"Bob","role":"viewer"}] as a plain object array, a query for name:Alice AND role:viewer incorrectly matches because the values are separated into independent arrays. The nested type stores each element as a separate hidden document, preserving field correlations. In your mapping set "type": "nested" on the field. In your query, wrap the inner query in a nested block with a path pointing to the nested field. Nested queries are more expensive than flat queries, so use them only when field correlation is required.
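As a concrete illustration, the snippet below builds a nested query body for the tags field from the products mapping shown earlier (a nested field with key and value keyword subfields). The helper name nested_query and the color/red values are hypothetical; inner field names must use the full path, such as tags.key.

```python
import json


def nested_query(path, must_clauses):
    """Wrap clauses so they must all match within one nested object."""
    return {
        "query": {
            "nested": {
                "path": path,
                "query": {"bool": {"must": list(must_clauses)}},
            }
        }
    }


body = nested_query("tags", [
    {"term": {"tags.key": "color"}},
    {"term": {"tags.value": "red"}},
])
print(json.dumps(body, indent=2))
```

Sent as the body of GET /products/_search, this only matches documents that contain a single tag object with key color and value red, not documents where those two values sit in different tag objects.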

Can I use JSON arrays as Elasticsearch fields?

Yes — all Elasticsearch field types implicitly handle arrays without any special mapping configuration. A text field can store ["hello", "world"] and Elasticsearch indexes each element individually, searching across all values transparently. The same applies to keyword, numeric, date, and boolean fields. The only constraint is that all array values must match the declared field type — mixing strings and numbers in a single array is not supported. For arrays of objects where you need to query correlated field values within each object, use the nested type as described above. For arrays of objects where you only need to query individual field values independently (no cross-field correlation needed), a regular object type is sufficient and significantly cheaper to query. Validate your document structure with a JSON Schema before indexing to catch type mismatches early.

How do I paginate through more than 10,000 Elasticsearch results?

Use search_after with a sort clause. The default from+size pagination is capped by index.max_result_window, which defaults to 10,000. Increasing this limit is not recommended because deep from+size pagination forces Elasticsearch to fetch and discard all preceding documents on every shard. Instead, add a deterministic sort (for example a timestamp plus a unique tiebreaker field, since a timestamp alone can repeat and cause skipped or duplicated hits), execute the first page, and pass the last hit's sort values array as the search_after parameter in the next request. Repeat until fewer results than the page size are returned. For a consistent point-in-time snapshot across pages, which is critical for export or migration use cases where documents may change between requests, open a PIT with POST /index/_pit?keep_alive=5m and include the pit.id in each search_after request, sent to /_search without an index name in the path. This ensures all pages reflect the same index state even if writes occur between paginated requests.
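The paging loop itself is short. The sketch below keeps the HTTP layer out of the picture: search is any callable that takes a request body and returns the parsed search response, which is an assumption of this example so that the loop can be shown (and run) against an in-memory stand-in. The names iter_hits, fake_search, and the seq field are hypothetical.

```python
def iter_hits(search, body, page_size=1000):
    """Yield every hit by following search_after cursors page by page."""
    if "sort" not in body:
        raise ValueError("search_after requires an explicit sort")
    request = dict(body, size=page_size)
    while True:
        hits = search(request)["hits"]["hits"]
        yield from hits
        if len(hits) < page_size:
            return  # a short (or empty) page means the results are exhausted
        # The cursor is the sort values of the last hit on this page.
        request = dict(request, search_after=hits[-1]["sort"])


# In-memory stand-in for a cluster: 25 hits sorted by a unique "seq" value.
DATA = [{"_id": str(i), "sort": [i]} for i in range(25)]


def fake_search(request):
    after = request.get("search_after")
    start = 0 if after is None else after[0] + 1
    return {"hits": {"hits": DATA[start:start + request["size"]]}}


all_hits = list(iter_hits(fake_search, {"sort": [{"seq": "asc"}]}, page_size=10))
print(len(all_hits))  # 25
```

With a real cluster, search would POST the body to /_search and decode the JSON response; when using a PIT, the pit object would be added to body once and carried along unchanged by the loop.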

Ready to work with Elasticsearch JSON?

Use Jsonic's JSON Formatter to validate and pretty-print documents and query bodies before sending them to Elasticsearch. You can also use JSONPath queries to explore the _source objects returned in search hits.
