JSON in MongoDB Aggregation: $match, $group, $lookup, $project & Pipeline Patterns

Last updated:

MongoDB aggregation pipelines transform collections through a sequence of JSON stage documents — [{'{'}{"$match": {'{'}...{'}'}{'}'}, {'{'}{"$group": {'{'}...{'}'}{'}'}, {'{'}{"$project": {'{'}...{'}'}{'}'}] — each stage receiving the output of the previous. $match filters documents using query operators ($gt, $in, $regex), $group aggregates by a field with accumulators ($sum, $avg, $push), $lookup joins two collections without leaving the database, and $project reshapes the output document. $unwind converts a JSON array field into one document per element — a document with a 5-element tags array becomes 5 documents with identical parent fields. Aggregation pipelines are passed as JSON arrays to db.collection.aggregate() in the shell or as JavaScript arrays in the Node.js driver. This guide covers all major aggregation stages with real examples, performance tips ($match first for index use), $facet for multi-dimensional grouping, and the $expr operator for field comparisons.

$match and $project: Filtering and Reshaping Documents

$match is the aggregation equivalent of a find() query predicate. Place it first to engage collection indexes and minimize the number of documents processed by downstream stages. $project then reshapes each passing document: include fields with 1, exclude with 0, or compute new fields using aggregation expressions. The two stages together form the foundation of any pipeline — filter first, then reshape.

// ── Sample documents in the "orders" collection ───────────────
// { _id: ObjectId, customerId: "c1", status: "complete",
//   amount: 450, items: [{name:"Widget",price:150},{name:"Gadget",price:300}],
//   createdAt: ISODate("2026-03-15T10:00:00Z"), region: "us-east" }

// ── $match: filter documents (uses index if placed first) ──────
db.orders.aggregate([
  {
    $match: {
      status: "complete",          // equality match
      amount: { $gt: 100 },        // range filter — uses index on amount
      region: { $in: ["us-east", "eu-west"] }, // multi-value match
      createdAt: {
        $gte: new Date("2026-01-01"),
        $lt:  new Date("2026-04-01"),
      },
    },
  },
])

// ── $project: include, exclude, and compute new fields ────────
db.orders.aggregate([
  { $match: { status: "complete" } },
  {
    $project: {
      _id:        0,               // exclude _id
      customerId: 1,               // include as-is
      amount:     1,               // include as-is
      // Computed field — $sum over an array of sub-documents
      itemTotal: { $sum: "$items.price" },
      // String expression
      displayAmount: {
        $concat: ["$", { $toString: "$amount" }]
      },
      // Conditional expression
      tier: {
        $cond: {
          if:   { $gte: ["$amount", 1000] },
          then: "gold",
          else: "standard",
        },
      },
      // Date decomposition
      year:  { $year:  "$createdAt" },
      month: { $month: "$createdAt" },
    },
  },
])

// ── $match with $expr — field-to-field comparison ─────────────
// $expr lets you use aggregation expressions inside $match
// Useful for comparing two fields in the same document
db.inventory.aggregate([
  {
    $match: {
      $expr: { $gt: ["$sold", "$stock"] }, // sold > stock
    },
  },
])

// ── $addFields — add computed fields without removing others ──
// Unlike $project, $addFields keeps ALL existing fields
db.orders.aggregate([
  { $match: { status: "complete" } },
  {
    $addFields: {
      tax:   { $multiply: ["$amount", 0.1] },
      total: { $add: ["$amount", { $multiply: ["$amount", 0.1] }] },
    },
  },
])

The $expr operator inside $match is the key to field-to-field comparisons within the same document — something regular query operators cannot express. It wraps any aggregation expression, so you can use $gt, $eq, $and, and even conditional expressions like $cond inside $match. Note that $match with $expr cannot use a regular index on the compared fields — it requires an expression index or falls back to a collection scan. Use it when the logic is necessary, and supplement with a range $match that does use an index to reduce the document count first.

$group: Aggregating with $sum, $avg, $push, and $first

$group is the aggregation workhorse. It collapses many documents into one per unique _id value and computes accumulator expressions across each group. The _id field can be a field path ("$category"), a compound expression ({ year: { $year: "$date" }, month: { $month: "$date" } }), or null to aggregate the entire collection into a single document.

// ── Basic $group: revenue by status ──────────────────────────
db.orders.aggregate([
  {
    $group: {
      _id:          "$status",          // group by status field
      totalRevenue: { $sum: "$amount" }, // sum amount per group
      orderCount:   { $sum: 1 },         // count documents per group
      avgAmount:    { $avg: "$amount" }, // average amount
      maxAmount:    { $max: "$amount" }, // largest order
      minAmount:    { $min: "$amount" }, // smallest order
    },
  },
])
// Output: [
//   { _id: "complete", totalRevenue: 45000, orderCount: 120, avgAmount: 375, ... }
//   { _id: "pending",  totalRevenue: 8200,  orderCount: 32,  avgAmount: 256, ... }
// ]

// ── Compound _id: group by multiple fields ─────────────────
db.orders.aggregate([
  {
    $group: {
      _id: {
        region: "$region",
        year:   { $year:  "$createdAt" },
        month:  { $month: "$createdAt" },
      },
      revenue:    { $sum: "$amount" },
      orderCount: { $sum: 1 },
    },
  },
  { $sort: { "_id.year": -1, "_id.month": -1, revenue: -1 } },
])

// ── $push: collect values into an array ───────────────────────
db.orders.aggregate([
  {
    $group: {
      _id:        "$customerId",
      orderAmounts: { $push: "$amount" },         // array of amounts
      orderIds:     { $push: "$_id" },             // array of order _ids
    },
  },
])

// ── $addToSet: collect unique values ─────────────────────────
db.orders.aggregate([
  {
    $group: {
      _id:     "$customerId",
      regions: { $addToSet: "$region" }, // deduplicated array
    },
  },
])

// ── $first and $last: require preceding $sort ─────────────────
db.orders.aggregate([
  { $sort: { createdAt: 1 } },             // sort oldest-first
  {
    $group: {
      _id:         "$customerId",
      firstOrder:  { $first: "$createdAt" },  // earliest order date
      firstAmount: { $first: "$amount" },     // amount of first order
      latestOrder: { $last:  "$createdAt" },  // most recent order date
    },
  },
])

// ── $group with null _id: aggregate entire collection ─────────
db.orders.aggregate([
  {
    $group: {
      _id:         null,
      grandTotal:  { $sum: "$amount" },
      orderCount:  { $sum: 1 },
      avgOrder:    { $avg: "$amount" },
    },
  },
  {
    $project: { _id: 0 }, // remove the null _id field
  },
])

$first and $last are order-dependent: their values come from whichever document arrives first or last within each group. Without a preceding $sort stage, MongoDB does not guarantee document ordering — so always add { $sort: { ... } } before any $group that uses $first or $last. The $push accumulator builds arrays that can grow very large with high-cardinality groups — prefer $addToSet when uniqueness matters, or add a $project with {"$slice"} to cap array length.

$lookup: Joining Collections with Pipeline Stages

$lookup performs a left outer join between the working collection and another collection in the same database. Every matching document from the joined collection is appended as elements of an array field on the local document. Documents with no match receive an empty array. The pipeline form of $lookup unlocks filtering, projecting, and sub-aggregating the joined data before materializing the result.

// Collections:
// orders: { _id, customerId, amount, status }
// customers: { _id, name, email, tier }

// ── Basic $lookup: join orders with customers ─────────────────
db.orders.aggregate([
  { $match: { status: "complete" } },
  {
    $lookup: {
      from:         "customers",    // collection to join
      localField:   "customerId",   // field in current document
      foreignField: "_id",          // field in joined collection
      as:           "customer",     // output array field name
    },
  },
  // customer is now an array — unwind to get a single object
  { $unwind: "$customer" },
  {
    $project: {
      amount:          1,
      "customer.name": 1,
      "customer.tier": 1,
    },
  },
])

// ── Pipeline $lookup: filter joined documents before joining ──
// Index on customers._id is used for the $match inside the pipeline
db.orders.aggregate([
  { $match: { status: "complete" } },
  {
    $lookup: {
      from: "customers",
      let:  { cid: "$customerId" },   // expose local fields as variables
      pipeline: [
        {
          $match: {
            $expr: { $eq: ["$_id", "$$cid"] }, // join condition
          },
        },
        {
          $project: { name: 1, tier: 1, email: 1, _id: 0 },
        },
      ],
      as: "customer",
    },
  },
  { $unwind: { path: "$customer", preserveNullAndEmptyArrays: true } },
])

// ── $lookup with $match after: filter on joined data ─────────
db.orders.aggregate([
  {
    $lookup: {
      from:         "customers",
      localField:   "customerId",
      foreignField: "_id",
      as:           "customer",
    },
  },
  { $unwind: "$customer" },
  // Filter AFTER join — requires index on customer.tier for efficiency
  { $match: { "customer.tier": "gold" } },
  {
    $group: {
      _id:        "$customer._id",
      totalSpend: { $sum: "$amount" },
      orderCount: { $sum: 1 },
    },
  },
])

// ── Self-join with $lookup: tree structures ───────────────────
// categories: { _id, name, parentId }
db.categories.aggregate([
  {
    $lookup: {
      from:         "categories",
      localField:   "_id",
      foreignField: "parentId",
      as:           "children",
    },
  },
  { $match: { parentId: { $exists: false } } }, // top-level only
])

Performance note: the foreignField in a basic $lookup must be indexed for the join to run efficiently. Without an index MongoDB performs a full collection scan of the joined collection for every document in the pipeline — an O(n×m) operation. The pipeline form of $lookup also benefits from indexes inside the sub-pipeline's $match. After a $lookup, always use $unwind if you expect exactly one match per document, then $project to remove the now-unnecessary array wrapper and discard fields you will not use.

$unwind: Expanding JSON Arrays into Documents

$unwind deconstructs an array field into one output document per array element, duplicating all non-array fields across each output document. It is the bridge between array-structured JSON and the document-per-row model that $group operates on. Use the configuration object form to control behavior around missing or empty arrays.

// Sample document:
// { _id: 1, productId: "p1", tags: ["json", "mongodb", "nosql"],
//   reviews: [{rating: 5, text: "Great"}, {rating: 4, text: "Good"}] }

// ── Basic $unwind: one doc per tag ────────────────────────────
db.products.aggregate([
  { $unwind: "$tags" },
])
// Output (3 documents from 1 input):
// { _id: 1, productId: "p1", tags: "json",    reviews: [...] }
// { _id: 1, productId: "p1", tags: "mongodb", reviews: [...] }
// { _id: 1, productId: "p1", tags: "nosql",   reviews: [...] }

// ── $unwind with includeArrayIndex: track position ────────────
db.products.aggregate([
  {
    $unwind: {
      path:              "$tags",
      includeArrayIndex: "tagPosition", // adds 0-based index field
    },
  },
])
// { tags: "json",    tagPosition: 0 }
// { tags: "mongodb", tagPosition: 1 }

// ── $unwind with preserveNullAndEmptyArrays ───────────────────
// Without this: documents with missing/null/empty tags are DROPPED
// With this:    those documents pass through with tags as null
db.products.aggregate([
  {
    $unwind: {
      path:                       "$tags",
      preserveNullAndEmptyArrays: true,
    },
  },
])

// ── $unwind + $group: count tag frequency ─────────────────────
db.products.aggregate([
  { $unwind: "$tags" },
  {
    $group: {
      _id:   "$tags",
      count: { $sum: 1 },
    },
  },
  { $sort: { count: -1 } },
  { $limit: 10 }, // top 10 tags
])

// ── $unwind nested array: ratings per product ─────────────────
db.products.aggregate([
  { $unwind: "$reviews" },
  {
    $group: {
      _id:       "$_id",
      avgRating: { $avg: "$reviews.rating" },
      ratingCount: { $sum: 1 },
    },
  },
])

// ── $unwind + $match: filter to specific array elements ───────
db.orders.aggregate([
  { $unwind: "$items" },
  { $match: { "items.price": { $gt: 200 } } }, // only expensive items
  {
    $group: {
      _id:          "$_id",
      expensiveItems: { $push: "$items" },
      total:          { $sum: "$items.price" },
    },
  },
])

A common pipeline pattern is $unwind → $match → $group: unwind the array to get one document per element, immediately filter to elements matching some criteria, then re-group by the original document _id to collect the matching elements back into arrays. This is how you filter array contents — MongoDB cannot do it inline without $unwind when you need to aggregate on the filtered elements. For purely selecting documents based on array contents without aggregation, a $match with $elemMatch in a find() query is far more efficient.

$facet: Multi-Dimensional Aggregation in One Pass

$facet runs multiple sub-pipelines on the same set of input documents in a single aggregation pass, returning a single document whose fields contain the results of each sub-pipeline. This is the standard pattern for search result pages that need both filtered results and category counts simultaneously — without making multiple round-trips to the database.

// ── $facet: search page with results + counts in one query ────
db.products.aggregate([
  // Stage 1: filter to the search scope (uses index)
  {
    $match: {
      category:  { $in: ["electronics", "computers"] },
      price:     { $lt: 500 },
      inStock:   true,
    },
  },
  // Stage 2: $facet branches — each sub-pipeline is independent
  {
    $facet: {
      // Branch A: paginated results
      results: [
        { $sort: { price: 1 } },
        { $skip: 0 },    // page 1, offset 0
        { $limit: 20 },  // 20 per page
        {
          $project: {
            name:     1,
            price:    1,
            category: 1,
            imageUrl: 1,
          },
        },
      ],

      // Branch B: total count (for pagination)
      totalCount: [
        { $count: "count" },
      ],

      // Branch C: price range distribution
      priceRanges: [
        {
          $bucket: {
            groupBy:    "$price",
            boundaries: [0, 50, 100, 200, 500],
            default:    "500+",
            output: {
              count: { $sum: 1 },
            },
          },
        },
      ],

      // Branch D: category breakdown
      categoryBreakdown: [
        {
          $group: {
            _id:   "$category",
            count: { $sum: 1 },
          },
        },
        { $sort: { count: -1 } },
      ],

      // Branch E: average price and availability stats
      summary: [
        {
          $group: {
            _id:      null,
            avgPrice: { $avg: "$price" },
            minPrice: { $min: "$price" },
            maxPrice: { $max: "$price" },
          },
        },
        { $project: { _id: 0 } },
      ],
    },
  },
])
// Single output document:
// {
//   results:           [ {name: ..., price: ...}, ... ],        // 20 items
//   totalCount:        [ { count: 184 } ],
//   priceRanges:       [ { _id: 0, count: 12 }, { _id: 50, count: 45 }, ... ],
//   categoryBreakdown: [ { _id: "electronics", count: 102 }, ... ],
//   summary:           [ { avgPrice: 187.5, minPrice: 9.99, maxPrice: 499 } ]
// }

// ── Extract values in application code ────────────────────────
const [facetResult] = await db.collection("products").aggregate([...]).toArray()
const products    = facetResult.results
const total       = facetResult.totalCount[0]?.count ?? 0
const avgPrice    = facetResult.summary[0]?.avgPrice ?? 0
const pageCount   = Math.ceil(total / 20)

$facet loads all input documents into memory before branching — it cannot use indexes inside the sub-pipelines and does not support $out or $merge within branches. The input to $facet must fit within the 100 MB aggregation memory limit (or use allowDiskUse: true). For very large filtered sets, consider adding a more aggressive $match before $facet to reduce the input size, or running two separate aggregations instead.

Performance: Index Usage, $match Position, and explain()

MongoDB's aggregation planner optimizes pipelines before execution — it can push a leading $match into the query layer so it uses a collection index, and it can move a $sort before $group when the group key matches the sort key. Understanding these rules lets you write pipelines that perform well at scale.

// ── Read explain output for a pipeline ───────────────────────
// executionStats shows whether stages use IXSCAN or COLLSCAN
db.orders.explain("executionStats").aggregate([
  { $match: { status: "complete", amount: { $gt: 100 } } },
  { $group: { _id: "$customerId", total: { $sum: "$amount" } } },
])
// Look for "queryPlanner.winningPlan.inputStage.stage"
// IXSCAN = uses index ✓
// COLLSCAN = full scan, consider adding an index ✗

// ── Create compound index for a common pipeline pattern ───────
// Pipeline: $match status + sort createdAt + group customerId
db.orders.createIndex(
  { status: 1, createdAt: -1 },   // covers $match on status, $sort on createdAt
  { name: "idx_status_date" }
)
// For $lookup join efficiency:
db.customers.createIndex({ _id: 1 }) // usually already exists as _id index

// ── Pipeline ordering rules ────────────────────────────────────
// Rule 1: $match first → uses collection index
db.orders.aggregate([
  { $match: { status: "complete" } },  // ✓ index on status
  { $group: { _id: "$region", revenue: { $sum: "$amount" } } },
])

// Rule 2: $sort before $group when sorting by group key
db.orders.aggregate([
  { $sort:  { customerId: 1, createdAt: 1 } }, // index on customerId + createdAt
  {
    $group: {
      _id:       "$customerId",
      firstDate: { $first: "$createdAt" },
      lastDate:  { $last:  "$createdAt" },
    },
  },
])

// Rule 3: $project / $addFields early to reduce document size
db.orders.aggregate([
  { $match: { status: "complete" } },
  {
    // Drop large fields early — reduces data flowing through later stages
    $project: {
      customerId: 1,
      amount:     1,
      region:     1,
      // NOT including large fields like "auditLog", "rawData"
    },
  },
  {
    $lookup: {
      from:         "customers",
      localField:   "customerId",
      foreignField: "_id",
      as:           "customer",
    },
  },
])

// ── allowDiskUse for large intermediate results ─────────────
// Required when any stage exceeds 100 MB
db.orders.aggregate(
  [
    { $match: { year: 2025 } },
    { $sort: { amount: -1 } },  // sorting 10M documents
    { $group: { _id: "$region", total: { $sum: "$amount" } } },
  ],
  { allowDiskUse: true }
)

// ── $merge: write aggregation results to a collection ─────────
// Useful for pre-computing expensive aggregations on a schedule
db.orders.aggregate([
  { $match: { status: "complete" } },
  { $group: { _id: "$region", total: { $sum: "$amount" } } },
  {
    $merge: {
      into:           "regionTotals",      // target collection
      on:             "_id",               // match field
      whenMatched:    "replace",
      whenNotMatched: "insert",
    },
  },
])

The most common aggregation performance mistake is placing $match after a $lookup or $unwind when that same filter could have appeared before them. If you need to filter on the joined data, use the pipeline form of $lookup with a $match inside the sub-pipeline — MongoDB can use the joined collection's indexes inside the sub-pipeline, keeping the join selective. Profile pipelines in production with db.setProfilingLevel(1, {'{ slowms: 100 }'}) and check db.system.profile.find({'{ op: "command" }'}).sort({'{ ts: -1 }'}).limit(5) for slow aggregation operations.

Using Aggregation with the Node.js Driver and Mongoose

The Node.js MongoDB driver and Mongoose both accept aggregation pipelines as plain JavaScript arrays. The driver returns an AggregationCursor for streaming large results; Mongoose's Model.aggregate() returns a plain array Promise. Neither applies schema validation or population to aggregation results — what MongoDB returns is what you get.

import { MongoClient } from 'mongodb'

// ── Native driver: AggregationCursor ──────────────────────────
const client = new MongoClient(process.env.MONGODB_URI!)
const db     = client.db('myapp')

// toArray() — collects all results into memory
const results = await db.collection('orders').aggregate([
  { $match:  { status: 'complete', amount: { $gt: 100 } } },
  { $group:  { _id: '$customerId', total: { $sum: '$amount' }, count: { $sum: 1 } } },
  { $sort:   { total: -1 } },
  { $limit:  10 },
]).toArray()
// results: [{ _id: "c42", total: 4800, count: 12 }, ...]

// for await...of — streaming for large result sets (O(1) memory)
const cursor = db.collection('events').aggregate([
  { $match: { eventType: 'purchase' } },
  { $project: { userId: 1, amount: 1, _id: 0 } },
])

for await (const doc of cursor) {
  await processEvent(doc) // process one document at a time
}

// ── allowDiskUse + maxTimeMS options ──────────────────────────
const bigResults = await db.collection('orders').aggregate(
  [
    { $match: { year: 2025 } },
    { $group: { _id: '$region', revenue: { $sum: '$amount' } } },
  ],
  {
    allowDiskUse: true,   // allow spilling to disk past 100 MB
    maxTimeMS: 30_000,    // 30-second timeout
    comment: 'revenue-by-region-2025', // visible in profiler
  }
).toArray()

// ── TypeScript: type the result ────────────────────────────────
interface RevenueByRegion {
  _id:     string  // region
  revenue: number
  orders:  number
}

const typed = await db
  .collection<{ region: string; amount: number; status: string }>('orders')
  .aggregate<RevenueByRegion>([
    { $match:  { status: 'complete' } },
    { $group:  { _id: '$region', revenue: { $sum: '$amount' }, orders: { $sum: 1 } } },
  ])
  .toArray()
// typed is RevenueByRegion[]

// ── Mongoose: Model.aggregate() ──────────────────────────────
import Order from './models/Order'

// Model.aggregate() returns Promise<any[]> — Mongoose DOES NOT
// apply schema transforms, virtuals, or getters to aggregation output
const monthlyRevenue = await Order.aggregate([
  {
    $match: {
      status: 'complete',
      createdAt: { $gte: new Date('2026-01-01') },
    },
  },
  {
    $group: {
      _id: {
        year:  { $year:  '$createdAt' },
        month: { $month: '$createdAt' },
      },
      revenue:    { $sum: '$amount' },
      orderCount: { $sum: 1 },
    },
  },
  { $sort: { '_id.year': -1, '_id.month': -1 } },
])

// ── Mongoose with Aggregate object (chainable) ────────────────
const pipeline = Order.aggregate()
  .match({ status: 'complete' })
  .group({ _id: '$region', total: { $sum: '$amount' } })
  .sort({ total: -1 })
  .limit(5)
  .option({ allowDiskUse: true })

const top5Regions = await pipeline.exec()

// ── Full pipeline example: top customers with order details ───
const topCustomers = await db.collection('orders').aggregate([
  // 1. Filter completed orders from last 90 days
  {
    $match: {
      status:     'complete',
      createdAt:  { $gte: new Date(Date.now() - 90 * 24 * 60 * 60 * 1000) },
    },
  },
  // 2. Group by customer
  {
    $group: {
      _id:        '$customerId',
      totalSpend: { $sum: '$amount' },
      orderCount: { $sum: 1 },
      avgOrder:   { $avg: '$amount' },
      lastOrder:  { $max: '$createdAt' },
    },
  },
  // 3. Filter to meaningful customers
  { $match: { orderCount: { $gte: 3 } } },
  // 4. Join customer info
  {
    $lookup: {
      from:         'customers',
      localField:   '_id',
      foreignField: '_id',
      as:           'customer',
    },
  },
  { $unwind: '$customer' },
  // 5. Reshape output
  {
    $project: {
      _id:         0,
      customerId:  '$_id',
      name:        '$customer.name',
      email:       '$customer.email',
      totalSpend:  1,
      orderCount:  1,
      avgOrder:    { $round: ['$avgOrder', 2] },
      lastOrder:   1,
    },
  },
  // 6. Sort by spend descending, limit to top 100
  { $sort:  { totalSpend: -1 } },
  { $limit: 100 },
]).toArray()

Mongoose's Model.aggregate() bypasses schema middleware, virtuals, and toJSON() transforms — the result is raw MongoDB documents, not Mongoose document instances. This means populate() does not work on aggregation results. Use $lookup inside the pipeline instead. When building complex pipelines programmatically, construct the JSON array step by step and test each stage incrementally in MongoDB Compass or the shell to verify intermediate document shapes before adding the next stage.

Key Terms

aggregation pipeline
A JSON array of stage documents passed to db.collection.aggregate(). Documents flow through each stage in sequence — the output of each stage is the input of the next. Stages are JSON objects with a single operator key ($match, $group, etc.) whose value is a JSON object configuring that stage. MongoDB supports over 30 stages; the most commonly used are $match, $group, $project, $lookup, $unwind, $sort, $limit, $skip, $facet, $count, $addFields, $replaceRoot, $out, and $merge. Unlike SQL, pipeline stages can appear multiple times and in any order.
accumulator
An accumulator operator computes a value across all documents within a $group group or across array elements in $project / $addFields. The group accumulators are: $sum (sum or count with 1), $avg (mean), $min (minimum), $max (maximum), $push (build array from all values), $addToSet (build deduplicated array), $first (first value in group after sort), $last (last value), $stdDevPop, and $stdDevSamp. Array expression operators like $sum and $avg also work on arrays within a single document when used in $project — for example {"$sum": "$items.price"} sums all prices in an array field.
$lookup
A $lookup stage performs a left outer join between two collections in the same MongoDB database. The basic form specifies from (joined collection), localField (field in the working document), foreignField (field in the joined collection), and as (name of the output array). The pipeline form (let + pipeline) enables filtering and transforming the joined documents before materializing the result, supports joins on multiple fields, and can run sub-aggregations on the joined data. $lookup is equivalent to SQL LEFT JOIN: documents in the working collection with no matches in the joined collection receive an empty array rather than being dropped.
$unwind
$unwind deconstructs a JSON array field into one output document per array element. A document with a 5-element array produces 5 output documents, each with the array field set to a single element. The path option specifies the array field path (e.g., "$items"). The preserveNullAndEmptyArrays option passes through documents where the array field is absent, null, or empty — without it those documents are silently dropped. The includeArrayIndex option adds a field containing the 0-based position of each element in the original array. $unwind is typically followed immediately by $match to filter elements and $group to re-aggregate.
$facet
$facet runs multiple independent sub-pipelines on the same input documents and returns a single output document containing the results of all sub-pipelines as named arrays. This enables computing multiple aggregations in one database round-trip — for example: paginated results, total count, price distribution, and category breakdown for a search results page. Each sub-pipeline is a full aggregation pipeline with its own stages, sorts, and limits. $facet must fit its input within the 100 MB aggregation memory limit and does not support $out or $merge inside sub-pipelines.
$expr
The $expr operator allows aggregation expressions to appear inside a $match stage or a find() query. Its primary use is field-to-field comparisons within the same document — for example {"$match": {"$expr": {"$gt": ["$sold", "$stock"]}}} matches documents where the sold field exceeds the stock field, which is impossible with regular query operators that compare a field to a constant. $expr with a comparison expression cannot use a standard B-tree index; it requires an expression index or performs a collection scan. Combine with a range $match that does use an index to reduce the document count before the $expr check runs.

FAQ

What is a MongoDB aggregation pipeline?

A MongoDB aggregation pipeline is a JSON array of stage documents passed to db.collection.aggregate(). Each stage is a JSON object with a single operator key such as $match, $group, $project, $lookup, $unwind, or $sort. Documents from the collection enter the first stage, are transformed or filtered by that stage, and the output flows into the next stage. MongoDB supports over 30 stages, and stages can appear multiple times — for example, a pipeline can have 2 $match stages (one for index-backed initial filtering, one after $group to filter aggregated results) and multiple $project stages. Aggregation pipelines replace the older MapReduce API entirely: they are faster, more expressive, and can run in parallel on sharded clusters. A typical 3-stage pipeline filters, groups, and sorts: [{"$match": {'{...}'}}, {"$group": {'{...}'}}, {"$sort": {'{...}'}}].

How do I filter documents in a MongoDB aggregation pipeline?

Use $match as the first stage of your pipeline. $match accepts the same query operators as db.collection.find(): equality checks, comparison operators ($gt, $lt, $gte, $lte), multi-value operators ($in, $nin), logical operators ($and, $or, $not), string operators ($regex), and array operators ($elemMatch). Placing $match first is critical for performance: MongoDB's query planner recognizes a leading $match and uses collection indexes just like a find() query, reducing the number of documents that downstream stages must process. A second $match stage after a $group filters the aggregated results — MongoDB 5.0+ can use the HAVING-equivalent pattern [{"$group": {'{...}'}}, {"$match": {"count": {"$gt": 5}}}]. For field-to-field comparisons inside $match, use the $expr operator.

How does $group work in MongoDB aggregation?

$group collapses multiple documents into one output document per unique value of the _id expression. The _id field is required and defines the grouping key — use a field path like "$category" to group by a single field, a compound object like { year: { $year: "$date" }, month: { $month: "$date" } } to group by multiple dimensions, or null to aggregate the entire collection into one document. Accumulator operators compute values across all documents in each group: {"$sum": "$amount"} sums the amount field; {"$sum": 1} counts documents (equivalent to SQL COUNT(*)); {"$avg": "$price"} computes the mean price; {"$push": "$item"} collects all item values into an array. A $group stage does not output documents in any guaranteed order — add $sort after it to order the results. MongoDB can process up to 100 MB of data in a $group stage by default; enable allowDiskUse: true for larger datasets.

How do I join two MongoDB collections with $lookup?

$lookup performs a left outer join. The basic form takes 4 fields: from (the collection to join), localField (the field in the current document), foreignField (the matching field in the joined collection), and as (the name of the output array field). For every document in the pipeline, MongoDB finds all matching documents in the from collection and appends them as an array to the as field. A document with 3 matching records gets an array of length 3; a document with no matches gets an empty array. Follow $lookup with $unwind: "$fieldName" to flatten the 1-element array into a single embedded object (for 1-to-1 joins). The pipeline form of $lookup adds a let object (exposing local field values as variables) and a pipeline array (a sub-pipeline that runs inside the joined collection). Use the pipeline form when you need to filter the joined documents with conditions involving both collections, join on multiple fields, or apply a sub-aggregation. Always index the foreignField in the joined collection so MongoDB can use a nested-loop join rather than a full collection scan.

What does $unwind do to JSON arrays in MongoDB?

$unwind deconstructs an array field in each input document into one output document per array element. Every non-array field is duplicated across all output documents. A document with a 5-element tags array produces 5 output documents — each identical except that tags holds a single string value instead of an array. This expansion multiplies the document count: 10,000 documents each with an average of 8 array elements produce 80,000 intermediate documents after $unwind. Always follow $unwind with $match or $project to reduce intermediate document size. By default, $unwind silently drops documents where the array field is absent, null, or empty — add preserveNullAndEmptyArrays: true to retain those documents with the field set to null. The includeArrayIndex option adds an integer field with the 0-based position of each element in the original array.

How do I add computed fields to MongoDB documents with $project?

In $project, set a field to an aggregation expression to compute its value. Arithmetic: {"$project": {'{ "total": { "$sum": "$items.price" }, "tax": { "$multiply": ["$amount", 0.08] } }'}}. String: {"$concat": ["$firstName", " ", "$lastName"]}. Conditional: {"$cond": {'{ "if": { "$gte": ["$score", 90] }, "then": "A", "else": "B" }'}}. Date decomposition: {"$year": "$createdAt"}, {"$month": "$createdAt"}, {"$dateToString": {'{ "format": "%Y-%m-%d", "date": "$createdAt" }'}}. Array operations: {"$size": "$items"} returns array length; {"$arrayElemAt": ["$items", 0]} gets the first element; {"$slice": ["$items", 3]} gets the first 3 elements; {"$filter": {'{ "input": "$items", "as": "item", "cond": { "$gt": ["$$item.price", 100] } }'}} filters array elements inline without $unwind. Type conversion: {"$toString": "$amount"}, {"$toDouble": "$stringAmount"}, {"$toDate": "$timestamp"}. The $addFields (or $set) stage does the same thing as $project for computed fields but keeps all existing fields — useful when you only want to add 1-2 fields without listing all others for inclusion.

How do I optimize a MongoDB aggregation pipeline for performance?

Five rules cover most optimization scenarios. First, place $match as the first pipeline stage — MongoDB's query planner recognizes this and uses collection indexes, reducing the working set from millions of documents to thousands or hundreds before any other stage runs. Second, check your pipeline with db.collection.explain("executionStats").aggregate([...]) and look for IXSCAN vs COLLSCAN in the winning plan. Third, add a $project or $addFields early to remove large fields (embedded arrays, blob data, audit fields) that downstream stages do not need — smaller documents flow faster through $lookup, $sort, and $group. Fourth, index the foreignField in any collection referenced by $lookup; without an index, $lookup scans the entire joined collection for every document in the pipeline. Fifth, enable allowDiskUse: true when intermediate results exceed 100 MB — $sort and $group over large datasets spill to disk rather than failing. Set maxTimeMS to enforce query timeouts on production pipelines.

How do I run a MongoDB aggregation pipeline in Node.js?

With the official MongoDB Node.js driver, call collection.aggregate(pipeline) where pipeline is a JavaScript array of stage objects — the same JSON structure used in the MongoDB shell. The call returns an AggregationCursor. Call .toArray() to collect all results into a JavaScript array; use for await...of cursor to iterate documents one at a time without loading all into memory (necessary for large result sets). Pass options as the second argument: { allowDiskUse: true, maxTimeMS: 30000 }. For TypeScript, pass the expected document type as a generic: collection.aggregate<OutputType>([...]). With Mongoose, use Model.aggregate(pipeline) which returns a Promise<any[]> — Mongoose applies no schema middleware, virtuals, or toJSON() transforms to aggregation output. The chainable Mongoose Aggregate object (Model.aggregate().match().group().sort()) builds the pipeline incrementally and is useful when pipeline stages are conditional. Neither the driver nor Mongoose validate the pipeline structure before sending it to MongoDB — a stage typo like $grup returns an error from the server at runtime.

Explore MongoDB JSON documents

Use Jsonic's JSON formatter to inspect and validate MongoDB query and aggregation pipeline JSON before running it against your collection.

Open JSON Formatter

Further reading and primary sources