Difference Between Placing `$match` Before or After `$unwind` in MongoDB Aggregate for My BooksDB

108 views Asked by At

Hello MongoDB community,

I am trying to understand the difference in the results and performance implications between two aggregate queries in MongoDB. Both queries aim to count the number of books each author has written, but they place the $match stage at different points in the pipeline.

Here's the first query:

db.books.aggregate([
  { $match: { 'authors': { $ne: '' } } },
  { $unwind: '$authors' },
  { $group: { _id: '$authors', numBooks: { $count: {} } } },
  { $sort: { 'numBooks': -1 } },
  { $limit: 5 }
])

And here is the second one:

db.books.aggregate([
  { $unwind: '$authors' },
  { $match: { 'authors': { $ne: '' } } },
  { $group: { _id: '$authors', numBooks: { $count: {} } } },
  { $sort: { 'numBooks': -1 } },
  { $limit: 5 }
])

The first query filters out documents where the 'authors' array is empty before unwinding the array, while the second one unwinds the 'authors' array first and then filters out any documents where the 'authors' field is an empty string.

output of my queries

Shouldn't the output of both queries be identical? If I $unwind the array and subsequently filter out the non-empty sets or if I first eliminate empty arrays and then proceed with the $unwind operation on the remaining documents with authors, the final result should be the same, right?

The schema of my database is as follows:

I'm pondering this because the 'authors' field is an array of strings, and it seems like regardless of whether the $match stage comes before or after the $unwind stage, the operations should theoretically filter out the same data. Yet, does the sequence of $unwind and $match somehow influence the outcome in ways I'm not accounting for? Could someone explain the difference between these two approaches?

Thanks

I've thoroughly reviewed the database and individually examined the entries for movies by the varying authors, but I haven't found anything unusual. And I still don't understand why it's yielding different results.

1

There are 1 answers

1
aneroid On BEST ANSWER

The difference between matching first & then unwinding vs unwinding first and then matching is the effect of $match with arrays vs $match with strings.

1. In the case of $match first, { 'authors': { $ne: '' } } - if any of the authors in the array is an empty string, then it excludes the whole document, not just that one item in the array.

So when you match first, all the books which have at least one empty string as an author get eliminated. And then there are fewer books to $unwind in the next stage.

Here's a shorter example reproducing the behaviour:

Collection:

[
  { _id: 1, names: [ "Alice", "Bob", "Carol" ] },
  { _id: 2, names: [ "David", "", "Elle" ] },
]

Check which documents match

db.collection.aggregate([
  {
    $match: { "names": { $ne: "" } }
  }
])

Result is the whole 1st document and no parts of the 2nd:

[
  { _id: 1, names: [ "Alice", "Bob", "Carol" ] },
]

Mongo Playground

2. When you $unwind first and then $match, you're matching a string with a string (not an array of strings) - so it would exclude just that one unwound 'row'/record - which means the other id=2 records for 'David' and 'Elle' remain after the stage.