Hello MongoDB community,
I am trying to understand the difference in the results and performance implications between two aggregate queries in MongoDB. Both queries aim to count the number of books each author has written, but they place the $match
stage at different points in the pipeline.
Here's the first query:
db.books.aggregate([
{ $match: { 'authors': { $ne: '' } } },
{ $unwind: '$authors' },
{ $group: { _id: '$authors', numBooks: { $count: {} } } },
{ $sort: { 'numBooks': -1 } },
{ $limit: 5 }
])
And here is the second one:
db.books.aggregate([
{ $unwind: '$authors' },
{ $match: { 'authors': { $ne: '' } } },
{ $group: { _id: '$authors', numBooks: { $count: {} } } },
{ $sort: { 'numBooks': -1 } },
{ $limit: 5 }
])
The first query filters out documents where the 'authors' array is empty before unwinding the array, while the second one unwinds the 'authors' array first and then filters out any documents where the 'authors' field is an empty string.
Shouldn't the output of both queries be identical? If I $unwind the array and subsequently filter out the non-empty sets or if I first eliminate empty arrays and then proceed with the $unwind operation on the remaining documents with authors, the final result should be the same, right?
The schema of my database is as follows:
I'm pondering this because the 'authors' field is an array of strings, and it seems like regardless of whether the $match stage comes before or after the $unwind stage, the operations should theoretically filter out the same data. Yet, does the sequence of $unwind and $match somehow influence the outcome in ways I'm not accounting for? Could someone explain the difference between these two approaches?
Thanks
I've thoroughly reviewed the database and individually examined the entries for movies by the varying authors, but I haven't found anything unusual. And I still don't understand why it's yielding different results.
The difference between matching first & then unwinding vs unwinding first and then matching is the effect of
$match
with arrays vs $match with strings.1. In the case of
$match
first,{ 'authors': { $ne: '' } }
- if any of the authors in the array is an empty string, then it excludes the whole document, not just that one item in the array.So when you match first, all the
books
which have at least one empty string as an author get eliminated. And then there are fewer books to$unwind
in the next stage.Here's a shorter example reproducing the behaviour:
Collection:
Check which documents match
Result is the whole 1st document and no parts of the 2nd:
Mongo Playground
2. When you
$unwind
first and then$match
, you're matching a string with a string (not an array of strings) - so it would exclude just that one unwound 'row'/record - which means the otherid=2
records for 'David' and 'Elle' remain after the stage.