How to group by different fields

Question

How to group by different fields

579 views Asked by Michael K. At 09 June 2015 at 11:32

I want to find all users named 'Hans' and aggregate their 'age' and number of 'childs' by grouping them. Assuming I have following in my database 'users'.

{
    "_id" : "01",
    "user" : "Hans",
    "age" : "50"
    "childs" : "2"
}
{
    "_id" : "02",
    "user" : "Hans",
    "age" : "40"
    "childs" : "2"
}
{
    "_id" : "03",
    "user" : "Fritz",
    "age" : "40"
    "childs" : "2"
}
{
    "_id" : "04",
    "user" : "Hans",
    "age" : "40"
    "childs" : "1"
}

The result should be something like this:

"result" : 
[
  { 
    "age" : 
      [
        {
          "value" : "50",
          "count" : "1"
        },
        {
          "value" : "40",
          "count" : "2"
        }
      ]
  },
  { 
    "childs" : 
      [
        {
          "value" : "2",
          "count" : "2"
        },
        {
          "value" : "1",
          "count" : "1"
        }
      ]
  }  
]

How can I achieve this?

Original Q&A

There are 2 answers

Sylvain Leroux On 09 June 2015 at 12:30

That was a tough one!

First, the bare solution:

db.test.aggregate([
 { "$match": { "user": "Hans" } },
 // duplicate each document: one for "age", the other for "childs"
 { $project: { age: "$age", childs: "$childs",
               data: {$literal: ["age", "childs"]}}},
 { $unwind: "$data" },
 // pivot data to something like { data: "age", value: "40" }
 { $project: { data: "$data",
               value: {$cond: [{$eq: ["$data", "age"]},
                               "$age", 
                               "$childs"]} }},
 // Group by data type, and count
 { $group: { _id: {data: "$data", value: "$value" }, 
             count: { $sum: 1 }, 
             value: {$first: "$value"} }},
 // aggregate values in an array for each independant (type,value) pair
 { $group: { _id: "$_id.data", values: { $push: { count: "$count", value: "$value" }} }} ,
 // project value to the correctly name field
 { $project: { result: {$cond: [{$eq: ["$_id", "age"]},
                               {age: "$values" }, 
                               {childs: "$values"}]} }},
 // group all data in the result array, and remove unneeded `_id` field 
 { $group: { _id: null, result: { $push: "$result" }}},
 { $project: { _id: 0, result: 1}}
])

Producing:

{
    "result" : [
        {
            "age" : [
                {
                    "count" : 3,
                    "value" : "40"
                },
                {
                    "count" : 1,
                    "value" : "50"
                }
            ]
        },
        {
            "childs" : [
                {
                    "count" : 1,
                    "value" : "1"
                },
                {
                    "count" : 3,
                    "value" : "2"
                }
            ]
        }
    ]
}

And now, for some explanations:

One of the major issues here is that each incoming document has to be part of two different sums. I solved that by adding a literal array ["age", "childs"] to your documents, and then unwinding them by that array. That way, each document will be presented twice in the later stage.

Once that done, to ease processing, I change the data representation to something much more manageable like { data: "age", value: "40" }

The following steps will perform the data aggregation per-se. Up to the third $project step that will map the value fields to the corresponding age or childs field.

The final two steps will simply wrap the two documents in one, removing the unneeded _id field.

Pfff!

**AudioBubble** · Accepted Answer · 2015-06-09T13:48:27+00:00

This should almost be a MongoDB FAQ, mostly because it is a real example concept of how you should be altering your thinking from SQL processing and embracing what engines like MongoDB do.

The basic principle here is "MongoDB does not do joins". Any way of "envisioning" how you would construct SQL to do this essentially requires a "join" operation. The typical form is "UNION" which is in fact a "join".

So how to do it under a different paradigm? Well first, let's approach how not to do it and understand the reasons why. Even if of course it will work for your very small sample:

The Hard Way

db.docs.aggregate([
    { "$group": {
        "_id": null,
        "age": { "$push": "$age" },
        "childs": { "$push": "$childs" }
    }},
    { "$unwind": "$age" },
    { "$group": {
        "_id": "$age",
        "count": { "$sum": 1  },
        "childs": { "$first": "$childs" }
    }},
    { "$sort": { "_id": -1 } },
    { "$group": {
        "_id": null,
        "age": { "$push": {
            "value": "$_id",
            "count": "$count"
        }},
        "childs": { "$first": "$childs" }
    }},
    { "$unwind": "$childs" },
    { "$group": {
        "_id": "$childs",
        "count": { "$sum": 1 },
        "age": { "$first": "$age" }
    }},
    { "$sort": { "_id": -1 } },
    { "$group": {
        "_id": null,
        "age": { "$first": "$age" },
        "childs": { "$push": {
            "value": "$_id",
            "count": "$count"
        }}
    }}
])

That will give you a result like this:

{
    "_id" : null,
    "age" : [
            {
                    "value" : "50",
                    "count" : 1
            },
            {
                    "value" : "40",
                    "count" : 3
            }
    ],
    "childs" : [
            {
                    "value" : "2",
                    "count" : 3
            },
            {
                    "value" : "1",
                    "count" : 1
            }
    ]
}

So why is this bad? The main problem should be apparent in the very first pipeline stage:

    { "$group": {
        "_id": null,
        "age": { "$push": "$age" },
        "childs": { "$push": "$childs" }
    }},

What we asked to do here is group up everything in the collection for the values we want and $push those results into an array. When things are small then this works, but real world collections would result in this "single document" in the pipeline that exceeds the 16MB BSON limit that is allowed. That is what is bad.

The rest of the logic follows the natural course by working with each array. But of course real world scenarios would almost always make this untenable.

You could avoid this somewhat, by doing things like "duplicating" the documents to be of "type" "age or "child" and grouping documents individually by type. But it's all a bit to "over complex" and not a solid way of doing things.

The natural response is "what about a UNION?", but since MongoDB does not do the "join" then how to approach that?

A Better Way ( aka A New Hope )

Your best approach here both architecturally and performance wise is to simply submit "both" queries ( yes two ) in "parallel" to the server via your client API. As the results are received you then "combine" them into a single response you can then send back as a source of data to your eventual "client" application.

Different languages have different approaches to this, but the general case is to look for an "asynchronous processing" API that allows you to do this in tandem.

My example purpose here uses node.js as the "asynchronous" side is basically "built in" and reasonably intuitive to follow. The "combination" side of things can be any type of "hash/map/dict" table implementation, just doing it the simple way for example only:

var async = require('async'),
    MongoClient = require('mongodb');

MongoClient.connect('mongodb://localhost/test',function(err,db) {

  var collection = db.collection('docs');

  async.parallel(
    [
      function(callback) {
        collection.aggregate(
          [
            { "$group": {
              "_id": "$age",
              "type": { "$first": { "$literal": "age" } },
              "count": { "$sum": 1 }
            }},
            { "$sort": { "_id": -1 } }
          ],
          callback
        );
      },
      function(callback) {
        collection.aggregate(
          [
            { "$group": {
              "_id": "$childs",
              "type": { "$first": { "$literal": "childs" } },
              "count": { "$sum": 1 }
            }},
            { "$sort": { "_id": -1 } }

          ],
          callback
        );
      }
    ],
    function(err,results) {
      if (err) throw err;
      var response = {};
      results.forEach(function(res) {
        res.forEach(function(doc) {
          if ( !response.hasOwnProperty(doc.type) )
            response[doc.type] = [];

          response[doc.type].push({
            "value": doc._id,
            "count": doc.count
          });
        });
      });

      console.log( JSON.stringify( response, null, 2 ) );
    }
  );
});

Which gives the cute result:

{
  "age": [
    {
      "value": "50",
      "count": 1
    },
    {
      "value": "40",
      "count": 3
    }
  ],
  "childs": [
    {
      "value": "2",
      "count": 3
    },
    {
      "value": "1",
      "count": 1
    }
  ]
}

So the key thing to note here is that the "separate" aggregation statements themselves are actually quite simple. The only thing you face is combining those in your final result. There are many approaches to "combining", particularly to deal with large results from each of the queries, but this is the basic example of the execution model.

Key points here.

Shuffling data in the aggregation pipeline is possible but not performant for large data sets.
Use a language implementation and API that support "parallel" and "asynchronous" execution so you can "load up" all or "most" of your operations at once.
The API should support some method of "combination" or otherwise allow a separate "stream" write to process each result set received into one.
Forget about the SQL way. The NoSQL way delegates the processing of such things as "joins" to your "data logic layer", which is what contains the code as shown here. It does it this way because it is scalable to very large datasets. It is rather the job of your "data logic" handling nodes in large applications to deliver this to the end API.

This is fast compared to any other form of "wrangling" I could possibly describe. Part of "NoSQL" thinking is to "Unlearn what you have learned" and look at things a different way. And if that way doesn't perform better, then stick with the SQL approach for storage and query.

That's why alternatives exist.

TechQA.

How to group by different fields

There are 2 answers

The Hard Way

A Better Way ( aka A New Hope )

Related Questions in MONGODB

Related Questions in AGGREGATION-FRAMEWORK

Popular Questions

Trending Questions