Cumulative distribution in MongoDB using MapReduce

279 views Asked by At

I have a collection of documents in mongodb and I want to compute the CDF for some of the attributes and return or store it in the db. Obviously adding a new attribute to each document isn't a good approach, and I'm fine with an approximation I can later use. This is more of a theoretical question.

So I went with computing a sampling of the CDF on discrete intervals with a mapreduce job, like this (just the algorithm):

  1. Get the count, min and max of attribute someAttr
  2. Suppose min = 5, max=70, count = 200.
  3. In map(): for (i=this.someAttr; i < max+1; i++) { emit(i, 1) }
  4. In reduce() just return the sum for each key.
  5. In finalize(), divide the reduced output by the record count: return val / count.

This does output a collection with samples from the CDF, however..

As you see the interval step here is 1, but the huge inefficiency in this approach is that there can be a monstrous amount of emitting even from a single document, even with just a handful of documents in the colletion, hence this is obviously not scalable and will not work.

The output looks like this:

{ _id: 5, val: 0}
{ _id: 6, val: 0.04}
{ _id: 7, val: 0.04}
...
{ _id: 71, val: 1.0}

From here I can easily get an approximated value of CDF for any of the values or even interpolate between them if that's reasonable.

Could someone give me an insight into how would you compute a (sample of) CDF with MapReduce (or perhaps without MapReduce)?

1

There are 1 answers

1
wdberkeley On BEST ANSWER

By definition, the cumulative distribution function F_a for an attribute a is defined by

F_a(x) = # documents with attribute value <= x / # of documents

So you can compute the CDF with

F_a(x) = db.collection.count({ "a" : { "lte" : x }) / db.collection.count({ "a" : { "$exists" : true } })

The count in the denominator assumes you don't want to count documents missing the a field. An index on a will make this fast.

You can use this to compute samples of the cdf or just compute the cdf on demand. There's no need for map-reduce.