I have a collection of documents in MongoDB, and I want to compute the CDF for some of the attributes and return it or store it in the db. Obviously, adding a new attribute to each document isn't a good approach, and I'm fine with an approximation I can use later. This is more of a theoretical question.
So I went with computing a sampling of the CDF on discrete intervals with a map-reduce job, like this (just the algorithm; a full sketch follows the list):
- Get the `count`, `min` and `max` of attribute `someAttr`.
- Suppose `min = 5`, `max = 70`, `count = 200`.
- In `map()`: `for (i = this.someAttr; i < max + 1; i++) { emit(i, 1) }`
- In `reduce()`, just return the sum for each key.
- In `finalize()`, divide the reduced output by the record count: `return val / count`.
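Spelled out as an actual `mapReduce` call, my job looks roughly like this (a sketch; the collection name `docs`, the output name `someAttr_cdf`, and the hard-coded `max`/`count` values are placeholders taken from the example numbers):

```js
var max = 70, count = 200; // in practice, obtained from a prior query/aggregation on someAttr

db.docs.mapReduce(
  function () {
    // Emit one key per grid point from this document's value up to max,
    // so each key k ends up counting documents with someAttr <= k.
    for (var i = this.someAttr; i < max + 1; i++) {
      emit(i, 1);
    }
  },
  function (key, values) {
    return Array.sum(values); // number of documents with someAttr <= key
  },
  {
    scope: { max: max, count: count },  // make max/count visible to map and finalize
    finalize: function (key, val) {
      return val / count;               // normalize to a CDF value in [0, 1]
    },
    out: { replace: "someAttr_cdf" }
  }
);
```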
This does output a collection with samples from the CDF; however, as you can see, the interval step here is 1, and the huge inefficiency in this approach is that a single document can produce a monstrous number of emits (one per grid point at or above its value). Even with just a handful of documents in the collection, this is obviously not scalable and will not work.
The output looks like this:
{ _id: 5, val: 0}
{ _id: 6, val: 0.04}
{ _id: 7, val: 0.04}
...
{ _id: 70, val: 1.0}
From here I can easily get an approximate value of the CDF for any given value, or even interpolate between the samples if that's reasonable.
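For example, a linear interpolation between the two nearest samples could look like the sketch below (the helper name `cdfAt` is hypothetical; note that `mapReduce` stores the reduced result in a field named `value`, shown as `val` in the excerpt above):

```js
// Approximate F(x) by linear interpolation between the two nearest grid samples,
// assuming the output collection someAttr_cdf from the job above.
function cdfAt(x) {
  var loCur = db.someAttr_cdf.find({ _id: { $lte: x } }).sort({ _id: -1 }).limit(1);
  var hiCur = db.someAttr_cdf.find({ _id: { $gt: x } }).sort({ _id: 1 }).limit(1);
  if (!loCur.hasNext()) return 0;           // x is below the first sample
  var lo = loCur.next();
  if (!hiCur.hasNext()) return lo.value;    // x is at or beyond the last sample
  var hi = hiCur.next();
  var t = (x - lo._id) / (hi._id - lo._id); // fractional position between the samples
  return lo.value + t * (hi.value - lo.value);
}
```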
Could someone give me an insight into how you would compute a (sample of the) CDF with MapReduce (or perhaps without MapReduce)?
By definition, the cumulative distribution function `F_a` for an attribute `a` is defined by

    F_a(x) = P(a <= x) = count(documents with a <= x) / count(documents with an a field)

So you can compute the CDF with a pair of count queries, one for the numerator and one for the denominator.
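A minimal mongo-shell sketch of that, assuming a collection named `docs`:

```js
// F_a(x): fraction of documents with a <= x, among documents that have an a field.
function cdf(x) {
  return db.docs.count({ a: { $lte: x } }) /      // numerator: documents with a <= x
         db.docs.count({ a: { $exists: true } }); // denominator: documents that have a
}
```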
The count in the denominator assumes you don't want to count documents missing the `a` field. An index on `a` will make this fast.

You can use this to compute samples of the CDF or just compute the CDF on demand. There's no need for map-reduce.
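For instance, to materialize the same grid of samples the map-reduce version produced (the grid bounds 5..70 are taken from the example above):

```js
// Store one CDF sample per grid point, mirroring the map-reduce output shape.
for (var x = 5; x <= 70; x++) {
  db.someAttr_cdf.save({ _id: x, value: cdf(x) });
}
```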