Combine $lte and $gte in mongodb for random sample from unknown source doc size

12 views Asked by At

I have a variation of an old mongodb question - on the drawing of 10 sample docs from a collection of docs including a search condition.

Starting point, I have this collection:

var CategoriesSchema   = new Schema({

    referenceID: {type: Schema.Types.ObjectId, index: true },  
    name: {type: String, index: true},
    ...
    counter: Number

});

There may be hundreds, potential thousands of categories in the collection (i.e. different "names"), each with a different referenceID that represents some other data.

I want to draw a random sample of 10 matching docs with a certain name, e.g. 10 matching docs with the name "Technology" - if there are only say 7 such docs, I want to get them all, but if there are 400, I want a (pseudo)random sample of 10. Using the cookbook answer for random samples (Random record from MongoDB), I can use a random field (0 to 1) in the schema and use a compound index for "name" and "random" to get 10 random docs.

var CategoriesSchema   = new Schema({

   referenceID: {type: Schema.Types.ObjectId, index: true },  
   name: {type: String, index: true},
   random: Math.random(),
   ...
   counter: Number
});

So, as explained in the referenced Stack Overflow answer, I would have to combine an $lte and $gte query for a random number to find my 10 (pseudo)random records - in particular because if there are fewer than 10 records and I just use $lte, I will not get all of them back each time (because 2 might be above my random number used for $lte and 5 below). So essentially, I will always be querying for 20 records (10 above and 10 below), even though I just want 10 - or have to do first a $lte query and if it does not yield 10, have to do a follow on second $gte query. In addition to all this, the cookbook answer comes with all the restrictions for a pseudo-random samples (getting "clusters" of the same docs for similar random queries etc.).

My question is whether anyone has a better solution that is still efficient for my parameters.

My analysis is:

  1. $sample doesn't work efficiently due to my requirement to query for name as a condition (?)
  2. skip with limit is not recommended due to scaling problems (if all my categories only have say a max 100 docs, fine, I could use skip with limit to draw random docs, but what if a few categories have 10,000 docs?)
  3. geospatial queries are not suitable due to the compound index nature of my schema and requirements - i.e. like A, I would need geospatial queries with a condition which I understand do not work efficiently for large records (?)

So I should just go with the cookbook answer, include the random field and my compound index with name and random and query for $lte and $gte with 10 docs each? Or is there any alternative solution anyone has implemented?

0

There are 0 answers