server side set intersection in mongodb

1.9k views Asked by At

In an application I am working on, a requirement is to do massive set intersection, to the tune of 10-1,000,000 items or so. The items that we are intersecting are simply ObjectId's.

So for instance there is a boxes document and inside the boxes document there is an item_ids Array. This item_ids array for each box holds 10-1,000,000 ObjectId's.

The end goal here is to say, given box A with ObjectId 4d3dc3898951498107000005, and box B with ObjectId 4d3dc3898951498107000002, which item_ids do they have in common?

Here is how im doing it:

db.boxes.distinct("item_ids", {'_id' : {$in : [ObjectId("4d3dc3898951498107000005"), ObjectId("4d3dc3898951498107000002")]}})

Firstly just curious if this seems like a sane approach. In my research so far it seems like map reduce is a common suggestion for large intersections, but that it is not recommended for realtime queries.

Secondly, curious how this would behave in a sharded environment? Will mongos run a chunk of the query on the mongod's it needs to and aggregate my result magically?

Lastly, if the above is sane, is it also sane to do:

db.items.find({'_id' : { $in : db.eval(function() {return db.boxes.distinct("item_ids", {_id:{$in:[ObjectId("4d3dc3898951498107000005"), ObjectId("4d3dc3898951498107000002")]}}); }) }}) 

Which would basically be finding which items both box A and box B have in common, and then materializing them into objects all in one server side query. This appears to also work with .limit and .skip to effectively implement a paging of the data set.

Anyhow, any feedback is valuable, thanks!

1

There are 1 answers

2
mstearn On BEST ANSWER

I think you may want to reconsider your schema. If you have 1,000,000 ObjectIDs in an array at 12 bytes each that is 12MB not even counting the BSON overhead which can be significant for large arrays* (probably another 8MB or so). In 1.8 we are raising the max document size from 4MB to 16MB, but even that won't be enough for the objects you are looking to store.

*For historical reasons we store the stingified index for each element in the array which is fine when you have <100 elements, but adds up when you need 6 or 7 digits.