I'd like to store word histograms in MongoDB, much like a hash table is kept in memory. My input data comes from an ordinary relational database or a .csv file.
shape  | color | size
----------------------
circle | blue  | 5
square | red   | 3
circle | red   | 10
I'm generating a histogram for each batch of data as a hashmap of hashmaps:
{shape: {circle: 2, square: 1},
color: {blue: 1, red: 2},
size: {5: 1, 3: 1, 10: 1}}
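Roughly, the histogram is built like this (a simplified sketch; assume each row arrives as a Map from column name to cell value, which is not necessarily how my real input looks):

import scala.collection.mutable

def buildHistogram(rows: Seq[Map[String, String]]) = {
  val hist = mutable.HashMap.empty[String, mutable.HashMap[String, Long]]
  for (row <- rows; (column, value) <- row) {
    // count occurrences of each value per column
    val counts = hist.getOrElseUpdate(column, mutable.HashMap.empty[String, Long])
    counts(value) = counts.getOrElse(value, 0L) + 1
  }
  hist
}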
I'd like to save the histogram in mongodb and once a new batch of data arrives, merge it with the current histogram.
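Merging just means adding the counts per key; an in-memory version of what I want MongoDB to do for me would be something like:

import scala.collection.mutable

// add every (column, value, count) from the new batch into the running totals
def mergeHistograms(total: mutable.HashMap[String, mutable.HashMap[String, Long]],
                    batch: mutable.HashMap[String, mutable.HashMap[String, Long]]): Unit =
  for ((column, counts) <- batch; (value, n) <- counts) {
    val t = total.getOrElseUpdate(column, mutable.HashMap.empty[String, Long])
    t(value) = t.getOrElse(value, 0L) + n
  }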
Using Scala with Casbah, I've tried storing the histogram of each column as a single document, with each value (circle, square, etc.) as a key in the document.
{
  _id: "shape",
  circle: 2,
  square: 1
}
For the same data, inserts took 5 seconds while updates (upserts) took more than 300 seconds.
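(For reference, the upsert for that layout looked roughly like the sketch below; the collection name "histograms" and the helper name are placeholders, not my exact code.)

import com.mongodb.casbah.Imports._

val coll = MongoClient("localhost", 27017)("test")("histograms")

// one upsert per column: every distinct value becomes a field that gets $inc'd
def upsertColumnDocument(column: String, counts: Map[String, Long]): Unit = {
  val update = $inc(counts.toSeq: _*) // e.g. {$inc: {circle: 2, square: 1}}
  coll.update(MongoDBObject("_id" -> column), update, upsert = true)
}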
I've also tried storing the histogram of each column as a collection, with each value as its own document.
I first create a collection per column with a hashed index on _id, then use the $inc update operator to
increase the counters, combining all of the upserts into a single unordered bulk operation.
Document:
{
  _id: "circle",
  count: 2
}
Code to generate collections:
val db = mongoClient("test")
val column_names = List("shape", "color", "size")
column_names.foreach(column =>
db(column).createIndex(MongoDBObject("_id" -> "hashed"))
Code to upsert data:
import com.mongodb.casbah.Imports._
import scala.collection.mutable

def upsertMongo(h: mutable.HashMap[String, mutable.HashMap[String, Long]]) = {
  val mongoClient = MongoClient("localhost", 27017)
  val db = mongoClient("test")
  h.foreach { case (column_name, histogram) =>
    val mongo_collection = db(column_name)
    // one unordered bulk operation per column collection
    val builder = mongo_collection.initializeUnorderedBulkOperation
    histogram.foreach { case (word, count) =>
      // upsert the document for this value and increment its counter
      val query = MongoDBObject("_id" -> word)
      val update = $inc("count" -> count)
      builder.find(query).upsert().updateOne(update)
    }
    builder.execute()
  }
}
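For completeness, the call for the sample data above looks roughly like this (the size keys end up stored as strings):

val h = mutable.HashMap(
  "shape" -> mutable.HashMap("circle" -> 2L, "square" -> 1L),
  "color" -> mutable.HashMap("blue" -> 1L, "red" -> 2L),
  "size"  -> mutable.HashMap("5" -> 1L, "3" -> 1L, "10" -> 1L)
)
upsertMongo(h)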
It runs much faster this way, but still not fast enough: ~30 seconds for upserts vs. 5 seconds for inserts. How can I make the upserts as fast as the inserts? Is that even possible? (The overall data is small enough to be cached completely.)