Storing hash tables on mongodb with fast increment-by updates

417 views Asked by At

I'd like to store word histograms on mongodb very similarly to how a hash table is saved in memory. My input data comes from an ordinary relational database or a .csv file.

shape  | color | size
*************************
circle | blue  | 5 
square | red   | 3
circle | red   | 10

I'm generating a histogram for each batch of data as a hashmap of hashmaps:

{shape: {circle: 2, square: 1},
 color: {blue: 1, red: 2},
 size: {5: 1, 3: 1, 10: 1}}

I'd like to save the histogram in mongodb and once a new batch of data arrives, merge it with the current histogram.

Using Scala with Casbah, I've tried to store the histogram of each column as a document and each value (Circle, square, etc) as a key in the document.

{
    _id: "shape",
    circle: 2,
    square: 3
}

For the same data, inserts took 5 seconds and updates(upserts) took >300 seconds.

I've also tried to store the histogram of each column as a collection and each value as a document. Initially I generate a collection for each column and create a hashed index for _id, I use the $inc update command to increase the counter, I also combine all of the upsert queries with a bulk operation.

Document:

{
    _id: "circle",
    count: 2
}

Code to generate collections:

val db = mongoClient("test")
val column_names = List("shape", "color", "size")
column_names.foreach(column => 
     db(column).createIndex(MongoDBObject("_id" -> "hashed"))

Code to upsert data:

def upsertMongo(h: mutable.HashMap[String, mutable.HashMap[String, Long]]) = {
    val mongoClient = MongoClient("localhost", 27017)
    val db = mongoClient("test")
    h.foreach{case(column_name, histogram) =>
        val mongo_collection = db(column_name)
        val builder = mongo_collection.initializeUnorderedBulkOperation
        histogram.foreach{case(word, count) =>
            val query = MongoDBObject("_id" -> word)
            val update = $inc("count" -> count)
            builder.find(query).upsert().updateOne(update)
        }
        builder.execute()
    }
}

It runs much faster this way but still not fast enough, ~30 seconds upserts vs. 5 seconds inserts. How can I make the upserts work as fast as the inserts? Is it even possible? (Overall data is small enough to be cached completely)

0

There are 0 answers