Optimize duplicate values in NoSql key-value storage


I am building a map tile store and need to store 1.5 billion blobs of about 3KB each. Over 95% of them are duplicates. Is there a NoSQL storage engine that would avoid storing identical values?

I could of course implement double de-referencing, e.g. key->hash->value. If the hash is MD5, the 16-byte hashes would use up 24GB (1.5 billion x 16 bytes) for the hashes alone, plus the per-item overhead, which is probably much more. Is there anything more efficient?

Thanks!

Answer by Zim-Zam O'Pootertoot (best answer)

Double de-referencing is the way to go. The full data set is about 4.5TB (1.5 billion x 3KB), so with over 95% duplicates you'd save more than 4TB by not storing duplicate data, and storing a 24GB set of hashes is well worth the cost. Additionally, you only need to compute the hash function on inserts and updates, not on lookups or deletions.
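As a rough illustration of the key->hash->blob layout, here is a minimal in-memory sketch. The class name `DedupStore`, the three dicts standing in for database tables, and the reference count used to drop unreferenced blobs are all assumptions made for the example; they are not part of the question or of any particular storage engine.

```python
import hashlib


class DedupStore:
    """Toy key -> hash -> blob store; dicts stand in for real database tables."""

    def __init__(self):
        self.key_to_hash = {}   # tile key -> 16-byte MD5 digest
        self.hash_to_blob = {}  # digest -> blob, stored once per distinct value
        self.refcount = {}      # digest -> number of keys pointing at it (assumption)

    def put(self, key, blob):
        # Hashing happens only here, on insert/update.
        digest = hashlib.md5(blob).digest()
        old = self.key_to_hash.get(key)
        if old == digest:
            return  # same value already stored under this key
        if old is not None:
            self._release(old)
        self.key_to_hash[key] = digest
        if digest not in self.hash_to_blob:
            self.hash_to_blob[digest] = blob
        self.refcount[digest] = self.refcount.get(digest, 0) + 1

    def get(self, key):
        # Two lookups, no hashing.
        return self.hash_to_blob[self.key_to_hash[key]]

    def delete(self, key):
        # No hashing needed: the digest is already stored next to the key.
        self._release(self.key_to_hash.pop(key))

    def _release(self, digest):
        self.refcount[digest] -= 1
        if self.refcount[digest] == 0:
            del self.refcount[digest]
            del self.hash_to_blob[digest]


store = DedupStore()
store.put("z0/x0/y0", b"ocean tile")
store.put("z0/x1/y0", b"ocean tile")  # duplicate value, stored only once
print(len(store.key_to_hash), len(store.hash_to_blob))
```

In a real deployment the reference count would need to be updated atomically with the key mapping, or replaced by a periodic garbage-collection pass over unreferenced hashes.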

To reduce the cost of double de-referencing on lookups, you can supplement your on-disk key-value database with an in-memory cache, e.g. Redis. You can either cache frequently accessed key->hash pairs to avoid two lookups on the primary database, or else store the entire key->hash->blob structure in the cache. The former is much simpler to implement because you don't need to replicate the double de-referencing from the primary database, whereas the latter makes more sense if only a small subset of the blobs is ever active.
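The first caching option (key->hash pairs only) could look roughly like the sketch below. It uses a small in-process LRU cache rather than Redis so that it runs on its own; `KeyHashCache`, `lookup_hash`, and `lookup_blob` are hypothetical names, with the two callables standing in for queries against the primary database.

```python
from collections import OrderedDict


class KeyHashCache:
    """Small LRU cache of key -> hash pairs in front of a slower primary store."""

    def __init__(self, lookup_hash, lookup_blob, capacity=1024):
        self.lookup_hash = lookup_hash  # hypothetical: key -> digest in primary DB
        self.lookup_blob = lookup_blob  # hypothetical: digest -> blob in primary DB
        self.capacity = capacity
        self.cache = OrderedDict()      # key -> digest, least recently used first

    def get(self, key):
        digest = self.cache.get(key)
        if digest is None:
            # Cache miss: pay for the first primary lookup and remember the result.
            digest = self.lookup_hash(key)
            self.cache[key] = digest
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict least recently used
        else:
            self.cache.move_to_end(key)  # mark as recently used
        # The blob itself always comes from the primary store.
        return self.lookup_blob(digest)

    def invalidate(self, key):
        # Call on update/delete so a stale hash is never served.
        self.cache.pop(key, None)


# Usage with plain dicts standing in for the primary database tables.
key_to_hash = {"z0/x0/y0": "h1", "z0/x1/y0": "h1"}
hash_to_blob = {"h1": b"ocean tile"}
cache = KeyHashCache(key_to_hash.__getitem__, hash_to_blob.__getitem__)
print(cache.get("z0/x0/y0"))
```

Because only the small key->hash pairs are cached, a hit still costs one primary lookup for the blob, but the cache holds far more entries per byte of memory than it would if it held the blobs themselves.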

You may be able to use a simpler/smaller hash. The probability of at least one hash collision is approximately 1 - e^(-k^2 / 2N), where k is the number of distinct values being hashed and N is the number of possible hash values (2^64 or 2^128 here). Taking k = 1.5 billion as a worst case, a good 64-bit hash has roughly a 6% chance of having a collision, and a good 128-bit hash has an infinitesimal chance. Since identical blobs share a hash by design, only the distinct blobs count towards k, so the real figure for your data should be considerably lower.

MurmurHash has 64- and 128-bit versions, so you can experiment between the two. It's faster than MD5, largely because MD5 is a cryptographic hash function whereas Murmur doesn't have the added expense/complexity of being cryptographically secure (I'm assuming that you're not concerned about anybody attempting to intentionally generate hash collisions or anything like that).

Some key-value stores also make it relatively easy to make your design collision-tolerant. For example, you could store the hash in a Riak Map with a flag indicating whether there have been any collisions on that hash value. If the flag is false, simply return the blob; otherwise fall back on a slower collision-handling path (e.g. the indexed blob becomes the two colliding blobs zipped/tarred together along with a CSV of which keys correspond to which blob). Even with a 64-bit hash this code path will not be exercised very often, so implementation simplicity likely trumps performance. The question is whether the reduced memory/hashing overhead makes up for the complexity of collision tolerance.
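To check the collision estimate for your own numbers, the formula 1 - e^(-k^2 / 2N) can be evaluated directly. In this small sketch the function name `collision_probability` is made up for the example, and the 75 million figure is an assumption derived from "over 95% duplicates" of 1.5 billion tiles (i.e. at most 5% distinct).

```python
import math


def collision_probability(k, bits):
    """Approximate chance of at least one collision among k distinct values
    hashed uniformly into 2**bits buckets: 1 - e^(-k^2 / 2N)."""
    n = 2.0 ** bits
    # -expm1(-x) equals 1 - e^(-x) but stays accurate when x is tiny.
    return -math.expm1(-(k * k) / (2.0 * n))


total_tiles = 1_500_000_000
distinct_blobs = 75_000_000  # assumption: at most 5% of the tiles are distinct

for bits in (64, 128):
    for k in (total_tiles, distinct_blobs):
        print(f"{bits}-bit hash, k={k:,}: {collision_probability(k, bits):.3g}")
```

Running it with k set to the number of distinct blobs rather than the total tile count gives the more relevant figure, since duplicates are supposed to hash to the same value.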