I'm wondering whether it would be a good idea to use hashes (CityHash, Murmur and the like) as keys in a key-value store like Hazelcast. I'm expecting to have about 2,000,000,000 records (URLs) in the database, so collisions could happen. It wouldn't be super critical to lose some data through hash collisions, but of course it would be best to avoid them.
A record contains the URL, a timestamp, and a status code. The main operations are inserting records and looking up whether a URL already exists.
So, what would you suggest, given speed is relevant:
- using an ID generator, or
- using a hash algorithm like CityHash or Murmur, or
- using the relevant string, a URL in this case, itself?
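Hazelcast does not rely on the hashCode/equals methods of the key object. Instead, it serializes the key and uses the MurmurHash of that binary representation. The hash only decides where an entry is stored; two keys count as the same entry only if their serialized bytes are identical, so two different URLs that happen to produce the same hash are still kept as separate entries.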
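To make that concrete, here is a toy model in plain Java. It is not Hazelcast code: the class `BinaryKeyStore` and its methods are made up for illustration, `Arrays.hashCode` stands in for MurmurHash, and UTF-8 bytes stand in for Hazelcast's serialized key. It only shows the principle that the hash picks a partition while the raw bytes decide key equality.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only, NOT Hazelcast's implementation.
final class BinaryKeyStore {
    // One map per partition; keys are compared by their byte content.
    private final List<Map<ByteBuffer, String>> partitions = new ArrayList<>();

    BinaryKeyStore(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new HashMap<>());
        }
    }

    // Stand-in hash (assumption): Hazelcast would use MurmurHash here.
    static int hashOf(byte[] keyBytes) {
        return Arrays.hashCode(keyBytes);
    }

    // The hash is used only to choose a partition.
    private Map<ByteBuffer, String> partitionFor(byte[] keyBytes) {
        int index = Math.floorMod(hashOf(keyBytes), partitions.size());
        return partitions.get(index);
    }

    void put(String url, String record) {
        byte[] keyBytes = url.getBytes(StandardCharsets.UTF_8);
        // ByteBuffer.equals compares content, so equality is byte-for-byte.
        partitionFor(keyBytes).put(ByteBuffer.wrap(keyBytes), record);
    }

    String get(String url) {
        byte[] keyBytes = url.getBytes(StandardCharsets.UTF_8);
        return partitionFor(keyBytes).get(ByteBuffer.wrap(keyBytes));
    }

    int size() {
        int total = 0;
        for (Map<ByteBuffer, String> partition : partitions) {
            total += partition.size();
        }
        return total;
    }

    public static void main(String[] args) {
        // These two strings collide under the stand-in hash.
        String first = "http://example.com/Aa";
        String second = "http://example.com/BB";

        BinaryKeyStore store = new BinaryKeyStore(271);
        store.put(first, "status=200");
        store.put(second, "status=404");

        boolean sameHash = hashOf(first.getBytes(StandardCharsets.UTF_8))
                == hashOf(second.getBytes(StandardCharsets.UTF_8));
        System.out.println("same hash: " + sameHash);
        System.out.println("entries stored: " + store.size());
        System.out.println(first + " -> " + store.get(first));
        System.out.println(second + " -> " + store.get(second));
    }
}
```

Both URLs land in the same partition because their hashes are equal, yet neither overwrites the other, since the lookup inside the partition compares the full key bytes.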
In short, if you use the URL itself as the key, you should not really worry about hash collisions.
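For a sense of scale, if you did store a hash of the URL as the key yourself, the collision risk for 2,000,000,000 records can be estimated with the usual birthday approximation. The sketch below assumes the hash behaves like a uniform random function, which real hashes only approximate; the class and method names are hypothetical.

```java
// Rough birthday-bound estimate; assumes a uniformly distributed hash.
final class CollisionEstimate {
    // Expected number of colliding key pairs: n(n-1)/2 divided by 2^bits.
    static double expectedCollidingPairs(long n, int bits) {
        double pairs = (double) n * (n - 1) / 2.0;
        return pairs / Math.pow(2.0, bits);
    }

    // Approximate probability of at least one collision: 1 - e^(-expected pairs).
    static double probabilityOfAnyCollision(long n, int bits) {
        // -expm1(-x) equals 1 - e^(-x) and stays accurate for very small x.
        return -Math.expm1(-expectedCollidingPairs(n, bits));
    }

    public static void main(String[] args) {
        long records = 2_000_000_000L;
        for (int bits : new int[] {32, 64, 128}) {
            System.out.printf(
                "%3d-bit hash: expected colliding pairs = %.3e, P(at least one) = %.3e%n",
                bits,
                expectedCollidingPairs(records, bits),
                probabilityOfAnyCollision(records, bits));
        }
    }
}
```

Under that assumption, a 32-bit hash would give several hundred million colliding pairs, a 64-bit hash about 0.11 expected colliding pairs (roughly a 10% chance of at least one collision), and a 128-bit hash a negligible risk.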