Given billions of cookies (UUID-like strings), what is the best way to test the collision rate of, say, a 32-bit hash function like murmur3 on this sample?
First of all, it is hard to generate billions of unique strings, since they cannot all be kept in memory and no random string generator guarantees 100% uniqueness.
The only way I can think of is:
- generate the strings and use approximate data structures such as a Bloom filter or a cuckoo filter to discard possible duplicates, so that we end up with, say, exactly 5B unique UUIDs stored in a file (see the sketch after this list);
- iterate through the file, hash each string, and repeat step 1) with the hash codes while counting how many collisions there are.
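A rough sketch of the first step, assuming Guava's `BloomFilter` (the target count, false-positive rate, and output file name are placeholders):

```scala
import java.io.{BufferedWriter, FileWriter}
import java.nio.charset.StandardCharsets
import java.util.UUID
import com.google.common.hash.{BloomFilter, Funnels}

object GenerateUniqueUuids {
  def main(args: Array[String]): Unit = {
    val target = 100000000L   // placeholder; scale towards 5B as memory allows
    val bloom = BloomFilter.create[CharSequence](
      Funnels.stringFunnel(StandardCharsets.UTF_8), target, 1e-9)

    val out = new BufferedWriter(new FileWriter("uuids.txt"))
    var written = 0L
    while (written < target) {
      val s = UUID.randomUUID().toString
      // put() returns true only if the element was definitely not seen before;
      // on a possible duplicate (or false positive) we simply draw a new string.
      if (bloom.put(s)) {
        out.write(s)
        out.newLine()
        written += 1
      }
    }
    out.close()
  }
}
```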
Is there a better way of doing this? The approach above still has the drawback that there is a certain false-positive rate when testing the hash codes in step 2). The hash codes would have to be written to a file as well and checked manually in case of a possible false-positive hit.
The murmur3_32 collision rate is extremely high at these magnitudes ... just 100M unique UUIDs already give a collision rate of precisely 1.145577 %. Scala snippet:
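For n = 10^8 keys hashed into m = 2^32 buckets, the birthday bound predicts roughly n - m·(1 - e^(-n/m)) ≈ 1.15M colliding keys, i.e. about 1.15 %, so the measured rate is what one would expect from any 32-bit hash. Below is a minimal sketch of such a measurement, assuming Guava's `Hashing.murmur3_32()` and a 512 MB in-heap bit set covering all 2^32 hash values (requires a sufficiently large -Xmx; the 100M count is a placeholder):

```scala
import java.nio.charset.StandardCharsets
import java.util.UUID
import com.google.common.hash.Hashing

object Murmur32CollisionRate {
  def main(args: Array[String]): Unit = {
    val n = 100000000                        // 100M keys (placeholder)
    val murmur = Hashing.murmur3_32()

    // One bit per possible 32-bit hash value: 2^32 bits = 512 MB of heap.
    val seen = new Array[Long](1 << 26)      // 2^26 longs * 64 bits = 2^32 bits
    var collisions = 0L

    var i = 0
    while (i < n) {
      // Duplicates from randomUUID() are negligible at this scale,
      // so every repeated hash value is counted as a real collision.
      val s = UUID.randomUUID().toString
      val h = murmur.hashString(s, StandardCharsets.UTF_8).asInt()
      val idx = h.toLong & 0xFFFFFFFFL       // map the signed Int to [0, 2^32)
      val word = (idx >>> 6).toInt
      val mask = 1L << (idx & 63)
      if ((seen(word) & mask) != 0L) collisions += 1
      else seen(word) |= mask
      i += 1
    }

    println(f"collisions: $collisions, rate: ${100.0 * collisions / n}%.6f %%")
  }
}
```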