what are the options for obtaining k pair-wise independent hash functions that are fast

715 views Asked by At

I ran into the needs of k pair-wise independent hash functions, each takes as input an integer, and produces a hash value in the range of 0-N. Need this for count-min sketch, which is similar to Bloom filter.

Formally, I need h_1,h_2, ..., h_k hash functions, pair-wise independent.

(h_i(n) mod N ) will give the hash value of n in the range of 0-N. The hashing needs to be time-efficient as I am working with a large set of data. At the same time, they should be as pair-wise independent as possible.

What I have tried so far:

1) xxhash: It is efficient, but it is not good in terms of pair-wise independent, meaning there are hash collisions between hash functions (meaning h1(n1)=h1(n2) then some h_k(n1) also = h_k(n2)) and the result i got was bad due to this.

2) Similarly, the famous integer hashing method ((a*n+b) mod p) mod N also has the same problem as xxhash. I believe this is called Universal hashing

3) The other one introduced in count-min-sketch produces quite good results, but takes too much time for a large input.

4) Also tried Murmur3, sha1 with similar problems in collisions.

Any idea would be greatly appreciated. C/C++ preferred, but Java would also be fine, or simply algorithm. Thanks

1

There are 1 answers

0
Yuriy On

I suspect that your problem with method 2 is that you tossed a_i and b_i that were correlated.
Work in large field (somewhere near 2^64) and for starters make sure all a_i and b_i are different (i.e., you get 2*k different numbers). If they are uniformly distributed inside the field this also wouldn't hurt :)

You might encountered the same problem in method 4 with SHA. Most cryptographic hash functions (including even the broken and older ones) are much more than enough for data structures needs, be it k-wise independence for any reasonable k, or almost any other property.
I would recheck - how you used it?