What hash function should I use for a bloom-filter with +128-bit keys?

3.2k views Asked by At

https://github.com/joeyrobert/bloomfilter uses Random class for the hash function which is a performance killer.
What I'm trying to do is input the class with byte[]s instead of a generic argument(T) and get rid of

    private int Hash(T item) {
        return item.GetHashCode();
    }

I know there is a huge performance benefit but I have no idea how to replace _random.Next(_bitSize) here:

#region Public Methods
/// <summary>
/// Adds an item to the bloom filter.
/// </summary>
/// <param name="item">Item to be added</param>
public void Add(T item)
{
    _random = new Random(Hash(item));

    for (int i = 0; i < _numberOfHashes; i++)
        _bitArray[_random.Next(_bitSize)] = true;
}

With some non-retarded line of code that doesn't take thousands of CPU cycles for every single bit.

I know there are lots of other problems with the code that can make it faster/safer.I have them(mostly) fixed and just got stuck on the last one before pushing my changes.
Any help is really appreciated.

2

There are 2 answers

3
atlaste On BEST ANSWER

I lack to see why you would want to use a random number generator here... however, I can help you to speed things up.

A bloom filter is basically a bitvector, where you set bits. If you want to figure out if an item exists, the bloom filter will give you a true if the item possibly exists and a false if the item for sure doesn't exist.

(I'm doing this in a simple text editor so there might be some bugs in the code)

I'm going to assume your hash space here can use 32-bit integer calculations; if you have a very large bloom table, you probably want to use a 64-bit integer.

The easiest (and probably the fastest) implementation of a bloom filter is thus:

byte[] bloomFilter = new byte[MyBloomFilterSize];

foreach (var item in myItems) 
{
    int hash = Hash(item) & 0x7FFFFFFF;
    int bit = 1 << (hash & 7); // you have 8 bits
    int index = (hash >> 3) % MyBloomFilterSize;
    bloomFilter[hash % MyBloomFilterSize] |= bit;
}

You can experiment with changing the byte[] to a uint[] or a ulong[]; I'm not sure if that makes a difference.

If you then want to check if an item might exist, you calculate the same index and bit, and get the result.

public bool PossiblyExists(MyItem item)
{
    int hash = Hash(item) & 0x7FFFFFFF;

    int bit = 1 << (hash & 7); // you have 8 bits
    int index = (hash >> 3) % MyBloomFilterSize;
    return (bloomFilter[hash % MyBloomFilterSize] & bit) != 0;
}

The only thing that remains here is the speed at which you can calculate the hash. If you're using an integer, I'd simply multiply it with a large prime number; if you're using a SHA256 fixed-length byte[] (which you seem to be doing), you need to make it an integer (or long).

I'm using a little trick here with Buffer.BlockCopy to convert the types. Just for safety I prefer using a few bytes from the data, but since SHA256 should already be random, a simple BitConverter.ToInt32(data, [0..28]) should also do the trick.

public int CalculateHash(byte[] data) 
{
    // Data = >128 bits = >16 bytes -- which is the same as >4 integers

    int[] tmp = new int[4];
    Buffer.BlockCopy(data, 0, tmp, 0, data.Length);
    return tmp[0] ^ tmp[1] ^ tmp[2] ^ tmp[3];
}

That should do it.

0
Thomas Mueller On

An efficient implementation would be for example the following. If you have a hash function that returns 64 bit, then better use this instead of murmur3_64. Warning: I didn't test it.

void Add(string item) {
    ulong hash = murmur3_64((ulong) item.GetHashCode());
    uint a = (uint) (hash >> 32);
    uint b = (uint) hash;
    for (int i = 0; i < k; i++) {
        _bitArray[reduce(a, _bitSize)] = true;
        // "Less Hashing, Same Performance: Building a Better Bloom Filter"
        a += b;
    }
}

ulong murmur3_64(ulong x) {
    x = (x ^ (x >> 33)) * 0xff51afd7ed558ccdL;
    x = (x ^ (x >> 23)) * 0xc4ceb9fe1a85ec53L;
    x = x ^ (x >> 33);
    return x;
}

uint reduce(uint hash, uint n) {
    // http://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/
    return (hash * n) >> 32;
}