Universal hashing performs worse than modulo hashing, what is wrong?

304 views Asked by At

In case you are not familiar with universal hashing, it's mainly an attempt to guarantee a low number of collisions (as opposed, say with using plain old modulo), using some rather simple math involving randomness. The problem is it doesn't work for me:

size_t hash_modulo(const int value) {
    return (size_t) (value % TABLE_SIZE);
}

// prime 491 is used because its > 128, which is the size of the hash table
size_t hash_universal(const int value) {
    const size_t a = (size_t) (rand() % 491 + 1);
    const size_t b = (size_t) (rand() % 491);
    //printf("a: %zu, b:%zu\n", a, b);
    return ((a * value + b) % 491) % TABLE_SIZE;
}

I test modulo hashing first and determine the longest chain length (chain length means a hash bucket size):

size_t get_max_chain_length(int input[TABLE_SIZE], size_t (*hash_function)(const int)) {
    HashTable *hash_table = hash_table_create(hash_function);
    if (!hash_table) {
        return 0;
    }

    for (size_t i = 0; i < TABLE_SIZE; ++i) {
        hash_table_add(hash_table, input[i]);
    }

    size_t maximum_chain_length = 0;
    for (int j = 0; j < TABLE_SIZE; ++j) {
        const size_t length = length_of_(hash_table->rows[j]);
        maximum_chain_length = (length > maximum_chain_length) ? length : maximum_chain_length;
    }

    //hash_table_print(hash_table);
    hash_table_destroy(hash_table);

    return maximum_chain_length;
}

I pick one of the inputs which led to a really big chain (id est one which performs bad using plain modulo) and throw this one against universal hashing. Universal hashing uses randomness so I can take a constant input and still get varying results.

And here comes the problem. I try 100 random input arrays of size 128 each and calculate the average longest chain and the total longest chain, but both algorithms perform similar.

You can check my main in my repo.

My question is: Is that result to be expected? Does universal hashing perform not any better with input which already performed poor using modulo? Or did I just screw up my implementation (more likely).

Thanks a lot in advance!

2

There are 2 answers

3
SomeWittyUsername On

Well, why do you think modulo is bad? If the input is random and sufficiently large, the modulo should yield a uniformly distributed result. Uniform hashing (as your link states) provides protection against non-random (i.e., malicious) input, which isn't the case here.

0
Armali On

In case you are not familiar with universal hashing, it's mainly an attempt to guarantee a low number of collisions …

An "attempt to guarantee" is no guarantee.

… (as opposed, say with using plain old modulo)…

The linked article says that even simpler hash functions [using modulo] are approximately universal.

I generate random input and chose one which leads to a big bucket using modulo. Then I use exactly the same input with universal to check if the result improves. And according to the specification, the maximum chain length should at least reduce a bit.

100 random input arrays of size 128 each aren't a particularly large input. I ran the program from your repo eight times; in five of those runs, your universal hashing reduced the "Average maximum chain length" by around 10 %; in three runs, your universal hashing increased the "Average maximum chain length" by a similar amount. I notice that the maximum chain length with universal hashing is constant within each run.

To sum up, there's no guarantee that one hash method is always better than another, and your universal hashing seems to keep its performance promise yet by being better more often than not.