In case you are not familiar with universal hashing, it's mainly an attempt to guarantee a low number of collisions (as opposed, say with using plain old modulo), using some rather simple math involving randomness. The problem is it doesn't work for me:
size_t hash_modulo(const int value) {
return (size_t) (value % TABLE_SIZE);
}
// prime 491 is used because its > 128, which is the size of the hash table
size_t hash_universal(const int value) {
const size_t a = (size_t) (rand() % 491 + 1);
const size_t b = (size_t) (rand() % 491);
//printf("a: %zu, b:%zu\n", a, b);
return ((a * value + b) % 491) % TABLE_SIZE;
}
I test modulo hashing first and determine the longest chain length (chain length means a hash bucket size):
size_t get_max_chain_length(int input[TABLE_SIZE], size_t (*hash_function)(const int)) {
HashTable *hash_table = hash_table_create(hash_function);
if (!hash_table) {
return 0;
}
for (size_t i = 0; i < TABLE_SIZE; ++i) {
hash_table_add(hash_table, input[i]);
}
size_t maximum_chain_length = 0;
for (int j = 0; j < TABLE_SIZE; ++j) {
const size_t length = length_of_(hash_table->rows[j]);
maximum_chain_length = (length > maximum_chain_length) ? length : maximum_chain_length;
}
//hash_table_print(hash_table);
hash_table_destroy(hash_table);
return maximum_chain_length;
}
I pick one of the inputs which led to a really big chain (id est one which performs bad using plain modulo) and throw this one against universal hashing. Universal hashing uses randomness so I can take a constant input and still get varying results.
And here comes the problem. I try 100 random input arrays of size 128 each and calculate the average longest chain and the total longest chain, but both algorithms perform similar.
You can check my main in my repo.
My question is: Is that result to be expected? Does universal hashing perform not any better with input which already performed poor using modulo? Or did I just screw up my implementation (more likely).
Thanks a lot in advance!
Well, why do you think modulo is bad? If the input is random and sufficiently large, the modulo should yield a uniformly distributed result. Uniform hashing (as your link states) provides protection against non-random (i.e., malicious) input, which isn't the case here.