Hash Table and Collision Calculation result

495 views Asked by At

I know I don't have a working code/minimal, but I am not asking for more help in the code. I am going to try to summarize as much as possible. The test runs for 1000 times trying to insert 50 persons into the table. The trial randomly generates keys based on getRandomPersonalNumber.

The linear probing function returns if there are any collisions, updates the index if needed, and searches if the keys match the index. Now in the result, the only thing that seems weird is Table 1 I have asked some friends about the result, and they said that perhaps modulo 100 is doing something and that why I am getting the high collusion result in Table 1.

I am concerned that it is something perhaps with my calculations, but then again, it is only happening with 100 modulo, so I don't know can I calculate precisely the amount of collision without relying on the code some math perhaps? Lastly, is there a way to calculate a good middle ground for the amount of storage vs the amount of collusion(load factor)?

typedef struct
{
    struct Bucket *table; 

} HashTable;

static int hash(Key key, int tablesize)
{
    return (key % tablesize);
}

static int addPersonsToTable(HashTable *htable, const Person *personArray, int amount)
{
    int collissions = 0, i;
    for (i = 0; i < amount; i++)
    {
        int key = personArray[i].personalNumber;
        collissions += insertElement(htable, key, personArray[i]);
    }
    return collissions;
}

static int getRandomPersonalNumber(void)
{
    int day = rand() % 30 + 1; 
    int month = rand() % 12 + 1;
    int year = rand() % 60 + 40;
    return day + 100 * month + 10000 * year;
}

int insertElement(HashTable *htable, const Key key, const Value value)
{
    int coll;
    int index = hash(key, htable->size);
    coll = linearProbe(htable, key, &index);
    if (coll ==0 || index > -1)
    {
        htable->table[index].key = key;
        htable->table[index].value = value;
    }
    else
    {

    }

    return coll;
}

Table test.

-- Summary ----------------------
Average collisions on 1000 runs. Inserted 50 persons.
Table 1 (size 100) - average number of collisions: 516 - load factor: 0.50
Table 2 (size 150) - average number of collisions: 26 - load factor: 0.33
Table 3 (size 200) - average number of collisions: 68 - load factor: 0.25
Table 4 (size 250) - average number of collisions: 12 - load factor: 0.20
Table 5 (size 300) - average number of collisions: 26 - load factor: 0.17
Table 6 (size 350) - average number of collisions: 7 - load factor: 0.14
Table 7 (size 400) - average number of collisions: 16 - load factor: 0.13
Table 8 (size 450) - average number of collisions: 5 - load factor: 0.11
----------------------------------
1

There are 1 answers

3
dominicm00 On

This is a natural result of how hashtables work; even if your hashes are highly unique, when mapped into a small range of possible values, there will be a substantial number of collisions. Assuming your hash function is perfectly random, the expected load factor is usedSpace/availableSpace.

That said, you seem to be under the mistaken impression that your .5 load factor is inefficient. This load factor is perfectly fine; Java, for example, has a default load factor of .75! Linear searches on very few elements are highly efficient, so there isn't a real performance penalty as long as the number of elements in each hash location is low.

What's more concerning than the total number of hash collisions, is if there are a high number of hash collisions in the same place, i.e. your hash is not random enough. The more elements are stuffed into a single hash location, the more linear your search time becomes; this is why guarantees on your hash function are more important than guarantees on hash collisions.

EDIT: while the above is correct, the abnormally high collisions on the table size of 100 (which I misread: oops) are due to the modulo being a factor of the month and year multiples: see the comment below.