cmph Minimal perfect hashing

528 views Asked by At

I've spent days trying to make the library work on my system. The library has several algorithms which generate MPHFs. My understanding of minimal hash function is, that when I hash two distinct keys using the MPHF, they'll return two different ids. This does not seem to be the case with the 2 million keys that I've generated (integers, read as string by the algorithm). I've tried couple of algorithms that the library implements but all of them result in duplicate 'ids' for a lot of keys.

Here is what I've written:

#include <cmph.h>
#include <iostream>
#include <fstream>
#include <bitset>
#include <string>
#include <sstream>
#include <limits.h>

using namespace std;

int main(int argc, char** argv){

    FILE *fp = fopen("keys.txt", "r");
    FILE *read = fopen("keys2.txt", "r");
    ofstream ids("ids2.txt");

    if(!fp || !read || !ids.is_open()){
        cerr<<"Failed to open the file\n";
        exit(1);
    }

    cmph_t* hash = NULL;
    // source of keys
    cmph_io_adapter_t *source = cmph_io_nlfile_adapter(fp);
    cmph_config_t *config = cmph_config_new(source);
    cmph_config_set_algo(config, CMPH_BDZ);
    hash = cmph_new(config);
    cmph_config_destroy(config);

    char *k = (char *)malloc(sizeof(12));

    while(fgets(k, INT_MAX, read) != NULL){
        string key = k;
        unsigned int id = cmph_search(hash, k, (cmph_uint32)key.length());
        ids<<id<<"\n";
    }

    cmph_destroy(hash);
    cmph_io_nlfile_adapter_destroy(source);
    fclose(fp);
    fclose(read);
    ids.close();
}

Shouldn't the ids be unique for every distinct key if the algorithm claims to generate a minimal perfect hash function? There are 2048383 keys. For my project I would need the ids to map from 0 to 2048382, since I plan to use a minimal perfect hash function. I am not sure where I am going wrong with my understanding. Please help.

1

There are 1 answers

0
Pavel P On

If your keys2.txt contains keys that weren't part of the set that was used to generate your hash, then, by definition of the mphf, you'll get either duplicate hashes or, possibly, values out of your range. It's up to you to store all keys that were used to generate hash and then verify that the key that was passed to cmph_search was the same as the one that resulted in the hash/id returned by cmph_search