Dynamic equal_to function unordered_map boost

361 views Asked by At

I have an unordered map string to int which uses a custom equal_to function defined as:

bool hashEqual::operator ()(const string &a, const string &b) const
{
    if (a.size() != b.size())
        return false;

    return std::inner_product(
        a.begin(), a.end(), b.begin(),
        0, std::plus<unsigned int>(),
        std::not2(std::equal_to<std::string::value_type>())
        ) <= 8;  
}           

Basically what it does is if two keys have a hamming distance equal or less than 8, then is the same key.

The thing is I want the distance threshold to be dynamic in order to let the user set it through the command line. Instead of 8, a variable threshold or something like this.

I'm not looking for a hack like a global variable (unless it's the only way to achieve this) but for the "good way".

2

There are 2 answers

0
Jcao02 On BEST ANSWER

I figured it out.

All is done in the class hashEqual. I changed the definition like this:

class hashEqual {
    private:
        int th;
    public:
       hashEqual();
        hashEqual(int th) { this->th = th; }; // This implemetation on the .cpp
        bool operator ()(const string &a, const string &b) const;
};

the operator() implementation:

bool hashEqual::operator ()(const string &a, const string &b) const
{
    if (a.size() != b.size())
        return false;

    return std::inner_product(
        a.begin(), a.end(), b.begin(),
        0, std::plus<unsigned int>(),
        std::not2(std::equal_to<std::string::value_type>())
        ) <= this->th;  
}   

And in the constructor of the unordered_map:

boost::unordered_map<string, unsigned int, boost::hash<string>, hashEqual> myMap(size, boost::hash<string>(), hashEqual(threshold));
0
Tony Delroy On

Why `unordered_map` doesn't work reliably

A good general-purpose hash function maps keys to buckets in a repeatable but otherwise seemingly random way, by which I mean that if the key varies by even a single bit then the bucket should be statistically unrelated - as if you'd picked another at random. So, say you have a hash table with some existing elements:

[ bucket 0 - "abcde fghij" ]
[ bucket 1 - <empty> ]
[ bucket 2 - <empty> ]
[ bucket 3 - "01234 56789", "77777 QQQQQ" ]  (2 colliding values for this bucket)
[ bucket 4 - "XXXXX YYYYY" ]
[ bucket 5 - <empty> ]

If you come along to insert say "Abcde fghij" then you could hash to any of these buckets - you should have no more chance of that being bucket 0 than any of the others, but if that bucket is not bucket 0 then you'll never even attempt a hamming-distance-aware equality comparison against "abcde fghij".


Why `multimap` doesn't work reliably

Imagine we a multimap with some existing strings (S1 through S6 in increasing lexicographical sort order - each with a hamming distance of more than 8 from the other elements) in it, the actual balanced binary tree might look something vaguely like:

            S4
          /    \
        S2       S6
       /  \     /  \
      S1   S3  S5

Now, let's say S1 happens to be "Abcde fghij", S4 is "ZZZZZ ZZZZZ" and we go to insert "abcde fghij":

  • even with hamming distance comparison, "ZZZZZ ZZZZZ" < "abcde fghij" (remember that 'Z' < 'a' in ASCII order) so the multimap expects "abcde fghij" to be stored in the right hand side of the tree...

  • "abcde fghij" is then compared to S6, and if less S5, and will be inserted accordingly, but crucially there's never any comparison with S1


Which brings me back to my earlier comment:

I don't think there's any simple and correct way to do the comparisons other than brute force (try every combination). And results vary for same data in another order.