Relationship between (1) hash function, (2) signature length, and (3) Jaccard similarity?

1.2k views

I am trying to understand and implement MinHash-based Jaccard similarity in Python, with the goal of using it in MapReduce. However, I am not clear on how the choice of hash function and the length of the signature affect the error in the computed Jaccard similarity. From Wikipedia, I found that, in general, the signature length k needed to achieve an expected error e is k = O(1/e^2). I tried implementing MinHash in Python:

import random
import sys

#ERROR_THRESHOLD = 0.05
#SIG_LENGTH = int(1/(ERROR_THRESHOLD**2))
_memomask = {}

def hash_values(n, x):
    """Compute n different hash values"""
    values = []
    for i in range(n):
        mask = _memomask.get(i)
        if mask is None:
            random.seed(i)
            mask = _memomask[i] = random.getrandbits(32)
        values.append((hash(str(x)) % mask))
    return values


def compare_signatures(x, y):
    """Compare MinHash Signatures"""
    size = len(x)

    if size != len(y): raise Exception("Different signature length")
    if size == 0: raise Exception("signature length is zero")

    counter = 0
    for i in range(size): counter += int(x[i] == y[i])
    return counter/float(size)

items = [['A',3], ['A',6], ['A',9], ['B',2], ['B',4], ['B',6], ['B',8]]

for SIG_LENGTH in [1, 10, 100, 400, 1000]:
    #Step 1: Compute Hash Signature for each token
    data = []
    for item in items:
        values = hash_values(SIG_LENGTH, item[1])
        key = item[0]    
        data.append((key, values))

    #Step 2: Group by Key and compute MinHash for each index
    signatures = {}
    for item in data:
        key = item[0]
        values = item[1]
        if key not in signatures: signatures[key] = [-1.0]*SIG_LENGTH
        cur_signature = signatures[key]   

        signatures[key] = [(values[i] if cur_signature[i] == -1.0 else min(values[i], cur_signature[i])) for i in range(SIG_LENGTH)]

    #Step 3: Compute Probability of minHash signature to be same
    keys = list(signatures.keys())
    key_length = len(keys)
    print("Jaccard Similarity based on signature of length {0}".format(SIG_LENGTH))
    for i in range(key_length):
        x_key = keys[i]
        x_sig = signatures[x_key]
        for j in range(i+1,key_length):
            y_key = keys[j]
            y_sig = signatures[y_key]
            print("J({0},{1}) = {2}".format(x_key, y_key, compare_signatures(x_sig, y_sig)))

In my tests, I found that accuracy increases as the signature length increases, but then it starts decreasing (or stays flat) beyond a point. I wonder whether this is because of the choice of hash function. If so, can someone please suggest a good hash function to use?

I found a related post, but it is still not clear to me: How many hash functions are required in a minhash algorithm

There are 3 answers

Answer by apublius:

You asked (1) what the optimal number of hashes for a MinHash algorithm is, and (2) whether you are using the correct hash function.

1) You mentioned k = O(1/e^2). This is correct if e refers to the error; equivalently, the expected error is O(1/sqrt(k)). Keep in mind this is the average error to which the estimate converges, not necessarily the error of any particular comparison.
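To see the 1/sqrt(k) convergence concretely, here is a small simulation (a sketch, using explicit random permutations rather than the question's hash-based scheme) that estimates J(A, B) for the question's two sets with growing k:

```python
import random

def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / float(len(a | b))

def minhash_estimate(a, b, k, rng):
    """Estimate J(a, b) from k random permutations of the combined universe."""
    universe = sorted(a | b)
    matches = 0
    for _ in range(k):
        perm = list(universe)
        rng.shuffle(perm)
        rank = {v: i for i, v in enumerate(perm)}
        # the two minima agree with probability exactly J(a, b)
        matches += int(min(rank[x] for x in a) == min(rank[x] for x in b))
    return matches / float(k)

A = {3, 6, 9}
B = {2, 4, 6, 8}
true_j = jaccard(A, B)  # 1/6
rng = random.Random(0)
for k in (10, 100, 1000):
    est = minhash_estimate(A, B, k, rng)
    # the absolute error shrinks roughly like 1/sqrt(k)
    print(k, est, abs(est - true_j))
```

Because each permutation yields an independent Bernoulli(J) trial, the standard error of the estimate is sqrt(J(1-J)/k), which is the 1/sqrt(k) rate above.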

2) You can use any random hash function, as long as each one is salted differently. 64-bit hashes are probably the smallest I'd recommend. I'd avoid MD5 or SHA, since you don't need that overhead here. Be sure to take the hash modulo your platform's word size, e.g. sys.maxsize in Python; if you don't, you will run into cases where the algorithm does not behave correctly.
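A minimal sketch of such a salted family, assuming Python's built-in hash as the base function (the names are illustrative): each function XORs its own fixed random mask into the hash and reduces the result modulo sys.maxsize.

```python
import random
import sys

def make_hash_family(n, seed=0):
    """Build n salted hash functions: XOR a fixed random mask into the
    built-in hash, then reduce modulo the platform word size."""
    rng = random.Random(seed)
    masks = [rng.getrandbits(64) for _ in range(n)]
    # bind each mask as a default argument so every lambda keeps its own salt
    return [lambda x, m=m: (hash(x) ^ m) % sys.maxsize for m in masks]

hashes = make_hash_family(4)
# the same input yields four different salted hash values
print([h("some token") for h in hashes])
```

Note that in Python 3 the built-in hash of strings is randomized per process, so values are stable only within a single run; for reproducible signatures across processes you would substitute a deterministic base hash.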

Answer by Marat:

MD5 and SHA work pretty well:

import random
import hashlib
import sys

k = int(sys.argv[1])
salts = [random.getrandbits(32) for i in range(k)]

def h(value, salt):
    """Hash a value together with a per-function salt."""
    m = hashlib.md5() #or hashlib.sha1()
    m.update(str(value).encode('utf-8'))
    m.update(str(salt).encode('utf-8'))
    return m.digest()

def get_signatures(A):
    return [min([h(x, salt) for x in A]) for salt in salts]

def compare_signatures(A, B):
    """Compare MinHash Signatures"""
    sigA = get_signatures(A)
    sigB = get_signatures(B)
    return sum(map(lambda x: int(sigA[x] == sigB[x]), range(k)))/float(k)

A = [3,6,9]
B = [2,4,6,8]

print(compare_signatures(A, B))

and some tests:

$ for((i=10;i<2000;i*=10)); do python minhash.py $i; done
0.2
0.14
0.163
Answer by jaguarpaw:

One way to generate a large number of hash functions is to use a different seed for each one, as in the createHashFunctions example linked here.
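For illustration, one common seeded construction draws the coefficients of a universal hash family h(x) = (a*x + b) mod p from a deterministically seeded RNG (a sketch; the function names and the choice of prime are assumptions here, and the linked createHashFunctions may differ in detail):

```python
import random

# a large Mersenne prime used as the modulus (an illustrative choice)
MERSENNE_PRIME = (1 << 61) - 1

def create_hash_functions(n, seed=42):
    """Return n functions h_i(x) = (a_i*x + b_i) mod p, with coefficients
    drawn from an RNG seeded deterministically so runs are reproducible."""
    rng = random.Random(seed)
    funcs = []
    for _ in range(n):
        a = rng.randrange(1, MERSENNE_PRIME)
        b = rng.randrange(0, MERSENNE_PRIME)
        # bind a and b as defaults so each lambda keeps its own coefficients
        funcs.append(lambda x, a=a, b=b: (a * x + b) % MERSENNE_PRIME)
    return funcs

hs = create_hash_functions(3)
# MinHash signature of the set {3, 6, 9} under this family
print([min(h(x) for x in [3, 6, 9]) for h in hs])
```

Because the seed fixes the coefficients, every worker in a MapReduce job can rebuild the same family independently, which is what makes per-index minimum comparisons meaningful across machines.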