Relationship between (1) hash function, (2) signature length, and (3) Jaccard similarity?

1.2k views

I am trying to understand and implement MinHash-based Jaccard similarity in Python, with the goal of using it in MapReduce. However, I am not clear on how the choice of hash function and the length of the signature affect the error in the computed Jaccard similarity. From Wikipedia, I found that, in general, the signature length k needed to achieve an expected error e is k = O(1/e^2). I tried implementing MinHash in Python:

import random
import sys

#ERROR_THRESHOLD = 0.05
#SIG_LENGTH = int(1/(ERROR_THRESHOLD**2))
_memomask = {}

def hash_values(n, x):
    """Compute n different hash values"""
    values = []
    for i in range(n):
        mask = _memomask.get(i)
        if mask is None:
            random.seed(i)
            mask = _memomask[i] = random.getrandbits(32)
        values.append((hash(str(x)) % mask))
    return values


def compare_signatures(x, y):
    """Compare MinHash Signatures"""
    size = len(x)

    if size != len(y): raise Exception("Different signature length")
    if size == 0: raise Exception("signature length is zero")

    counter = 0
    for i in range(size): counter += int(x[i] == y[i])
    return counter/float(size)

items = [['A',3], ['A',6], ['A',9], ['B',2], ['B',4], ['B',6], ['B',8]]

for SIG_LENGTH in [1, 10, 100, 400, 1000]:
    #Step 1: Compute Hash Signature for each token
    data = []
    for item in items:
        values = hash_values(SIG_LENGTH, item[1])
        key = item[0]    
        data.append((key, values))

    #Step 2: Group by Key and compute MinHash for each index
    signatures = {}
    for item in data:
        key = item[0]
        values = item[1]
        if key not in signatures: signatures[key] = [-1.0]*SIG_LENGTH
        cur_signature = signatures[key]   

        signatures[key] = [(values[i] if cur_signature[i] == -1.0 else min(values[i], cur_signature[i])) for i in range(SIG_LENGTH)]

    #Step 3: Compute Probability of minHash signature to be same
    keys = list(signatures.keys())
    key_length = len(keys)
    print("Jaccard Similarity based on signature of length {0}".format(SIG_LENGTH))
    for i in range(key_length):
        x_key = keys[i]
        x_sig = signatures[x_key]
        for j in range(i+1,key_length):
            y_key = keys[j]
            y_sig = signatures[y_key]
            print("J({0},{1}) = {2}".format(x_key, y_key, compare_signatures(x_sig, y_sig)))

In my tests, I found that accuracy increases as the signature length increases, but then it starts decreasing (or stays flat) beyond a point. I wonder whether this is because of the choice of hash function. If so, can someone please suggest a good hash function to use?

I found a related post, but it is still not clear to me: How many hash functions are required in a minhash algorithm

There are 3 answers

Answer by apublius:

You asked (1) what the optimal number of hashes for a MinHash algorithm is, and (2) whether you are using the correct hash function.

1) You mentioned k = O(1/e^2). This is correct if e refers to the error; equivalently, the expected error is O(1/sqrt(k)). Keep in mind this is the average error to which the estimate converges, not necessarily the error of any particular comparison.
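To see the 1/sqrt(k) convergence concretely, here is a small simulation (a sketch, using explicit random permutations rather than the question's hash-based scheme) that estimates J(A, B) for the question's two sets with growing k:

```python
import random

def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / float(len(a | b))

def minhash_estimate(a, b, k, rng):
    """Estimate J(a, b) from k random permutations of the combined universe."""
    universe = sorted(a | b)
    matches = 0
    for _ in range(k):
        perm = list(universe)
        rng.shuffle(perm)
        rank = {v: i for i, v in enumerate(perm)}
        # the two minima agree with probability exactly J(a, b)
        matches += int(min(rank[x] for x in a) == min(rank[x] for x in b))
    return matches / float(k)

A = {3, 6, 9}
B = {2, 4, 6, 8}
true_j = jaccard(A, B)  # 1/6
rng = random.Random(0)
for k in (10, 100, 1000):
    est = minhash_estimate(A, B, k, rng)
    # the absolute error shrinks roughly like 1/sqrt(k)
    print(k, est, abs(est - true_j))
```

Because each permutation yields an independent Bernoulli(J) trial, the standard error of the estimate is sqrt(J(1-J)/k), which is the 1/sqrt(k) rate above.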

2) You can use any random hash function, as long as each one is salted differently. 64-bit hashes are probably the smallest I'd recommend. I'd avoid MD5 or SHA, since you don't need that overhead here. Be sure to take the hash modulo your platform's word size, e.g. sys.maxsize in Python; if you don't, you will run into cases where the algorithm does not behave correctly.
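A minimal sketch of such a salted family, assuming Python's built-in hash as the base function (the names are illustrative): each function XORs its own fixed random mask into the hash and reduces the result modulo sys.maxsize.

```python
import random
import sys

def make_hash_family(n, seed=0):
    """Build n salted hash functions: XOR a fixed random mask into the
    built-in hash, then reduce modulo the platform word size."""
    rng = random.Random(seed)
    masks = [rng.getrandbits(64) for _ in range(n)]
    # bind each mask as a default argument so every lambda keeps its own salt
    return [lambda x, m=m: (hash(x) ^ m) % sys.maxsize for m in masks]

hashes = make_hash_family(4)
# the same input yields four different salted hash values
print([h("some token") for h in hashes])
```

Note that in Python 3 the built-in hash of strings is randomized per process, so values are stable only within a single run; for reproducible signatures across processes you would substitute a deterministic base hash.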

Answer by Marat:

MD5 and SHA work pretty well:

import random
import hashlib
import sys

k = int(sys.argv[1])
salts = [random.getrandbits(32) for i in range(k)]

def h(value, salt):
    """Hash a value together with a per-function salt."""
    m = hashlib.md5() #or hashlib.sha1()
    m.update(str(value).encode('utf-8'))
    m.update(str(salt).encode('utf-8'))
    return m.digest()

def get_signatures(A):
    return [min([h(x, salt) for x in A]) for salt in salts]

def compare_signatures(A, B):
    """Compare MinHash Signatures"""
    sigA = get_signatures(A)
    sigB = get_signatures(B)
    return sum(map(lambda x: int(sigA[x] == sigB[x]), range(k)))/float(k)

A = [3,6,9]
B = [2,4,6,8]

print(compare_signatures(A, B))

and some tests:

$ for((i=10;i<2000;i*=10)); do python minhash.py $i; done
0.2
0.14
0.163
Answer by jaguarpaw:

One way to generate a large number of hash functions is to use a different seed for each one, as in the createHashFunctions example linked here.
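For illustration, one common seeded construction draws the coefficients of a universal hash family h(x) = (a*x + b) mod p from a deterministically seeded RNG (a sketch; the function names and the choice of prime are assumptions here, and the linked createHashFunctions may differ in detail):

```python
import random

# a large Mersenne prime used as the modulus (an illustrative choice)
MERSENNE_PRIME = (1 << 61) - 1

def create_hash_functions(n, seed=42):
    """Return n functions h_i(x) = (a_i*x + b_i) mod p, with coefficients
    drawn from an RNG seeded deterministically so runs are reproducible."""
    rng = random.Random(seed)
    funcs = []
    for _ in range(n):
        a = rng.randrange(1, MERSENNE_PRIME)
        b = rng.randrange(0, MERSENNE_PRIME)
        # bind a and b as defaults so each lambda keeps its own coefficients
        funcs.append(lambda x, a=a, b=b: (a * x + b) % MERSENNE_PRIME)
    return funcs

hs = create_hash_functions(3)
# MinHash signature of the set {3, 6, 9} under this family
print([min(h(x) for x in [3, 6, 9]) for h in hs])
```

Because the seed fixes the coefficients, every worker in a MapReduce job can rebuild the same family independently, which is what makes per-index minimum comparisons meaningful across machines.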