I am trying to understand and implement MinHash-based Jaccard similarity in Python. The main goal is to use it in MapReduce. However, I am not clear on how the choice of hash function and the signature length affect the error in the computed Jaccard similarity. From Wikipedia, I found that in general the signature length k needed for an expected error e is k = O(1/e^2). I tried implementing MinHash in Python:
import random
import sys

#ERROR_THRESHOLD = 0.05
#SIG_LENGTH = int(1/(ERROR_THRESHOLD**2))

_memomask = {}

def hash_values(n, x):
    """Compute n different hash values"""
    values = []
    for i in range(n):
        mask = _memomask.get(i)
        if mask is None:
            random.seed(i)
            mask = _memomask[i] = random.getrandbits(32)
        values.append(hash(str(x)) % mask)
    return values

def compare_signatures(x, y):
    """Compare MinHash signatures"""
    size = len(x)
    if size != len(y): raise Exception("Different signature length")
    if size == 0: raise Exception("signature length is zero")
    counter = 0
    for i in range(size): counter += int(x[i] == y[i])
    return counter / float(size)
items = [['A',3], ['A',6], ['A',9], ['B',2], ['B',4], ['B',6], ['B',8]]

for SIG_LENGTH in [1, 10, 100, 400, 1000]:
    # Step 1: Compute the hash values for each token
    data = []
    for item in items:
        values = hash_values(SIG_LENGTH, item[1])
        key = item[0]
        data.append((key, values))

    # Step 2: Group by key and keep the minimum hash value at each signature index
    signatures = {}
    for item in data:
        key = item[0]
        values = item[1]
        if key not in signatures: signatures[key] = [-1.0]*SIG_LENGTH
        cur_signature = signatures[key]
        signatures[key] = [(values[i] if cur_signature[i] == -1.0 else min(values[i], cur_signature[i])) for i in range(SIG_LENGTH)]

    # Step 3: Estimate Jaccard similarity as the fraction of positions where the signatures agree
    keys = signatures.keys()
    key_length = len(keys)
    print "Jaccard Similarity based on signature of length {0}".format(SIG_LENGTH)
    for i in range(key_length):
        x_key = keys[i]
        x_sig = signatures[x_key]
        for j in range(i+1, key_length):
            y_key = keys[j]
            y_sig = signatures[y_key]
            print "J({0},{1}) = {2}".format(x_key, y_key, compare_signatures(x_sig, y_sig))
In my tests, I found that accuracy increases as the signature length increases, but then it starts decreasing (or stays flat) beyond a certain point. I am wondering whether this is because of the choice of hash function. If so, can someone please suggest a good hash function to use.
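For reference, the exact Jaccard similarity of the two example sets above, which the signature-based estimates should converge to, is 1/6 ≈ 0.167:

# exact Jaccard similarity of the two example sets
A = {3, 6, 9}
B = {2, 4, 6, 8}
print "Exact J(A,B) = {0}".format(len(A & B) / float(len(A | B)))  # 0.1666...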
I found a related post, but it is still not clear to me: How many hash functions are required in a minhash algorithm
You asked 1) what the optimal number of hashes is for a MinHash algorithm, and 2) whether you are using the correct hash function.
1) You mentioned k = O(1/e^2). This is correct if e refers to the error. You can also express it the other way around: the expected error e is on the order of 1/k**0.5, i.e. it shrinks with the square root of the signature length. Remember that this is the average error to which the estimate converges, not necessarily the error of any particular comparison.
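To make that concrete (this just evaluates the 1/sqrt(k) scaling above with the constant taken as 1; it is not a measurement), the expected error for the signature lengths tried in the question works out to roughly:

# expected-error scale ~ 1/sqrt(k) for the signature lengths used in the question
for k in [1, 10, 100, 400, 1000]:
    print "k = {0:4d}, expected error ~ {1:.3f}".format(k, 1.0 / k ** 0.5)

So with k = 400 the estimate should typically be within about 0.05 of the true Jaccard similarity, and going much beyond that buys relatively little.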
2) You can use any random hash function as long as each hash is salted differently. 64-bit hashes are probably the smallest I would recommend. I would avoid MD5 or SHA, since you don't need that overhead here. Be sure to take the hash modulo your platform's word size, e.g. sys.maxsize in Python. If you don't, you will run into cases where the algorithm does not behave correctly.
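As a rough sketch of what such a family of salted hash functions could look like (salted_hash and _salts are illustrative names, and XOR-ing a per-index 64-bit salt into Python's built-in hash is just one way to do the salting):

import random
import sys

_salts = {}

def salted_hash(i, x):
    """The i-th hash of x: built-in hash salted with a per-index 64-bit mask."""
    salt = _salts.get(i)
    if salt is None:
        random.seed(i)                    # deterministic salt for hash function i
        salt = _salts[i] = random.getrandbits(64)
    return (hash(str(x)) ^ salt) % sys.maxsize  # keep the value within the platform word size

The hash_values(n, x) function in the question would then simply return [salted_hash(i, x) for i in range(n)].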