calculate pairwise simhash "distances"


I want to construct a pairwise distance matrix where the "distances" are derived from the similarity scores between two strings as implemented here. I was thinking of using scikit-learn's pairwise_distances function to do this, as I've used it before for other calculations and the easy parallelization is great.

Here's the relevant piece of code:

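from sklearn.metrics import pairwise_distances
# simhash comes from the library linked above

def hashdistance(str1, str2):
    hash1 = simhash(str1)
    hash2 = simhash(str2)

    # similarity is 1 for identical hashes, so 1 - similarity acts as a distance
    distance = 1 - hash1.similarity(hash2)

    return distance


strings = [d['string'] for d in data]
distance_matrix = pairwise_distances(strings, metric=lambda u, v: hashdistance(u, v))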

strings looks like ['foo', 'bar', 'baz'].

When I try this, it throws the error ValueError: could not convert string to float. This might be a really dumb thing to say, but I'm not sure why a conversion would need to happen here at all, since the anonymous function passed as metric takes strings and returns a float. Why do the inputs need to be floats, and how can I create this pairwise distance matrix based on simhash 'distances'?


There is 1 answer

Phillip Cloud On BEST ANSWER

According to the documentation, only metrics from scipy.spatial.distance are allowed, or a callable from:

In [26]: sklearn.metrics.pairwise.pairwise_distance_functions
Out[26]:
{'cityblock': <function sklearn.metrics.pairwise.manhattan_distances>,
 'euclidean': <function sklearn.metrics.pairwise.euclidean_distances>,
 'l1': <function sklearn.metrics.pairwise.manhattan_distances>,
 'l2': <function sklearn.metrics.pairwise.euclidean_distances>,
 'manhattan': <function sklearn.metrics.pairwise.manhattan_distances>}

One issue is that if metric is callable, sklearn.metrics.pairwise.check_pairwise_arrays tries to convert the input to float before your function is ever called, which is what raises your error. scipy.spatial.distance.pdist does something similar, so you're out of luck there too.
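As a rough illustration of why that conversion fails, here is a stand-in that uses Python's built-in float() on each string. This is not scikit-learn's actual validation code, only an assumption-level sketch of what "convert the input to float" amounts to for string data:

```python
strings = ['foo', 'bar', 'baz']

message = None
try:
    # Stand-in for the library's input validation: every element
    # must be convertible to a float before the metric is called.
    values = [float(s) for s in strings]
except ValueError as exc:
    # The metric callable is never reached; the failure happens here.
    message = str(exc)
    print(message)
```

The point is that the error comes from validating the input, not from anything inside the metric function.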

Even if you could pass a callable it wouldn't scale very well, since the loop in pairwise_distances is pure Python. It looks like you'll have to just write the loop yourself. I would suggest reading the source code of pdist and/or pairwise_distances for hints as to how to do this.
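A minimal sketch of writing the loop yourself might look like the following. The function names string_distance and pairwise_string_distances are hypothetical, and string_distance uses difflib from the standard library purely as a placeholder so the example runs without the simhash library; substitute the hashdistance function from the question. The sketch assumes the metric is symmetric and that the distance from a string to itself is 0:

```python
from difflib import SequenceMatcher
from itertools import combinations


def string_distance(a, b):
    # Hypothetical placeholder for hashdistance() from the question.
    # It is NOT simhash; it only has the same shape (two strings in,
    # one float out).
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def pairwise_string_distances(strings, metric):
    """Return a symmetric n x n distance matrix as a list of lists."""
    n = len(strings)
    # Diagonal stays 0.0 (assumes metric(x, x) == 0).
    matrix = [[0.0] * n for _ in range(n)]
    # Visit each unordered pair once and mirror the result.
    for i, j in combinations(range(n), 2):
        d = metric(strings[i], strings[j])
        matrix[i][j] = d
        matrix[j][i] = d
    return matrix


strings = ['foo', 'bar', 'baz']
distance_matrix = pairwise_string_distances(strings, string_distance)
```

With simhash it would probably be worth hashing each string once up front and comparing the stored hashes inside the loop, rather than rehashing both strings for every pair as hashdistance does. The resulting list of lists can be passed to numpy.array if an array is needed downstream.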