Faster implementation of LSH (AND-OR)

I have a data set of size (160000, 3200) in which every element is either zero or one, and I want to find similar candidates. Using MinHash with a single for-loop, I hashed it down to (160000, 200) in about two minutes, which I am happy with. I then implemented Locality-Sensitive Hashing (LSH) with the AND-OR schema from chapter 3 of 'Mining of Massive Datasets' to find candidate pairs, using nested for-loops, but it took 30 minutes. I want to reduce this time. Is there a faster way?

Here is how I have done LSH: MinHash signature length (n) = 200, sub-signature length (r) = 5, number of bands (b) = 40.

# signatures: the (160000, 200) MinHash signature matrix (numpy array)
n_users, r, b = 160000, 5, 40
# one dictionary per band, mapping a sub-signature to a list of user ids
bucket_of_ids = [dict() for _ in range(b)]
for user in range(n_users):
    for band in range(b):
        # this band's 5 consecutive signature values, stringified as a dict key
        r_signature = str(signatures[user, band * r:(band + 1) * r])
        if r_signature in bucket_of_ids[band]:
            bucket_of_ids[band][r_signature].append(user)
        else:
            bucket_of_ids[band][r_signature] = [user]

The MinHash signature matrix of size (160000, 200) is a numpy array. My idea is: if I can cheaply convert it into a (160000, 40) array, where each element of the new array is formed from 5 elements of the MinHash array, then maybe I can use numpy.unique() to get a unique r_signature per column to use as keys for the dictionary of candidate ids. I am new to Python as well as to coding, and I can't think of a way to optimize this to make it run faster.
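In sketch form, what I am imagining is something like this (untested, with a random stand-in for the real signature matrix):

import numpy as np

r, b = 5, 40
rng = np.random.default_rng(0)
signatures = rng.integers(0, 3200, size=(160000, 200))  # stand-in for the real minhash matrix

# (160000, 40, 5): each band holds 5 consecutive signature values per user
bands = signatures.reshape(len(signatures), b, r)
band_ids = np.empty((len(signatures), b), dtype=np.int64)
for band in range(b):
    # inverse[i] == inverse[j] iff users i and j share this band's sub-signature
    _, inverse = np.unique(bands[:, band, :], axis=0, return_inverse=True)
    band_ids[:, band] = inverse.ravel()
# users sharing a value in any column of band_ids are candidate pairs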

Here is the link to the code and data: https://colab.research.google.com/drive/1HetBrWFRYqwUxn0v7wIwS7COBaNmusfD

Note: I have observed that the time taken for the MinHash part increases linearly with the data (number of users, in this case), whereas the LSH part increases non-linearly (the first 6.25% took 20.15 seconds and the last 6.25% took 132.3 seconds). I think this part needs to be optimized, if possible, to scale properly with the data. I believe the check for whether the key is already present in the dictionary is the part of the code responsible for this.

Update: I have solved this by avoiding the check for the presence of a key in the dict, though I still ended up using nested for-loops twice. It now takes 310 seconds for 160000 candidates, and the time taken scales linearly with the data. I have updated the corresponding code in the Google Colab notebook.
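For anyone hitting the same problem: one way to drop the key-presence check is collections.defaultdict; a minimal sketch (not necessarily the exact approach in my notebook), reusing the names from the code above:

from collections import defaultdict

# missing keys get a fresh list automatically, so there is no `if key in dict` branch
bucket_of_ids = [defaultdict(list) for _ in range(b)]
for user in range(n_users):
    for band in range(b):
        r_signature = str(signatures[user, band * r:(band + 1) * r])
        bucket_of_ids[band][r_signature].append(user)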


1 Answer

negas

Have you tried the datasketch library? It has implementations of both the MinHash and LSH algorithms.
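A minimal example of its API (pip install datasketch); the keys and feature sets below are made up for illustration:

from datasketch import MinHash, MinHashLSH

# index users under an estimated-Jaccard threshold of 0.8
lsh = MinHashLSH(threshold=0.8, num_perm=200)

users = {
    "user_0": {"f1", "f2", "f3"},
    "user_1": {"f1", "f2", "f4"},
    "user_2": {"f7", "f8", "f9"},
}
minhashes = {}
for key, features in users.items():
    m = MinHash(num_perm=200)
    for f in features:
        m.update(f.encode("utf8"))
    lsh.insert(key, m)
    minhashes[key] = m

# returns keys of candidates estimated to be at least 0.8 similar to user_0
print(lsh.query(minhashes["user_0"]))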