LSH: practice of solving nearest neigbors search

296 views Asked by At

"LSH has some huge advantages. First, it is simple. You just compute the hash for all points in your database, then make a hash table from them. To query, just compute the hash of the query point, then retrieve all points in the same bin from the hash table."

Referring to the answer on another question I am looking for clarification of the process of LSH analysis. Suppose I have sparse feature vectors (binary, mostly 0) and would like to use cosine distance as the measure with a threshold alpha, which might vary.

  1. My first step is to compute the hash for each of the vectors. Does distance measure matter? (I suppose yes). Does threshold matters? (I suppose no). How can I find the appropriate hash-function?

    If programming, I would have function like:

    bytes[] getHash(Vector featureVec)

    Then I would put results in the Map(long vectorId, bytes[] hashcode) <-vectorHashMap

  2. Then I make hash table from hashes (putting hashs into bins). I suppose at least here should the threshold matter. How can I do that?

    If programming, it would be like:

    Map,Map createHashTable(Map vectorHashMap, long threshold)

    which returns two maps: Map of (hashCode, bucketId) and Map of (bucketId, ListOfVectorIds).

  3. Then i could easily retrieve the neigbors having vectorId as input and a list of vectorIds as output.

1

There are 1 answers

0
Sean Owen On

The hash has nothing to do with the distance measure. You get the each bit of the hash by dotting the vector with a randomly-chosen vector. The bit says what side of the random vector (a hyperplane really) the hashed vector is on. The bits together are a hash.

Yes then you index your vectors by hash value for easy retrieval. You don't need a 'bucket ID' -- your hash is your bucket.

The only catch here is that it is not true that all the nearest vectors are in the bucket that it hashed too. They merely tend to be near. If it mattered, you might have to search 'similar' buckets -- ones that differ in just a few bits -- to consider more candidates and do a better job of finding truly closest neighbors.