The library and its documentation are the following -- yes, I read everything and I am able to run it on my own code.
http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LSHForest.html
But the results do not really make sense to me, so I went through the example (which is also included on the page above):
>>> from sklearn.neighbors import LSHForest
>>> X_train = [[5, 5, 2], [21, 5, 5], [1, 1, 1], [8, 9, 1], [6, 10, 2]]
>>> X_test = [[9, 1, 6], [3, 1, 10], [7, 10, 3]]
>>> lshf = LSHForest()
>>> lshf.fit(X_train)
LSHForest(min_hash_match=4, n_candidates=50, n_estimators=10,
          n_neighbors=5, radius=1.0, radius_cutoff_ratio=0.9,
          random_state=None)
>>> distances, indices = lshf.kneighbors(X_test, n_neighbors=2)
>>> distances
array([[ 0.069..., 0.149...],
       [ 0.229..., 0.481...],
       [ 0.004..., 0.014...]])
>>> indices
array([[1, 2],
       [2, 0],
       [4, 0]])
So I tried to verify the example by finding the nearest neighbors for the three test points [9, 1, 6], [3, 1, 10], [7, 10, 3] myself.
Say I search for the nearest neighbors of [9, 1, 6] (using Euclidean distance): the closest training points are [5, 5, 2] and [1, 1, 1], so I think the indices should be [0, 2] -- which is significantly different from the reported result [1, 2].
The distances are also completely off according to a simple manual calculation (my Excel sheet is attached).
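For example, computing the Euclidean distances from [9, 1, 6] to all five training points with NumPy (the same calculation as in the spreadsheet) gives:
>>> import numpy as np
>>> X_train = np.array([[5, 5, 2], [21, 5, 5], [1, 1, 1], [8, 9, 1], [6, 10, 2]])
>>> # Euclidean distance from the first test point to each training point
>>> np.sqrt(((X_train - np.array([9, 1, 6])) ** 2).sum(axis=1))
array([  6.928...,  12.688...,   9.433...,   9.486...,  10.295...])
None of these is anywhere near the 0.069... and 0.149... reported by LSHForest.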
Thanks again for your time and help.
It's not wrong: LSHForest implements ANN (approximate nearest neighbors), and that is the difference we need to take into consideration. The ANN results are not the exact nearest neighbors, but an approximation of what the nearest neighbors should be. Note also that, according to the LSHForest documentation, the distances returned by kneighbors are cosine distances, not Euclidean ones, which is why your Euclidean calculations don't match.
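As a quick sanity check (using SciPy's cosine distance, which is defined as 1 minus the cosine similarity), the first distance in your output, 0.069..., is exactly the cosine distance between the first test point [9, 1, 6] and training point 1, [21, 5, 5]:
>>> from scipy.spatial.distance import cosine
>>> cosine([9, 1, 6], [21, 5, 5])  # 1 - cosine similarity
0.069...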
For comparison, an exact 2-nearest-neighbor search on the same data -- a minimal sketch using scikit-learn's NearestNeighbors with a brute-force, Euclidean-metric search -- looks like:
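>>> from sklearn.neighbors import NearestNeighbors
>>> X_train = [[5, 5, 2], [21, 5, 5], [1, 1, 1], [8, 9, 1], [6, 10, 2]]
>>> X_test = [[9, 1, 6], [3, 1, 10], [7, 10, 3]]
>>> # brute-force exact search with the Euclidean metric
>>> nbrs = NearestNeighbors(n_neighbors=2, algorithm='brute', metric='euclidean').fit(X_train)
>>> distances, indices = nbrs.kneighbors(X_test)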
and returns:
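>>> distances
array([[ 6.928...,  9.433...],
       [ 9.165...,  9.219...],
       [ 1.414...,  2.449...]])
>>> indices
array([[0, 2],
       [0, 2],
       [4, 3]])
These exact indices and Euclidean distances differ from the LSHForest output above, which is expected: LSHForest is approximate and, as noted, reports cosine rather than Euclidean distances.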
If it helps, check out this and note that it mentions:
The point at distance 'r' and the point returned don't have to be the same.
Hope this helps.