Example from LSHForest: results not convincing


The library and its corresponding documentation are linked below -- yes, I read everything and was able to run the example in my own code.

http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LSHForest.html

But the results do not really make sense to me, so I went through the example (which is also included on the page above):

    >>> from sklearn.neighbors import LSHForest
    >>> X_train = [[5, 5, 2], [21, 5, 5], [1, 1, 1], [8, 9, 1], [6, 10, 2]]
    >>> X_test = [[9, 1, 6], [3, 1, 10], [7, 10, 3]]
    >>> lshf = LSHForest()
    >>> lshf.fit(X_train)  
    LSHForest(min_hash_match=4, n_candidates=50, n_estimators=10,
              n_neighbors=5, radius=1.0, radius_cutoff_ratio=0.9,
              random_state=None)
    >>> distances, indices = lshf.kneighbors(X_test, n_neighbors=2)
    >>> distances                                        
        array([[ 0.069...,  0.149...],
               [ 0.229...,  0.481...],
               [ 0.004...,  0.014...]])
    >>> indices
        array([[1, 2],
               [2, 0],
               [4, 0]])

So I just tried to verify the example by finding the nearest neighbors of the three test points [9, 1, 6], [3, 1, 10], [7, 10, 3].

Say we search for the nearest neighbors of [9, 1, 6] using Euclidean distance: the closest training points are [5, 5, 2] and [1, 1, 1], so I would expect the indices to be [0, 2] -- which is significantly different from the reported result [1, 2].

The distances are also completely off according to a simple manual calculation; my Excel sheet is attached.
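For reference, here is the same check in a few lines of NumPy instead of a spreadsheet; it just computes the plain Euclidean distance from each test point to every training point and sorts:

    import numpy as np

    X_train = np.array([[5, 5, 2], [21, 5, 5], [1, 1, 1], [8, 9, 1], [6, 10, 2]])
    X_test = np.array([[9, 1, 6], [3, 1, 10], [7, 10, 3]])

    for q in X_test:
        # Euclidean distance from this test point to every training point
        d = np.linalg.norm(X_train - q, axis=1)
        print(q, d.round(2), "nearest two:", d.argsort()[:2])

For [9, 1, 6] this prints distances [6.93, 12.69, 9.43, 9.49, 10.3] and nearest indices [0, 2] -- not the [1, 2] that LSHForest reports.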

Thanks again for your time and help.

1 Answer

Aramis7d (accepted, 5 votes):

It's not wrong: LSHForest implements ANN (approximate nearest neighbors), and that is the difference we need to take into consideration. The ANN results are not the exact nearest neighbors, but an approximation of what the nearest neighbors should be.

For comparison, an exact 2-nearest-neighbor search looks like this:

    from sklearn.neighbors import NearestNeighbors

    X_train = [[5, 5, 2], [21, 5, 5], [1, 1, 1], [8, 9, 1], [6, 10, 2]]
    X_test = [[9, 1, 6], [3, 1, 10], [7, 10, 3]]

    # exact (non-approximate) 2-NN search with Euclidean distance
    nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X_train)
    distances, indices = nbrs.kneighbors(X_test)

and returns

    indices
    Out[2]: 
    array([[0, 2],
           [0, 2],
           [4, 3]], dtype=int64)

    distances
    Out[3]: 
    array([[ 6.92820323,  9.43398113],
           [ 9.16515139,  9.21954446],
           [ 1.41421356,  2.44948974]])
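One more detail that explains the distance values themselves: according to the LSHForest documentation, kneighbors returns cosine distances, not Euclidean ones (its random-projection hashing approximates cosine similarity). A quick way to check, using sklearn's cosine_distances:

    from sklearn.metrics.pairwise import cosine_distances

    X_train = [[5, 5, 2], [21, 5, 5], [1, 1, 1], [8, 9, 1], [6, 10, 2]]
    X_test = [[9, 1, 6], [3, 1, 10], [7, 10, 3]]

    # cosine distance from the first test point to training points 1 and 2,
    # the neighbors LSHForest reported for it
    print(cosine_distances([X_test[0]], [X_train[1], X_train[2]]))
    # [[ 0.069...  0.149...]] -- exactly the distances shown in the question

So under the metric LSHForest actually uses, the indices [1, 2] are the true nearest neighbors of [9, 1, 6]; the approximation only means LSHForest is not guaranteed to find them every time.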

If it helps, check out this and notice that it mentions:

given a query point q, if there exists a point within distance r from q, then it reports a point within distance cr from q. Here c is the approximation factor of the algorithm.

The point at distance r and the point actually returned don't have to be the same point.
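To make that concrete with the data from the question (using a made-up approximation factor c = 1.5 purely for illustration; LSHForest doesn't expose c as a parameter):

    import numpy as np

    X_train = np.array([[5, 5, 2], [21, 5, 5], [1, 1, 1], [8, 9, 1], [6, 10, 2]])
    q = np.array([9, 1, 6])

    d = np.linalg.norm(X_train - q, axis=1)
    r = d.min()                     # ~6.93, distance to the true nearest neighbor
    c = 1.5                         # assumed approximation factor, for illustration
    print(np.where(d <= c * r)[0])  # [0 2 3 4]

Four of the five training points fall within c*r of the query, so an approximate algorithm returning any of them would still satisfy that guarantee.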

Hope this helps.