The library and its documentation are the following -- yes, I read everything and I am able to run it on my own code.
http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LSHForest.html
But the results do not really make sense to me, so I went through the example (which is also included on the page above):
>>> from sklearn.neighbors import LSHForest
>>> X_train = [[5, 5, 2], [21, 5, 5], [1, 1, 1], [8, 9, 1], [6, 10, 2]]
>>> X_test = [[9, 1, 6], [3, 1, 10], [7, 10, 3]]
>>> lshf = LSHForest()
>>> lshf.fit(X_train)
LSHForest(min_hash_match=4, n_candidates=50, n_estimators=10,
          n_neighbors=5, radius=1.0, radius_cutoff_ratio=0.9,
          random_state=None)
>>> distances, indices = lshf.kneighbors(X_test, n_neighbors=2)
>>> distances
array([[ 0.069..., 0.149...],
       [ 0.229..., 0.481...],
       [ 0.004..., 0.014...]])
>>> indices
array([[1, 2],
       [2, 0],
       [4, 0]])
So I tried to verify the example by finding the nearest neighbors for the three test points [9, 1, 6], [3, 1, 10], [7, 10, 3] myself.
Say I search for the nearest neighbors of [9, 1, 6] (using Euclidean distance): the closest training points are [5, 5, 2] and [1, 1, 1], so I think the indices should be [0, 2] -- which is significantly different from the reported result [1, 2].
The distances are also completely off according to a simple manual calculation (my Excel sheet is attached).
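For example, computing the Euclidean distances from [9, 1, 6] to all five training points with NumPy (the same calculation as in the spreadsheet) gives:
>>> import numpy as np
>>> X_train = np.array([[5, 5, 2], [21, 5, 5], [1, 1, 1], [8, 9, 1], [6, 10, 2]])
>>> # Euclidean distance from the first test point to each training point
>>> np.sqrt(((X_train - np.array([9, 1, 6])) ** 2).sum(axis=1))
array([  6.928...,  12.688...,   9.433...,   9.486...,  10.295...])
None of these is anywhere near the 0.069... and 0.149... reported by LSHForest.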
Thanks again for your time and help.
It's not wrong: LSHForest implements ANN (approximate nearest neighbors), and that is the difference we need to take into consideration. The ANN results are not the exact nearest neighbors, but an approximation of what the nearest neighbors should be. Note also that, according to the LSHForest documentation, the distances returned by kneighbors are cosine distances, not Euclidean ones, which is why your Euclidean calculations don't match.
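As a quick sanity check (using SciPy's cosine distance, which is defined as 1 minus the cosine similarity), the first distance in your output, 0.069..., is exactly the cosine distance between the first test point [9, 1, 6] and training point 1, [21, 5, 5]:
>>> from scipy.spatial.distance import cosine
>>> cosine([9, 1, 6], [21, 5, 5])  # 1 - cosine similarity
0.069...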
For comparison, an exact 2-nearest-neighbor search on the same data -- a minimal sketch using scikit-learn's NearestNeighbors with a brute-force, Euclidean-metric search -- looks like:
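>>> from sklearn.neighbors import NearestNeighbors
>>> X_train = [[5, 5, 2], [21, 5, 5], [1, 1, 1], [8, 9, 1], [6, 10, 2]]
>>> X_test = [[9, 1, 6], [3, 1, 10], [7, 10, 3]]
>>> # brute-force exact search with the Euclidean metric
>>> nbrs = NearestNeighbors(n_neighbors=2, algorithm='brute', metric='euclidean').fit(X_train)
>>> distances, indices = nbrs.kneighbors(X_test)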
and returns:
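>>> distances
array([[ 6.928...,  9.433...],
       [ 9.165...,  9.219...],
       [ 1.414...,  2.449...]])
>>> indices
array([[0, 2],
       [0, 2],
       [4, 3]])
These exact indices and Euclidean distances differ from the LSHForest output above, which is expected: LSHForest is approximate and, as noted, reports cosine rather than Euclidean distances.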
If it helps, check out this and note that it mentions:
The point at distance 'r' and the point returned don't have to be the same.
Hope this helps.