I'm working with a medium size dataset (shape=(14013L, 46L)
).
I want to smooth each sample with its knn.
I'm training my model with:
NearestNeighbors(n_neighbors, algorithm='ball_tree',
metric=sklearn.metrics.pairwise.cosine_distances)
And the smooth is as follows:
def smooth(x,nbrs,data,alpha):
"""
input:
alpha: the smoothing factor
nbrs: trained NearestNeighbors from sklearn
data: the original data
(since NearestNeighbors returns only the index and not the samples)
x: what we want to smooth
output:
smoothed x with its nearest neighbours
"""
distances, indices = nbrs.kneighbors(x)
distances = map(lambda z:abs(-z+1),distances)[0]
norm = sum(distances)
if norm == 0:
"No neighbours were found."
return x
distances = map(lambda z: (1-alpha)*z/norm ,distances)
indices = map(lambda z: data[z],indices)[0]
other = np.array([indices[i] * distances[i] for i in range(len(distances))])
z = x * alpha
z = z.reshape((1,z.shape[0]))
smoothed = sum(np.concatenate((other,z),axis=0))
return smoothed
My questions:
- How is it possible that no neighbors were found ?(I experienced it on my dataset hence the
if
condition) - The fitting ("training") takes 18 seconds, but smoothing ~1000 samples takes more than 20 minutes. I'm willing to get less accurate results, if the smoothing process will be shorter. Is it possible? which parameters should I change in order to achieve it?
First of all, why to use a Ball tree? Maybe your metric does imply this to you, but if that's not the case, you could use a kd-tree too.
I will approach your question from a theoretic point of view. The
radius
parameter is set 1.0 by default. This might be too small for your data set, since if I understand correctly, this will specify the radius from the query to look into. So, I would suggest to run your code with increasing this value, until you get some results. Then decrease it, until you get no results. Increase it a bit again and you have the optimal parameter for your data set.An important parameter is
leaf_size
, which actually affects how many points you are going to check when a query arrives. A small value for this parameter might yield faster execution, with less precision.You might also want to check this answer of mine, that explains the trade off between speed and accuracy, a trade off that is critical to understand when doing Nearest Neighbor search.