NearestNeighbors tradeoff - run faster with less accurate results


I'm working with a medium-sized dataset (shape=(14013L, 46L)). I want to smooth each sample with its k nearest neighbours. I'm training my model with:

NearestNeighbors(n_neighbors, algorithm='ball_tree',
                 metric=sklearn.metrics.pairwise.cosine_distances)

And the smoothing is done as follows:

def smooth(x,nbrs,data,alpha):
    """
    input:
        alpha: the smoothing factor
        nbrs: trained NearestNeighbors from sklearn
        data: the original data 
             (since NearestNeighbors returns only the index and not the samples)
        x:    what we want to smooth
    output:
        smoothed x with its nearest neighbours
    """
    distances, indices = nbrs.kneighbors(x)
    distances = map(lambda z:abs(-z+1),distances)[0]
    norm = sum(distances)
    if norm == 0:
        "No neighbours were found."
        return x
    distances = map(lambda z: (1-alpha)*z/norm ,distances)
    indices = map(lambda z: data[z],indices)[0]
    other = np.array([indices[i] * distances[i] for i in range(len(distances))])
    z = x * alpha
    z = z.reshape((1,z.shape[0]))
    smoothed = sum(np.concatenate((other,z),axis=0))
    return smoothed

My questions:

  1. How is it possible that no neighbors were found? (I experienced this on my dataset, hence the if condition.)
  2. The fitting ("training") takes 18 seconds, but smoothing ~1000 samples takes more than 20 minutes. I'm willing to accept less accurate results if the smoothing process is shorter. Is that possible? Which parameters should I change in order to achieve it?

There are 2 answers

gsamaras (best answer)

First of all, why use a ball tree? Maybe your metric forces that choice on you, but if that's not the case, you could use a k-d tree too.
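For example (as far as I know, the kd_tree algorithm does not accept callable metrics such as the cosine_distances used above, so this sketch assumes you can switch to a metric it supports):

from sklearn.neighbors import NearestNeighbors

# k-d tree variant; metric='euclidean' is an assumption here, since kd_tree
# cannot take the cosine_distances callable from the question
nbrs = NearestNeighbors(n_neighbors=10, algorithm='kd_tree',
                        metric='euclidean').fit(data)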


I will approach your question from a theoretical point of view. The radius parameter is set to 1.0 by default. This might be too small for your data set, since, if I understand correctly, it specifies the radius around the query point to search within. So I would suggest running your code with increasing values of this parameter until you get some results, then decreasing it until you get no results; increase it a bit again and you have the optimal value for your data set.
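Note that in scikit-learn the radius argument is used by radius_neighbors-style queries rather than by kneighbors, so a probing loop along those lines could look like the sketch below (the candidate radii, the default Euclidean metric, and the 100-sample probe set are my own illustrative choices):

from sklearn.neighbors import NearestNeighbors

for radius in (0.1, 0.2, 0.5, 1.0, 2.0):
    nbrs = NearestNeighbors(radius=radius, algorithm='ball_tree').fit(data)
    # radius_neighbors returns, for each query point, every neighbour within `radius`
    dist, ind = nbrs.radius_neighbors(data[:100])
    empty = sum(1 for neighbours in ind if len(neighbours) == 0)
    print("radius", radius, "->", empty, "of 100 queries found no neighbours")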

An important parameter is leaf_size, which affects how many points are actually checked when a query arrives. A small value for this parameter might yield faster execution, with less precision.
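A rough way to see the speed effect on your own data is to time the same batch of queries with different leaf_size values (the sizes and batch size below are arbitrary examples):

import time
from sklearn.neighbors import NearestNeighbors

for leaf_size in (10, 30, 100, 400):
    nbrs = NearestNeighbors(n_neighbors=10, algorithm='ball_tree',
                            leaf_size=leaf_size).fit(data)
    start = time.time()
    nbrs.kneighbors(data[:1000])   # query 1000 samples in a single call
    print("leaf_size", leaf_size, "query time", time.time() - start)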


You might also want to check this answer of mine, which explains the trade-off between speed and accuracy, a trade-off that is critical to understand when doing Nearest Neighbor search.

Andreus

Without your dataset and complete code, it is difficult to say for certain. Here is what I think.

distances = map(lambda z:abs(-z+1),distances)[0]
norm = sum(distances)

Because you index the result of the map, you should only be getting the first neighbor. If x is actually one of the points you used to train with, then the first nearest neighbor will be... x. Because you are using the cosine distance, that distance is exactly 1, and abs(1 - 1) == 0. Before I suggest a correction, let's talk about performance.
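A quick way to check what you are actually working with is to print the shapes (this snippet assumes x is a single 46-dimensional sample and nbrs was fitted as in the question):

distances, indices = nbrs.kneighbors(x.reshape(1, -1))
print(distances.shape)   # (1, n_neighbors): one row of distances per query point
print(indices.shape)     # (1, n_neighbors)
# distances[0, 0] is the distance to the closest neighbour; if x was part of
# the training data, that closest neighbour is x itself.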

With regard to performance: you are using the map function everywhere, which is a Python builtin. Scikit-learn is built on numpy, which is designed to do math much faster than native Python code. That is why training is so much faster than your code. You should be using numpy arithmetic instead of map. One example: this line

distances = map(lambda z: (1-alpha)*z/norm ,distances)

should be this

distances *= ((1-alpha)/norm)

If norm is an array, it should have the right dimensions for numpy's broadcasting rules to kick in and do "the right thing", entirely in C.

Okay, so since I'm suggesting using numpy arrays (instead of map and Python lists), I believe the correct thing for the two lines before your if statement is to drop the indexing. (The rest of your code is probably broken by the indexing as well; after the second line of your function, distances is not an array or a list, but a scalar.)

distances = np.abs( distances-1 )
norm = np.sum(distances)

Additionally, you should not call your smooth() function lots of times, once for each sample. You should call it on an N_samples by N_dimensions==46 numpy array and adjust your smooth() code appropriately, as in the sketch below. (NearestNeighbors, for example, will return an N_samples by N_neighbors array much more quickly than returning N_samples individual arrays of length N_neighbors.)
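A minimal vectorized sketch along those lines (my own rewrite under the assumptions above, keeping the question's abs(1 - distance) weighting and the norm == 0 fallback; smooth_all and its argument names are hypothetical):

import numpy as np

def smooth_all(X, nbrs, data, alpha):
    """Smooth every row of the (n_samples, 46) array X in one call."""
    distances, indices = nbrs.kneighbors(X)        # both (n_samples, n_neighbors)
    weights = np.abs(1.0 - distances)              # same weighting as the question
    norms = weights.sum(axis=1, keepdims=True)     # (n_samples, 1)
    safe = np.where(norms == 0, 1.0, norms)        # avoid division by zero
    weights = (1.0 - alpha) * weights / safe
    neighbours = data[indices]                     # (n_samples, n_neighbors, n_dims)
    smoothed = alpha * X + np.einsum('ij,ijk->ik', weights, neighbours)
    # rows with no usable neighbours fall back to the original sample
    return np.where(norms == 0, X, smoothed)

Calling smooth_all(samples, nbrs, data, alpha) once then replaces the per-sample loop, and the kneighbors query is made only once for the whole batch.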