I'm actually working on high dimensional data (~50.000-100.000 features) and nearest neighbors search must be performed on it. I know that KD-Trees has poor performance as dimensions grows, and also I've read that in general, all space-partitioning data structures tends to perform exhaustive search with high dimensional data.
Additionally, there are two important facts to be considered (ordered by relevance):
- Precision: The nearest neighbors must be found (not approximations).
- Speed: The search must be as fast as possible. (The time to create the data structure isn't really important).
So, I need some advice about:
- The data structure to perform k-NN.
- If it will be better to use an aNN (approximate nearest neighbor) approach, setting it as accurate as possible?.
No. Because of the curse of dimensionality, data structures that perform Nearest Neighbour search good in lower dimensions, fail to perform well in a high dimensional place. As a matter of fact, the query times becomes almost equally to the brute force one, thus it is worthless.
As a result, in a high dimensional space, one should go for Approximate Nearest Neighbour (ANN) search. To be honest, it's a must.
I would suggest LSH, or a number of RKD trees. In my answer here, I mention some good libraries that perform ANN in C++. However, notice that LSH solved the R-nearest neighbour problem, thus you specify the parameter R, which is actually the radius. Then, LSH will look for NN's inside that R from the query point, thus you can't really request k NN's.
On the other hand, RKD trees can do that and return you k NN's. I have a project, that builds a forest of RKD trees and performs ANN search in C++, but it targets only high dimensions. It can handle GIST datasets of 10^6 images in 960 dimensions in < 1 sec with about 90% outputs being true nearest neighbours. The name is kd-GeRaF. It will be updated in the next month with a distributed version, but it's already tested and ready to use. It also has a cute logo. :)
I also feel that you should read my answer, which says that the optimal data structure depends on the data.