Is kdtree used for speeding k-means clustering or not?

2.7k views Asked by At

I am doing a project by using k-means and my professor suggested kdtree. I found this implementation of kdtree in python (i know that there is also in scipy, but i couldn't find any sample implementation). My question is the same as the tittle, is it used kdtree for speeding up k-means, or i am wrong?

data = [(2,2),(1,0),(2,3),(10,5),(59,8),(4,2)]

tree = KDTree.construct_from_data(data)
nearest = tree.query(query_point=(5,4), t=3)
print nearest

output:

[(4, 2), (2, 3), (2, 2)]
2

There are 2 answers

1
alko On BEST ANSWER

As "Making k-means even faster", p 137, paper shows, kd-tree can be used for k means algorithm speed up for low-dimensional data, while straightforward Lloyd's algorithm is more efficient for higher dimension.

With high-dimensional data, indexing schemes such as k-d tree do not work well

See explanations in paper.

I suggest you to use one of established k-means implementation and worry about speed improvement only if you run into severe issues. For example, afaik, sklearn's KMeans is based on Lloyd's original algorithm.

6
Has QUIT--Anony-Mousse On

It can be used, but it's nontrivial. Most people only implement the straightforward un-accelerated solution.

The problem is that most kd-tree implementations only support nearest-neighbor queries.

This only pays off if you have a large number of clusters k, and build the index on the clusters.

For full kd-tree-k-means acceleration, you would need to implement a bipartite NN-join, where you would both have an index on the points and on the cluster centers. I'm not aware of any kd-tree implementation that would support this.