I'm using the brute-force algorithm for KNN to find nearest neighbors in my web service. One downside of this approach is that I need enough memory on every machine to load the whole array for KNN. Now I'm thinking of splitting the array, do KNN separately on many machines, and then merge the results using merge sort. But that would be slow if the client need to create a lot of connections to query each part of the result.
I have read about algorithms like KDTree
and Balltree
in the documentation of sklearn
, I wonder if I can somehow separate the array so that I can quickly check which part of the array I need to look into to find nearest neighbors.