Approximated closest pair algorithm

846 views Asked by At

I have been thinking about a variation of the closest pair problem in which the only available information is the set of distances already calculated (we are not allowed to sort points according to their x-coordinates).

Consider 4 points (A, B, C, D), and the following distances:

dist(A,B) = 0.5
dist(A,C) = 5
dist(C,D) = 2

In this example, I don't need to evaluate dist(B,C) or dist(A,D), because it is guaranteed that these distances are greater than the current known minimum distance.

  1. Is it possible to use this kind of information to reduce the O(n²) to something like O(nlogn)?

  2. Is it possible to reduce the cost to something close to O(nlogn) if I accept a kind of approximated solution? In this case, I am thinking about some technique based on reinforcement learning that only converges to the real solution when the number of reinforcements go to infinite, but provides a great approximation for small n.

  3. Processing time (measured by the big O notation) is not the only issue. To keep a very large amount of previous calculated distances can also be an issue.

  4. Imagine this problem for a set with 10⁸ points.

What kind of solution should I look for? Was this kind of problem solved before?

This is not a classroom problem or something related. I have been just thinking about this problem.

3

There are 3 answers

1
Ivan On BEST ANSWER

I suggest using ideas that are derived from quickly solving k-nearest-neighbor searches.

The M-Tree data structure: (see http://en.wikipedia.org/wiki/M-tree and http://www.vldb.org/conf/1997/P426.PDF ) is designed to reduce the number distance comparisons that need to be performed to find "nearest neighbors".

Personally, I could not find an implementation of an M-Tree online that I was satisfied with (see my closed thread Looking for a mature M-Tree implementation) so I rolled my own.

My implementation is here: https://github.com/jon1van/MTreeMapRepo

Basically, this is binary tree in which each leaf node contains a HashMap of Keys that are "close" in some metric space you define.

I suggest using my code (or the idea behind it) to implement a solution in which you:

  1. Search each leaf node's HashMap and find the closest pair of Keys within that small subset.
  2. Return the closest pair of Keys when considering only the "winner" of each HashMap.

This style of solution would be a "divide and conquer" approach the returns an approximate solution.

You should know this code has an adjustable parameter the governs the maximum number of Keys that can be placed in an individual HashMap. Reducing this parameter will increase the speed of your search, but it will increase the probability that the correct solution won't be found because one Key is in HashMap A while the second Key is in HashMap B.

Also, each HashMap is associated a "radius". Depending on how accurate you want your result you maybe able to just search the HashMap with the largest hashMap.size()/radius (because this HashMap contains the highest density of points, thus it is a good search candidate) Good Luck

2
RichardPlunkett On

If you only have sample distances, not original point locations in a plane you can operate on, then I suspect you are bounded at O(E). Specifically, it would seem from your description that any valid solution would need to inspect every edge in order to rule out it having something interesting to say, meanwhile, inspecting every edge and taking the smallest solves the problem.

Planar versions bypass O(V^2), by using planar distances to deduce limitations on sets of edges, allowing us to avoid needing to look at most of the edge weights.

0
Ante On

Use same idea as in space partitioning. Recursively split given set of points by choosing two points and dividing set in two parts, points that are closer to first point and points that are closer to second point. That is same as splitting points by a line passing between two chosen points.

That produces (binary) space partitioning, on which standard nearest neighbour search algorithms can be used.