Find the median of N^2 elements (large scale)


The question is this: assume we have N machines, each of which stores and can manipulate its own N elements. How can we find the median of all N^2 elements at the lowest cost?

This has been bothering me a lot. I hope to get an answer from you guys, thanks!

Sorry, my first description was too brief. The elements stored on each machine are random and unordered. The cost includes I/O cost as well as communication between machines; RAM and time should be considered too. I just want to find the most efficient way to get the median.

These are some solutions I have come up with:

  1. Use an external sort such as merge sort, then read off the median.
  2. Use bucket sort: divide all the elements into X consecutive buckets according to value, so we can determine which bucket the median is in. Scan that bucket to get the median.
  3. I think the O(N) algorithm for finding the kth number in "Introduction to Algorithms" should work here?
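Idea 2 can be sketched as a single-process simulation, offered here only as a rough illustration. It assumes integer elements and non-empty machines; the names `bucket_index`, `local_histogram` and `distributed_kth` are hypothetical, and each "machine" is just a Python list. Only the min/max pair, the bucket counts, and the contents of one bucket would need to cross the network.

```python
def bucket_index(value, lo, hi, num_buckets):
    """Map an integer in [lo, hi] to one of num_buckets consecutive value ranges."""
    return (value - lo) * num_buckets // (hi - lo + 1)

def local_histogram(data, lo, hi, num_buckets):
    """Work done on one machine: count its elements per bucket."""
    counts = [0] * num_buckets
    for v in data:
        counts[bucket_index(v, lo, hi, num_buckets)] += 1
    return counts

def distributed_kth(machines, k, num_buckets=16):
    """Return the k-th smallest (0-based) integer across all machines."""
    # Round 1: each machine reports only its local min and max.
    lo = min(min(m) for m in machines)
    hi = max(max(m) for m in machines)
    # Round 2: each machine sends num_buckets counts; the coordinator sums them.
    totals = [0] * num_buckets
    for m in machines:
        for i, c in enumerate(local_histogram(m, lo, hi, num_buckets)):
            totals[i] += c
    # Find the bucket holding rank k; `below` counts elements in earlier buckets.
    below = 0
    for target, count in enumerate(totals):
        if below + count > k:
            break
        below += count
    # Round 3: only the elements of the target bucket are shipped and sorted.
    candidates = sorted(
        v for m in machines for v in m
        if bucket_index(v, lo, hi, num_buckets) == target
    )
    return candidates[k - below]

machines = [[9, 1, 5], [3, 7, 2], [8, 4, 6]]
total = sum(len(m) for m in machines)
print(distributed_kth(machines, (total - 1) // 2))  # lower median
```

In this sketch the coordinator role is just a loop, so in principle any one of the N machines could play it. If the target bucket is still too large for one machine, the same bucketing step could be repeated on that bucket's value range.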

But still, all these solutions need an extra machine to do the job. I'm wondering whether there is a way to get the median using only these N machines?

Thanks!


There are 3 answers

ElKamina On
Step 1: Sort the numbers at each machine individually
Step 2: Send the median at each machine to a central place
Step 3: Sort the medians and send them to each machine
Step 4: For each element in the sorted medians, calculate its rank on each machine
Step 5: Calculate the rank of each element over all machines (just sum the ranks)
Step 6: Find the two elements in the sorted medians between which the global median lies
Step 7: For the next iteration, consider only elements between those two medians
        and repeat the whole thing again

In the worst case all the remaining elements in the second iteration will be on a single machine.

Complexity: Pretty sure it is O(n log n) (i.e., including parallelization it can be O(n^2 log n)).
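The steps above could look roughly like the following single-process simulation. It is one possible reading of the answer, not a definitive implementation: the function name `distributed_select` is hypothetical, each machine is a Python list, and the handling of duplicates (tracking both "less than" and "less than or equal" ranks) is an added assumption so that repeated values do not break the narrowing step.

```python
import bisect

def distributed_select(machines, k):
    """k-th smallest (0-based) across machines via repeated local-median pivots."""
    data = [sorted(m) for m in machines]       # Step 1: sort on each machine
    windows = [(0, len(d)) for d in data]      # active slice [start, end) per machine
    while True:
        # Steps 2-3: every machine with active elements sends its local median.
        pivots = sorted({d[(s + e) // 2] for d, (s, e) in zip(data, windows) if s < e})
        if not pivots:
            raise ValueError("k is out of range")
        # Steps 4-5: each machine ranks every pivot locally; ranks are summed.
        less = [0] * len(pivots)   # active elements strictly below the pivot
        leq = [0] * len(pivots)    # active elements at or below the pivot
        for d, (s, e) in zip(data, windows):
            for i, p in enumerate(pivots):
                less[i] += bisect.bisect_left(d, p, s, e) - s
                leq[i] += bisect.bisect_right(d, p, s, e) - s
        # Step 6: either a pivot has rank k, or k lies between two pivots.
        low, high, dropped = None, None, 0
        for i, p in enumerate(pivots):
            if less[i] <= k < leq[i]:
                return p
            if leq[i] <= k:
                low, dropped = p, leq[i]
            else:
                high = p
                break
        # Step 7: keep only elements strictly between the two pivots.
        narrowed = []
        for d, (s, e) in zip(data, windows):
            ns = bisect.bisect_right(d, low, s, e) if low is not None else s
            ne = bisect.bisect_left(d, high, s, e) if high is not None else e
            narrowed.append((ns, ne))
        windows = narrowed
        k -= dropped               # rank is now relative to the surviving elements

machines = [[9, 1, 5], [3, 7, 2], [8, 4, 6]]
total = sum(len(m) for m in machines)
print(distributed_select(machines, (total - 1) // 2))  # lower median
```

Each round discards at least the pivots themselves, so the loop terminates. Per round, a machine sends one median and two counts per pivot, so the elements themselves never leave their machine in this sketch.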

Lou Franco On

Can you estimate it rather than get it exactly?

If so, pick a constant K and fit a K-coefficient polynomial to the data on each machine. Send the coefficients to a central machine, which adds them and then finds the median by:

  1. Integrating the curve over the range to find the area under the curve
  2. Running a root-finding algorithm to find the point that splits the area in half.

The bigger K is, the less error there will be. The smaller K is, the more efficient it will be.

John Fisher On

You'll need to have a process that counts all the values (total across all the stores). Pick the middle index. Adjust the index to be an offset from the start of items on the appropriate machine. Ask that machine to sort the items and return the value for that index.