Processing big data on a distributed system


I was asked to solve this problem in an interview:
Suppose there are 4 million comments, each with its own id and timestamp. Design an efficient algorithm that finds the 1000 most recent comments. You have 40 servers, and each server can handle 10 thousand comments at a time.
I was thinking about using MapReduce. How do I implement the map and reduce functions to solve this problem?


1 Answer

Detritus

As the question specifically asks for an efficient algorithm, I suspect the interviewer cares less about techniques like MapReduce and more about the underlying algorithm you would use. This looks like an application of merge sort: divide the workload into 10K chunks, assign each chunk to a node for sorting, and then merge the sorted runs. Once complete, you have all 4 million entries sorted by timestamp and can simply take the 1000 most recent. This runs in O(n log n), with the per-chunk sorting done in parallel across the 40 servers.
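A minimal single-machine sketch of that approach in Python, assuming each comment is a dict with `id` and `timestamp` fields (the helper names `sort_chunk` and `top_k_recent` are my own for illustration). On the real cluster the per-chunk sorting would run on the 40 servers and only the final merge would be centralized:

```python
import heapq
import random

CHUNK_SIZE = 10_000   # each server handles 10K comments at a time (from the problem)
TOP_K = 1_000         # number of most recent comments wanted

def sort_chunk(chunk):
    """Work done on a single node: sort one chunk by timestamp, newest first."""
    return sorted(chunk, key=lambda c: c["timestamp"], reverse=True)

def top_k_recent(comments, k=TOP_K, chunk_size=CHUNK_SIZE):
    """Split the comments into chunks, sort each chunk (this step would be
    distributed across the servers), then merge the sorted runs and keep
    only the k newest entries."""
    chunks = [comments[i:i + chunk_size] for i in range(0, len(comments), chunk_size)]
    sorted_runs = [sort_chunk(c) for c in chunks]
    merged = heapq.merge(*sorted_runs, key=lambda c: c["timestamp"], reverse=True)
    return [next(merged) for _ in range(min(k, len(comments)))]

if __name__ == "__main__":
    # Small local demo with fake data instead of the full 4 million comments.
    data = [{"id": i, "timestamp": random.randint(0, 10**9)} for i in range(50_000)]
    newest = top_k_recent(data)
    print(newest[0], newest[-1])
```

One practical refinement: since only the top 1000 are needed, each node could return just its own 1000 newest entries instead of its full sorted chunk, which keeps the merge step small while giving the same result.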