I am investigating the performance of our algorithm, which runs on top of Hadoop 2.x. We would like to know how the calculation time breaks down into its different pieces:
- map time
- reduce time
- sort time
- shuffle time
On the reduce side, there is a clear distinction in the counters: each of the components (reduce, shuffle, merge) has a separate counter. On the map side, a sort also takes place, but I cannot find any counters related to the sort time or the amount of data sorted. How can I find out the map-side sort time?
Thanks.
You are talking about the map-side sort/spill. You can look here for a good presentation on performance at each stage of MapReduce. For more theory, see also Hadoop: The Definitive Guide, Chapter 6, "How MapReduce Works", under "Shuffle and Sort", "Map Side".
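To make the reduce-side breakdown from the question concrete, here is a minimal sketch of how the shuffle, merge and reduce durations could be derived from per-attempt timestamps. The field names (`startTime`, `shuffleFinishTime`, `mergeFinishTime`, `finishTime`) are modeled on what the job history reports for reduce attempts, but treat them as assumptions and check them against your own cluster's output. The function name and the sample values are made up for illustration. It does not cover the map-side sort, which is the part the question is actually missing.

```python
def reduce_phase_breakdown(attempt):
    """Split one reduce attempt's wall-clock time into phases (milliseconds).

    `attempt` is a dict of epoch-millisecond timestamps. The key names are
    assumptions, not a guaranteed Hadoop schema.
    """
    start = attempt["startTime"]
    shuffle_done = attempt["shuffleFinishTime"]
    merge_done = attempt["mergeFinishTime"]
    finish = attempt["finishTime"]

    # The phases run one after another, so the timestamps must be ordered.
    if not start <= shuffle_done <= merge_done <= finish:
        raise ValueError("timestamps are not in phase order")

    return {
        "shuffle_ms": shuffle_done - start,      # fetching map outputs
        "merge_ms": merge_done - shuffle_done,   # merging/sorting fetched segments
        "reduce_ms": finish - merge_done,        # running the reduce function
    }


# Hypothetical attempt record; the values are invented for this example.
example = {
    "startTime": 1000,
    "shuffleFinishTime": 4000,
    "mergeFinishTime": 4500,
    "finishTime": 9000,
}
print(reduce_phase_breakdown(example))
```

Summing these per-attempt values over all reduce attempts gives a rough total per phase, though attempts run in parallel, so the sums are not wall-clock job time.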