As far as I understand, the reduce task has three phases.
Shuffle, Sort and actual reduce invocation.
So usually in hadoop job's output we see something like, map 0% reduce 0% map 20% reduce 0% . . . map 90% reduce 10% . . .
So I assume that the reduce tasks start before all the maps are finished and this behavior is controlled by the slow start configuration.
Now I don't yet understand when does the setup method of the reducer is actually called.
In my use case, I have some files to parse in the setup method. The file is about 60MB in size and is picked up from the distributed cache. While the file is being parsed, there is another set of data from configuration that can update the just parsed record. After parsing and possible updation, the file is stored in a HashMap for fast lookups. So I would like this method to be invoked as soon as possible, possibly while the mappers are still doing their thing.
Is it possible to do this? Or is that what already happens?
Thanks
Setup
is called right before it is able to read the first key/values pair from the stream.Which is effectively after all mappers ran and all the merging for a given reducer partition is finished.