I would like to have your opinion regarding Partitioner vs MultipleOutputs.
Suppose I have a file which contains keys as
0:aaa
1:bbb
0:ccc
0:ddd
...
1:zzz
I would like to have 2 files: one containing the keys starting with 0: and the other containing the keys starting with 1:. Which approach should I use:
1) Use a custom Partitioner that parses the keys and returns 0 or 1 from getPartition().
2) Use MultipleOutputs.write in the reduce phase, parsing the key and passing zero or one as the namedOutput parameter of MultipleOutputs.write.
Which one is better? To me, 1) seems better because each reducer deals with a single file.
If your job only needs to split the input files into 2 parts, then MultipleOutputs is the better bet, as you can skip the shuffle / sort phase by running a map-only job.
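To make the "no shuffle needed" point concrete, here is a rough plain-Java sketch of the split itself. It is not Hadoop code: the class PrefixSplit and its methods are hypothetical names made up for illustration. In a real map-only job the same choice of name would be passed as the namedOutput argument to MultipleOutputs.write(namedOutput, key, value) inside the mapper, with job.setNumReduceTasks(0) on the driver.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, for illustration only: it mimics what a map-only job
// would do, routing each record to one of two named outputs in a single pass.
final class PrefixSplit {

    // Picks the output name from the key prefix. "zero" and "one" are the
    // names from the question; named outputs may only use letters and digits.
    static String outputNameFor(String key) {
        if (key.startsWith("0:")) {
            return "zero";
        }
        if (key.startsWith("1:")) {
            return "one";
        }
        throw new IllegalArgumentException("Unexpected key prefix: " + key);
    }

    // Groups keys by output name. Each record is looked at once and nothing
    // is sorted or moved between tasks, which is why no reduce phase is needed.
    static Map<String, List<String>> split(List<String> keys) {
        Map<String, List<String>> outputs = new TreeMap<>();
        for (String key : keys) {
            outputs.computeIfAbsent(outputNameFor(key), name -> new ArrayList<>()).add(key);
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("0:aaa", "1:bbb", "0:ccc", "0:ddd", "1:zzz");
        split(keys).forEach((name, group) -> System.out.println(name + " <- " + group));
    }
}
```

Keep in mind that in the real job every map task writes its own pair of files, so this only gives you exactly 2 files if there is a single map task.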
Now if you have lots of input files and don't want twice as many output files as you have input files, then the partitioner-based approach will let you consolidate the input into 2 outputs. They won't be nicely named, however, which is another benefit of MultipleOutputs, but you can easily fix this by using MultipleOutputs in your reducer together with LazyOutputFormat, so that the empty part-r files aren't written as output.
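For the partitioner-based approach, the decision logic could look something like the sketch below. It is plain Java so it can be run on its own; KeyPrefixPartition and partitionFor are hypothetical names. In Hadoop the same logic would go into the body of getPartition(key, value, numPartitions) of a class extending Partitioner, and I am assuming the driver sets 2 reduce tasks via job.setNumReduceTasks(2).

```java
// Plain-Java sketch of the decision a custom Partitioner would make.
// The class and method names are hypothetical, chosen for illustration.
final class KeyPrefixPartition {

    // Returns the partition index for keys shaped like "0:aaa" or "1:bbb".
    static int partitionFor(String key, int numPartitions) {
        int colon = key.indexOf(':');
        if (colon <= 0) {
            throw new IllegalArgumentException("Key has no numeric prefix: " + key);
        }
        int prefix = Integer.parseInt(key.substring(0, colon));
        // The result must lie in [0, numPartitions), so fold the prefix into that range.
        return Math.floorMod(prefix, numPartitions);
    }

    public static void main(String[] args) {
        for (String key : new String[] {"0:aaa", "1:bbb", "0:ccc", "1:zzz"}) {
            System.out.println(key + " -> partition " + partitionFor(key, 2));
        }
    }
}
```

With 2 reducers, all 0: keys go to the first reducer and all 1: keys to the second, however many input files there are. If the job ends up with a single reducer, everything folds into partition 0 and you get one file.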
So to answer - it depends on how many input files you have, and how many output files you want.