In WordCount, it appears that you can get more than one map task per block, even with speculative execution off.
Does the JobTracker do some magic under the hood to distribute tasks beyond what the InputSplits provide?
The answer to this lies in the way that Hadoop InputFormats work:
In HDFS:
Let's take an example where the blocks are 1 MB, an input file written to HDFS is 10 MB, and the minimum split size is larger than a block (say, 2 MB).
1) First, the file is added to HDFS.
2) The file is split into 10 blocks, each of size 1 MB.
3) Then, FileInputFormat computes the split size from the block size and the configured limits: max(minSplitSize, min(maxSplitSize, blockSize)) = 2 MB.
4) Since each 1 MB block is smaller than the minimum split size, blocks are not split any further; instead, consecutive blocks are packed into 2 MB splits, so the job runs 5 map tasks instead of 10.
The moral of the story: splits are computed by FileInputFormat at runtime, not fixed one-to-one by the blocks. Blocks below the minimum split size get merged into larger splits, and conversely a maximum split size below the block size carves a single block into several splits, which is exactly how you can end up with more than one map task per block.
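For concreteness, here is a minimal, self-contained sketch of that sizing rule. It mirrors the computeSplitSize logic in FileInputFormat (splitSize = max(minSize, min(maxSize, blockSize))); the 2 MB minimum and 512 KB maximum are the assumed example values from above, not Hadoop defaults:

    // A sketch of FileInputFormat's split sizing:
    // splitSize = max(minSize, min(maxSize, blockSize)).
    public class SplitSizeDemo {

        static long computeSplitSize(long blockSize, long minSize, long maxSize) {
            return Math.max(minSize, Math.min(maxSize, blockSize));
        }

        static void report(String label, long fileSize, long splitSize) {
            long numSplits = (fileSize + splitSize - 1) / splitSize; // ceiling division
            System.out.printf("%s: %d bytes per split -> %d map tasks%n",
                              label, splitSize, numSplits);
        }

        public static void main(String[] args) {
            final long MB = 1024 * 1024;
            long fileSize  = 10 * MB; // the 10 MB input file from the example
            long blockSize = 1 * MB;  // 1 MB HDFS blocks

            // Minimum split size (2 MB) larger than a block: blocks get merged,
            // so the 10-block file produces only 5 map tasks.
            report("min > block", fileSize,
                   computeSplitSize(blockSize, 2 * MB, Long.MAX_VALUE));

            // Maximum split size (512 KB) smaller than a block: each block is
            // carved into 2 splits, so the same file produces 20 map tasks.
            report("max < block", fileSize,
                   computeSplitSize(blockSize, 1, 512 * 1024));
        }
    }

(The real getSplits() also allows the final split to run up to 10% over the split size, via SPLIT_SLOP, so the counts above are the idealized ones.)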
I guess I totally forgot about this, but looking back, it has been a feature of Hadoop since the beginning: the ability of an InputFormat to arbitrarily re-split blocks at runtime is used by many ecosystem tools to distribute load in an application-specific way.
The tricky part is that in toy MapReduce jobs one would expect exactly one block per split in all cases, while on real clusters we overlook the default split size parameters, which don't come into play unless you are working with large files. The snippet below shows where those parameters are set per job.
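If you want to control this per job rather than rely on the defaults, the split size limits are ordinary job settings. A sketch using the new-API FileInputFormat helpers (these set mapreduce.input.fileinputformat.split.minsize and .maxsize; on Hadoop 1 the equivalent properties are mapred.min.split.size and mapred.max.split.size), with the 1 MB / 32 MB values chosen purely for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitConfigExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "wordcount");

            // Splits will never be smaller than 1 MB...
            FileInputFormat.setMinInputSplitSize(job, 1L * 1024 * 1024);

            // ...and never larger than 32 MB. A cap below the HDFS block size
            // forces FileInputFormat to cut each block into several splits,
            // i.e. several map tasks per block.
            FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);
        }
    }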
Blocks and splits are two different things. You might get more than one mapper for a single block if that block contains more than one split.
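To make the arithmetic of that statement concrete (the 128 MB block and 32 MB split cap below are assumed values, not defaults):

    // One 128 MB block with splits capped at 32 MB yields 4 splits,
    // and therefore 4 map tasks, for that single block.
    public class MappersPerBlock {
        public static void main(String[] args) {
            final long MB = 1024 * 1024;
            long blockSize = 128 * MB;
            long splitSize = Math.max(1, Math.min(32 * MB, blockSize)); // = 32 MB
            long mappersPerBlock = blockSize / splitSize;               // = 4
            System.out.println(mappersPerBlock + " map tasks for one block");
        }
    }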