If the input file size is 200 MB, there will be 4 blocks/input splits, but each data node will have a mapper running on it. If all 4 input splits are on the same data node, will only one map task be executed? Or how does the number of map tasks depend on the input splits?
Also, will the TaskTracker run on all the data nodes and the JobTracker on one node in the cluster?
Input Splits in Hadoop
The number of map tasks depends entirely on the number of input splits, not on the location of the blocks/splits. So for your case it will be 4. Even if, as you say, all the blocks are on one node, you also have to consider that there will be replicas of those blocks on other nodes. MapReduce has the concept of 'data locality', which Hadoop will try to take advantage of, and another thing to consider is the availability of resources. So for a block (any of its replicas, commonly 3) to be processed, Hadoop will look for a data node where the block is present and resources are available. It may end up in a situation like the one you describe, where replicas of all 4 blocks are present on one node and that node has the resources the map tasks need, but there will still be 4 map tasks, that is for sure.
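To make the arithmetic concrete, here is a minimal sketch in plain Java of how a FileInputFormat-style split size and split count would work out for a 200 MB file with a 64 MB block size. The constants are illustrative (nothing is read from HDFS), and the property names in the comments are just the usual configuration knobs that feed min/max split size:

    // Rough sketch of FileInputFormat-style split counting (illustrative only).
    public class SplitCountSketch {
        public static void main(String[] args) {
            long fileSize  = 200L * 1024 * 1024; // 200 MB input file
            long blockSize = 64L  * 1024 * 1024; // HDFS block size assumed here
            long minSize   = 1;                  // e.g. mapreduce.input.fileinputformat.split.minsize
            long maxSize   = Long.MAX_VALUE;     // e.g. mapreduce.input.fileinputformat.split.maxsize

            // FileInputFormat-style split size: max(minSize, min(maxSize, blockSize))
            long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));

            // Number of splits is roughly ceil(fileSize / splitSize):
            // 200 MB / 64 MB -> three full 64 MB splits plus an 8 MB tail = 4 splits.
            long numSplits = (fileSize + splitSize - 1) / splitSize;

            System.out.println("split size = " + splitSize + " bytes");
            System.out.println("num splits = " + numSplits + " (one map task per split)");
        }
    }

Running it prints 4 splits, which matches the answer above: the split count, and therefore the map task count, comes from the file size and split size, regardless of which nodes happen to hold the block replicas.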