In "Hadoop: The Definitive Guide", it says:
The client running the job calculates the splits for the job by calling getSplits(), then sends them to the jobtracker, which uses their storage locations to schedule map tasks to process them on the tasktrackers.
public abstract class InputSplit {
  public abstract long getLength() throws IOException, InterruptedException;
  public abstract String[] getLocations() throws IOException, InterruptedException;
}
We know that getLocations() returns an array of hostnames.
Question 1: How does the client know which hostnames to return? Isn't that the job of the jobtracker?
Question 2: Can two different InputSplit objects return the same hostname? How are the hostnames decided, and by whom?
My feeling is that the client contacts the namenode to get all the hostnames holding a file's blocks (replicas included), then does some arithmetic to arrive at the location set for each InputSplit. Is that right?
Q1. How does the client know which hostnames to return? Isn't that the job of the jobtracker?
A. The input splits are created by the InputFormat configured for the job. While building the logical set of splits, it contacts the NameNode and asks for the locations of the blocks that make up each split. The jobtracker's responsibility is then to schedule each map task with data locality in mind, based on the location information stored in the InputSplit.
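As a rough sketch of that client-side flow (plain Java, no Hadoop dependencies; `blockHosts` is a made-up stand-in for the replica locations a real client would get from the NameNode via `FileSystem.getFileBlockLocations()`):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitSketch {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB blocks

    // Stand-in for the NameNode's answer: the DataNodes holding the
    // replicas of the block that contains this byte offset. Real code
    // gets this from FileSystem.getFileBlockLocations(status, start, len).
    static String[] blockHosts(long offset) {
        int block = (int) (offset / BLOCK_SIZE);
        // Hypothetical round-robin placement over a 3-node cluster.
        return new String[] { "host" + (block % 3), "host" + ((block + 1) % 3) };
    }

    // One logical split: a byte range plus its preferred hosts.
    static class Split {
        final long start, length;
        final String[] hosts;
        Split(long start, long length, String[] hosts) {
            this.start = start;
            this.length = length;
            this.hosts = hosts;
        }
    }

    static List<Split> getSplits(long fileLength, long splitSize) {
        List<Split> splits = new ArrayList<>();
        for (long off = 0; off < fileLength; off += splitSize) {
            long len = Math.min(splitSize, fileLength - off);
            // Attach the hosts of the block containing the split's start.
            splits.add(new Split(off, len, blockHosts(off)));
        }
        return splits;
    }

    public static void main(String[] args) {
        for (Split s : getSplits(3 * BLOCK_SIZE, BLOCK_SIZE)) {
            System.out.println(s.start + " -> " + Arrays.toString(s.hosts));
        }
    }
}
```

The point of the sketch is that everything here runs on the client: the NameNode only answers "where are the blocks", and the jobtracker only consumes the resulting host lists when it schedules map tasks.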
Q2. Can two different InputSplit objects return the same hostname? How are the hostnames decided, and by whom?
A. Definitely. Each InputFormat has its own formula for calculating splits, and bear in mind that an input split need not be the same size as a block. If two splits fall inside the same block, or their blocks' replicas land on the same nodes (very likely on a small cluster with replication factor 3), they will report the same hostnames.
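A toy illustration of both points (sizes and hostnames are made up; the placement function is hypothetical, not Hadoop's): with a split size smaller than the block size, two consecutive splits can start inside the same block and therefore return identical host lists.

```java
import java.util.Arrays;

public class SplitVsBlock {
    static final long BLOCK = 128;  // toy block size
    static final long SPLIT = 100;  // split size need not equal block size

    // Hypothetical replica placement for the block containing this offset.
    static String[] hostsOfBlock(long offset) {
        long b = offset / BLOCK;
        return new String[] { "node" + (b % 2), "node" + ((b + 1) % 2) };
    }

    public static void main(String[] args) {
        long fileLen = 250;
        for (long off = 0; off < fileLen; off += SPLIT) {
            long len = Math.min(SPLIT, fileLen - off);
            // Splits [0,100) and [100,200) both start in block 0,
            // so they report the same hostnames.
            System.out.printf("split [%d,%d) -> %s%n",
                    off, off + len, Arrays.toString(hostsOfBlock(off)));
        }
    }
}
```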
Hope this helps.