Suppose I have a distributed network of computers with, say, 1000 storage nodes. If a new node is added, what should happen? Should the data now be divided equally across all 1001 nodes?
Also, would the answer change if there were 10 nodes instead of 1000?
The client machine first splits the file into blocks, say Block A and Block B. The client then contacts the NameNode and asks for locations to place these blocks (Block A and Block B). The NameNode returns a list of datanodes to the client for writing the data; it generally chooses datanodes that are nearest on the network.
The client then writes the first block to the first datanode in that list, and that datanode replicates the block to the other datanodes. The NameNode keeps the metadata about files and their associated blocks.
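The write path above can be sketched as a toy model. This is not the real HDFS client API; the class and method names (`MockNameNode`, `allocate`) are illustrative assumptions, and the real NameNode uses rack-aware placement rather than the simple rotation shown here:

```python
# Toy model of the HDFS write path: the client splits a file into blocks,
# asks the NameNode for a pipeline of datanodes per block, and the first
# datanode replicates to the rest. Names here are illustrative only.

BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size (128 MB)

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return how many blocks the client would split a file into."""
    return (file_size + block_size - 1) // block_size  # ceiling division

class MockNameNode:
    """Stands in for the NameNode: tracks which datanodes hold each block."""
    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication
        self.block_map = {}  # block id -> list of datanodes holding it

    def allocate(self, block_id):
        # A real NameNode picks datanodes using rack awareness and
        # proximity to the client; here we simply rotate through the list.
        start = block_id % len(self.datanodes)
        pipeline = [self.datanodes[(start + i) % len(self.datanodes)]
                    for i in range(self.replication)]
        self.block_map[block_id] = pipeline
        return pipeline

# A 300 MB file becomes 3 blocks (128 MB + 128 MB + 44 MB).
n_blocks = split_into_blocks(300 * 1024 * 1024)
nn = MockNameNode([f"dn{i}" for i in range(5)])
for b in range(n_blocks):
    pipeline = nn.allocate(b)
    # The client writes the block to pipeline[0]; that datanode then
    # forwards it to pipeline[1], and so on (pipelined replication).
```

The point of the model is the division of labour: the client only ever talks to the NameNode for metadata and to the first datanode for data; replication fans out between datanodes.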
HDFS will not move blocks from old datanodes to new datanodes to rebalance the cluster when a datanode is added to the Hadoop cluster. To do this, you need to run the balancer.
The balancer program is a Hadoop daemon that redistributes blocks by moving them from over utilized datanodes to underutilized datanodes, while adhering to the block replica placement policy that makes data loss unlikely by placing block replicas on different racks. It moves blocks until the cluster is deemed to be balanced, which means that the utilization of every datanode (ratio of used space on the node to total capacity of the node) differs from the utilization of the cluster (ratio of used space on the cluster to total capacity of the cluster) by no more than a given threshold percentage.
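The "balanced" criterion from the paragraph above can be written out directly. This is a sketch of the criterion only, not the balancer's actual implementation; the node names and numbers are made up:

```python
# Sketch of the balancer's termination condition: a cluster is balanced
# when every datanode's utilization (used / capacity) is within a given
# threshold of the cluster-wide utilization. Example data is invented.

def utilization(used, capacity):
    return used / capacity

def over_and_under_utilized(nodes, threshold=0.10):
    """nodes: dict of name -> (used_bytes, capacity_bytes).
    Returns (over, under): datanodes whose utilization deviates from
    the cluster utilization by more than `threshold`."""
    total_used = sum(u for u, _ in nodes.values())
    total_cap = sum(c for _, c in nodes.values())
    cluster_util = utilization(total_used, total_cap)
    over = [n for n, (u, c) in nodes.items()
            if utilization(u, c) - cluster_util > threshold]
    under = [n for n, (u, c) in nodes.items()
            if cluster_util - utilization(u, c) > threshold]
    return over, under

# A freshly added, empty datanode (dn3) shows up as under-utilized;
# the balancer would move blocks from the over-utilized nodes to it.
nodes = {"dn1": (80, 100), "dn2": (70, 100), "dn3": (0, 100)}
over, under = over_and_under_utilized(nodes)
```

Here the cluster utilization is 150/300 = 50%, so with a 10% threshold `dn1` (80%) and `dn2` (70%) are over-utilized and the new empty node `dn3` is under-utilized. The balancer moves blocks from `over` to `under` until both lists are empty, while still respecting the replica placement policy.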
Reference: Hadoop: The Definitive Guide, 3rd Edition, p. 350
As a Hadoop admin, you should schedule a balancer run (the `hdfs balancer` command) once a day to keep blocks balanced across the cluster.
Useful link related to balancer:
http://www.swiss-scalability.com/2013/08/hadoop-hdfs-balancer-explained.html
http://www.cloudera.com/content/cloudera/en/documentation/cdh4/latest/CDH4-Installation-Guide/cdh4ig_balancer.html