I am trying to configure a 50-node Hadoop 2.6.0 cluster for failure tolerance. Specifically, I'd like to be able to suddenly stop 5 servers and still have my job complete. So far, stopping even 1 server causes my job to fail with too many map failures error.
We host HDFS on the same cluster with replication factor = 2.
Can someone provide guidance on how to do this?
Having looked at similar posts, I am not not looking to have my job complete on subset of data.