I am trying to understand the impact erasure coding could have on the read performance of a file.
Before that, a brief summary of Hadoop 3.0 erasure coding using the Reed-Solomon method: if a file is split into k data blocks and encoded into p parity blocks, then at least any k of the k+p blocks must be available to recreate the file. In Hadoop 2.0 the default replication factor was 3, so a 10-block file needs 30 blocks of space on the cluster. Hadoop 3.0 states that erasure coding provides a 50% space reduction, so the same 10 blocks stored on 3.0 should need only 15 blocks, i.e. the additional 5 blocks act as parity blocks.
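To make the space comparison concrete, here is a minimal sketch (plain Python, not Hadoop code) that computes the storage footprint of 3x replication versus the RS(10,5)-style layout assumed in this question; the (10, 5) parameters are taken from the example above, not necessarily an official Hadoop default policy.

    import math

    # Minimal sketch: storage footprint of 3x replication vs. Reed-Solomon EC.
    # The RS(10,5) layout matches the example in the question; it is an
    # assumption for illustration only.

    def replication_footprint(data_blocks, replication_factor=3):
        # Every block is stored replication_factor times.
        return data_blocks * replication_factor

    def ec_footprint(data_blocks, k=10, p=5):
        # Each group of k data blocks gets p parity blocks.
        groups = math.ceil(data_blocks / k)
        return data_blocks + groups * p

    blocks = 10
    print("Replication x3:", replication_footprint(blocks), "blocks")  # 30
    print("RS(10,5) EC:   ", ec_footprint(blocks), "blocks")           # 15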
In Hadoop 3.0, a file (file1) with 10 blocks will lead to 5 parity blocks (taking the space improvement with EC in 3.0 to be 50%). Say the original 10 data blocks are stored on nodes n0 to n9 and the 5 parity blocks are stored on nodes n10 to n14. Now a read operation on this file should normally fetch data only from the first 10 nodes, i.e. n0 to n9, since fetching data from the nodes with parity blocks would involve decoding and therefore take more time (right??).
Next, the maximum number of node failures this file can tolerate is 5.
If nodes n10 to n14 fail (the nodes with parity blocks), the read performance will not be affected; it is the same as in the scenario above.
But if nodes n5 to n9 fail (nodes holding data blocks), then I would guess that the read performance will be lower than in the cases above, because 5 of the 10 data blocks now have to be reconstructed from parity.
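A tiny sketch may make the two failure scenarios explicit. It encodes the k-out-of-(k+p) rule assumed above: a read succeeds if any 10 of the 15 blocks survive, and decoding is only needed when a data block is unavailable. The node names and that rule are assumptions from this question, not output from HDFS itself.

    # Sketch of the k-out-of-(k+p) read rule for the RS(10,5) example.
    DATA_NODES = [f"n{i}" for i in range(10)]        # n0..n9 hold data blocks
    PARITY_NODES = [f"n{i}" for i in range(10, 15)]  # n10..n14 hold parity blocks

    def read_plan(failed_nodes):
        alive = [n for n in DATA_NODES + PARITY_NODES if n not in failed_nodes]
        if len(alive) < 10:
            return "read fails: fewer than k=10 blocks available"
        missing_data = [n for n in DATA_NODES if n in failed_nodes]
        if not missing_data:
            return "normal read: all data blocks available, no decoding"
        return f"degraded read: decode {len(missing_data)} data block(s) from parity"

    print(read_plan(set()))                                 # normal read
    print(read_plan({"n10", "n11", "n12", "n13", "n14"}))   # parity lost, still a normal read
    print(read_plan({"n5", "n6", "n7", "n8", "n9"}))        # decoding required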
But in 2.0 you can expect the same performance irrespective of which nodes have failed, as long as the number of failed nodes holding a block's replicas is at most ReplicationFactor - 1, i.e. at least one replica of every block survives.
The question is: should erasure coding be added to the set of factors that can affect read performance in 3.0?
Did you have a look at these presentations?
https://fr.slideshare.net/HadoopSummit/debunking-the-myths-of-hdfs-erasure-coding-performance
https://fr.slideshare.net/HadoopSummit/hdfs-erasure-coding-in-action
EC will be slower than replication as soon as there are bad blocks. EC puts more pressure on the CPU on the write path but less pressure on IO. What is less clear is how EC affects read performance in real life, where your Spark or Hadoop jobs do not span the whole cluster and suffer from a lack of data locality. I would expect that replication 3 gives more headroom to optimize data locality in a non-ideal configuration compared to EC, but I have not managed to gather feedback on this.
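To illustrate the locality point, here is a rough, illustrative sketch (not a benchmark) assuming the striped layout HDFS erasure coding uses, where each block group's data is spread evenly across the k data nodes; the RS(10,5) parameters are again the ones from the question above.

    # Rough comparison of how much of the data a single task can read from
    # its local node. With replication, a locality-aware scheduler can place
    # the task on a node holding a full replica of its block; with striped
    # EC, a node holds roughly 1/k of each block group's data.

    def local_fraction_replication():
        # The whole block can sit on the task's node.
        return 1.0

    def local_fraction_striped_ec(k=10):
        # Only the local stripe (~1/k of the block group) is node-local.
        return 1.0 / k

    print("Replication: ~%.0f%% of the block readable locally"
          % (100 * local_fraction_replication()))
    print("RS(10,5) EC: ~%.0f%% of the block group readable locally"
          % (100 * local_fraction_striped_ec()))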