Hadoop 3.0 erasure coding - determining the number of acceptable node failures?

363 views Asked by At

In hadoop 2.0 the default replication factor is 3. And the number of node failures acceptable was 3-1=2.
So on a 100 node cluster if a file was divided in to say 10 parts (blocks), with replication factor of 3 the total storage blocks required are 30. And if any 3 nodes containing a block X and it's replicas failed then the file is not recoverable. Even if the cluster had 1000 nodes or the file was split in to 20 parts, failure of 3 nodes on the cluster can still be disastrous for the file.

Now stepping into hadoop 3.0.
With erasure coding, as Hadoop says it provides the same durability with 50% efficient storage. And based on how Reed-Solomon method works (that is for k data blocks and n parity blocks, at least k of the (k+n) blocks should be accessible for the file to be recoverable/readable)
So for the same file above - there are 10 data blocks and to keep the data efficiency to 50%, 5 parity blocks can be added. So from the 10+5 blocks, at least any 10 blocks should be available for the file to be accessible. And on the 100 node cluster if each of the 15 blocks are stored on a separate node, then as you can see, a total of 5 node failures is acceptable. Now storing the same file (ie 15 blocks) on a 1000 node cluster would not make any difference w.r.t the number of acceptable node failures - it's still 5.
But the interesting part here is - if the same file (or another file) was divided into 20 blocks and then 10 parity block were added, then for the total of 30 blocks to be saved on the 100 node cluster, the acceptable number of node failures is 10.

The point I want to make here is -
in hadoop 2 the number of acceptable node failures is ReplicationFactor-1 and is clearly based on the replication factor. And this is a cluster wide property.

but in hadoop 3, say if the storage efficiency was fixed to 50%, then the number of acceptable node failures seems to be different for different files based on the number of blocks it is divided in to.

So can anyone comment if the above inference is correct? And how any clusters acceptable node failures is determined?

(And I did not want to make it complex above, so did not discuss the edge case of a file with one block only. But the algorithm, I guess, will be smart enough to replicate it as is or with parity data so that the data durability settings are guaranteed.)

Edit: This question is part of a series of questions I have on EC - Others as below -
Hadoop 3.0 erasure coding: impact on MR jobs performance?

1

There are 1 answers

6
rcgldr On

Using your numbers for Hadoop 2.0, each block of data is stored on 3 different nodes. As long as any one of the 3 nodes has not failed to read a specific block, that block of data is recoverable.

Again using your numbers, for Hadoop 3.0, every set of 10 blocks of data and 5 blocks of parities are stored on 15 different nodes. So the data space requirement is reduced to 50% overhead, but the number of nodes the data and parities are written to have increased by a factor of 5, from 3 nodes for Hadoop 2.0 to 15 nodes for Hadoop 3.0. Since the redundancy is based on Reed Solomon erasure correction, then as long as any 10 of the 15 nodes have not failed to read a specific set of blocks, that set of blocks is recoverable (maximum allowed failure for a set of blocks is 5 nodes). If it's 20 blocks of data and 10 blocks of parities, then the data and parity blocks are distributed on 30 different nodes (maximum allowed failure for a set of blocks is 10 nodes).

For a cluster-wide view, failure can occur if more than n-k nodes fail, regardless of the number of nodes, since there's some chance that a set of data and parity blocks will happen to include all of the failing nodes. To avoid this, n should be increased along with the number of nodes in a cluster. For 100 nodes, each set could be 80 blocks of data, 20 blocks of parity (25% redundancy). Note 100 nodes would be unusually large. The example from this web page is 14 nodes RS(14,10) (for each set: 10 blocks of data, 4 blocks of parity).

https://hadoop.apache.org/docs/r3.0.0

with your numbers, the cluster size would be 15 (10+5) or 30 (20+10) nodes.

For a file with 1 block or less than k blocks, n-k parity blocks would still be needed to ensure that it takes more than n-k nodes to fail before a failure occurs. For Reed Solomon encoding, this could be done by emulating leading blocks of zeroes for the "missing" blocks.


I thought I'd add some probability versus number of nodes in a cluster.

Assume node failure rate is 1%.

15 nodes, 10 for data, 5 for parities, using comb(a,b) for combinations of a things b at a time:

Probability of exactly x node failures is:

6 => ((.01)^6) ((.99)^9) (comb(15,6)) ~= 4.572 × 10^-9
7 => ((.01)^7) ((.99)^8) (comb(15,7)) ~= 5.938 × 10^-11
8 => ((.01)^8) ((.99)^7) (comb(15,8)) ~= 5.998 × 10^-13
...

Probability of 6 or more failures ~= 4.632 × 10^-9

30 nodes, 20 for data, 10 for parities

Probability of exactly x node failures is:

11 => ((.01)^11) ((.99)^19) (comb(30,11)) ~= 4.513 × 10^-15
12 => ((.01)^12) ((.99)^18) (comb(30,12)) ~= 7.218 × 10^-17
13 => ((.01)^13) ((.99)^17) (comb(30,13)) ~= 1.010 × 10^-18
14 => ((.01)^14) ((.99)^16) (comb(30,14)) ~= 1.238 × 10^-20

Probability of 11 or more failures ~= 4.586 × 10^-15

To show that the need for parity overhead decreases with number of nodes, consider the extreme case of 100 nodes, 80 for data, 20 for parties (25% redundancy):

Probability of exactly x node failures is:

21 => ((.01)^21) ((.99)^79) (comb(100,21)) ~= 9.230 × 10^-22
22 => ((.01)^22) ((.99)^78) (comb(100,22)) ~= 3.348 × 10^-23
23 => ((.01)^23) ((.99)^77) (comb(100,23)) ~= 1.147 × 10^-24

Probability of 21 or more failures ~= 9.577 × 10^-22