RedShift Node Failover


I have a RedShift cluster of 4 nodes.

  1. When one of the nodes goes down, will the entire cluster become unavailable?
  2. If yes - for how long?
  3. When the cluster comes back, is it returned to exactly the same point it was at before the failure, or may the data be rolled back to an S3 snapshot from a few hours ago?
  4. How can I simulate this situation to test the scenario myself?

Thanks a lot!


There are 2 answers

diemacht On BEST ANSWER

If it's a single node failure, Amazon will start a new node and stream the data to it from the other nodes (each block is written to two different nodes, where more than one node exists). In that case, we can expect:

  1. Downtime of the entire cluster until the new node has started up and been filled with its share of the data. This should take about 3-4 minutes.
  2. After these 3-4 minutes, the cluster returns to exactly the same point it was at before it went down, and is available for both reads and writes.
  3. Some slowdown while data is redistributed across the cluster.
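
Since the cluster rejects all work during that window, a client may want to retry rather than fail outright. Below is a minimal sketch of that idea, not anything Redshift provides. `run_with_retry` and its parameters are hypothetical names, and it assumes the failure surfaces as Python's built-in `ConnectionError`; a real database driver will likely raise its own exception type, which you would catch instead.

```python
import time


def run_with_retry(operation, max_wait_seconds=600.0, initial_delay=5.0,
                   max_delay=60.0, sleep=time.sleep, clock=time.monotonic):
    """Call operation() until it succeeds or max_wait_seconds have elapsed.

    operation is any zero-argument callable (hypothetical: e.g. a function
    that opens a connection and runs one query).
    """
    deadline = clock() + max_wait_seconds
    delay = initial_delay
    while True:
        try:
            return operation()
        except ConnectionError:
            # Assumption: an unavailable cluster shows up as ConnectionError.
            # Give up if the next wait would take us past the deadline.
            if clock() + delay > deadline:
                raise
            sleep(delay)
            # Exponential backoff, capped at max_delay seconds.
            delay = min(delay * 2, max_delay)
```

The default of 600 seconds is an arbitrary choice that comfortably covers the 3-4 minutes mentioned above; replacement can take longer, so tune it to your own tolerance.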

If more than one node fails, Redshift will restore itself from the latest S3 backup. S3 backups are taken on the following occasions:

  1. If it's been 8 hours since the last backup
  2. If more than 5 GB of data has been written to Redshift since the last backup
  3. Manually
  4. You have the option of a final snapshot when you choose to terminate your cluster
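
The two automatic triggers in the list above can be written as a simple rule, which also shows how much data a multi-node failure could roll back: at most roughly 8 hours or 5 GB of writes, whichever is reached first. This is only an illustrative sketch of the rule as stated here, not Redshift's actual implementation. The function name is hypothetical, and treating 5 GB as 5 * 1024^3 bytes is an assumption.

```python
from datetime import datetime, timedelta

# Thresholds as stated in the list above.
SNAPSHOT_INTERVAL = timedelta(hours=8)
# Assumption: "5 GB" is interpreted as 5 * 1024**3 bytes.
SNAPSHOT_DATA_THRESHOLD_BYTES = 5 * 1024**3


def automated_snapshot_due(last_snapshot_at, now, bytes_written_since_snapshot):
    """Return True if either automatic trigger described above has fired."""
    waited_long_enough = (now - last_snapshot_at) >= SNAPSHOT_INTERVAL
    wrote_enough_data = bytes_written_since_snapshot > SNAPSHOT_DATA_THRESHOLD_BYTES
    return waited_long_enough or wrote_enough_data


if __name__ == "__main__":
    last = datetime(2013, 12, 18, 0, 0)
    # 3 hours and 1 GB after the last snapshot: neither trigger has fired.
    print(automated_snapshot_due(last, last + timedelta(hours=3), 1 * 1024**3))
    # 3 hours but 6 GB written: the data-volume trigger has fired.
    print(automated_snapshot_due(last, last + timedelta(hours=3), 6 * 1024**3))
```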
Tomasz Tybulewicz On

It just happened to my cluster: one of the nodes failed. It took almost 20 minutes for the failure to be noticed in the dashboard (the 'Performance' tab showed unhealthy, but the 'Status' tab showed healthy).

One hour after the initial failure, the cluster changed its state to 'modifying', and after another hour a new node was in place.

There is a message in 'Recent Events':

A node on Amazon Redshift cluster 'xxx' was automatically replaced at 2013-12-18 11:42 UTC. The cluster is now operating normally.

The cluster was unavailable the whole time: no queries could be run and no imports were possible.

The data is exactly the same as at the moment of the failure.
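
If you want to record when such a replacement happened, the 'Recent Events' message quoted above can be parsed. This is a rough sketch based only on that single example message, so the exact wording is an assumption and may differ for other events. `parse_node_replacement` is a hypothetical helper; the message text could come from the console or, for instance, from the `Message` field of events returned by boto3's Redshift `describe_events` call.

```python
import re
from datetime import datetime, timezone

# Pattern derived from the one example message quoted above (an assumption:
# other clusters or API versions may word the event differently).
REPLACED_PATTERN = re.compile(
    r"A node on Amazon Redshift cluster '(?P<cluster>[^']+)' was automatically "
    r"replaced at (?P<when>\d{4}-\d{2}-\d{2} \d{2}:\d{2}) UTC"
)


def parse_node_replacement(message):
    """Return (cluster_name, replaced_at_utc), or None if the text does not match."""
    match = REPLACED_PATTERN.search(message)
    if match is None:
        return None
    replaced_at = datetime.strptime(match.group("when"), "%Y-%m-%d %H:%M")
    # The message states the time in UTC, so attach that timezone explicitly.
    return match.group("cluster"), replaced_at.replace(tzinfo=timezone.utc)
```

Comparing the parsed timestamp with the time your own monitoring first saw errors gives a rough measure of the outage length, which in this case was about two hours rather than a few minutes.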