Redis failover when the primary is up but slowed down by the network


I am not a Redis guru.

We had a recent case in production where AWS traffic caused a huge network spike for about 15 minutes. Once the network issue went away, the cluster returned to normal in about 30 seconds. Typical response time is around 1ms, but during the network issue it rose to 300-600ms or more. Our Redis primary was in the affected AZ and was still replying to queries, just slowly. Sentinel never marked it down because it could still contact it.

I would like to know whether it is possible to detect network latency, or slow response times in general, and force an election.

I am aware of this:

# sentinel down-after-milliseconds <master-name> <milliseconds>
#
# Number of milliseconds the master (or any attached slave or sentinel) should
# be unreachable (as in, not acceptable reply to PING, continuously, for the
# specified period) in order to consider it in S_DOWN state (Subjectively
# Down).
#
# Default is 30 seconds.
sentinel down-after-milliseconds mymaster 30000

My interpretation is that the above marks a server as DOWN only if Sentinel has failed to contact it at all during this timeout window, not if the server is still reachable but responding slowly.

EXAMPLE:

If the normal response time is 1ms and the delay suddenly goes to 300ms and stays there for 15 minutes, I would like Sentinel to collect response times between all the Redis servers and elect the one that is up to date with the master and has the lowest response time.
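To make the idea concrete, here is a rough sketch of the kind of external watchdog I have in mind, in case Sentinel cannot do this itself. It times a few PING round trips, treats the primary as degraded when the median exceeds a threshold, and asks for a failover only after several degraded rounds in a row. The names `measure_ping_ms`, `is_degraded`, `LatencyWatchdog`, and the default values of `threshold_ms` and `required_bad_rounds` are all made up for illustration; `ping` and `trigger_failover` are callables supplied by the caller, so nothing here depends on a particular client library.

```python
import time
from statistics import median
from typing import Callable, List


def measure_ping_ms(ping: Callable[[], object], samples: int = 5) -> List[float]:
    """Time `samples` calls to `ping`; return each round trip in milliseconds."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        ping()
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings


def is_degraded(timings_ms: List[float], threshold_ms: float) -> bool:
    """True when the median round trip is above the threshold.

    The median is used so a single slow reply does not count as degradation.
    """
    return median(timings_ms) > threshold_ms


class LatencyWatchdog:
    """Hypothetical helper: request a failover after N consecutive slow rounds."""

    def __init__(
        self,
        ping: Callable[[], object],
        trigger_failover: Callable[[], object],
        threshold_ms: float = 100.0,   # assumed value, tune for your setup
        required_bad_rounds: int = 3,  # assumed value, tune for your setup
    ) -> None:
        self.ping = ping
        self.trigger_failover = trigger_failover
        self.threshold_ms = threshold_ms
        self.required_bad_rounds = required_bad_rounds
        self.bad_rounds = 0

    def check_once(self) -> bool:
        """Run one measurement round; return True if a failover was requested."""
        if is_degraded(measure_ping_ms(self.ping), self.threshold_ms):
            self.bad_rounds += 1
        else:
            self.bad_rounds = 0  # any healthy round resets the streak
        if self.bad_rounds >= self.required_bad_rounds:
            self.trigger_failover()
            self.bad_rounds = 0  # avoid requesting a failover on every round
            return True
        return False
```

In practice `ping` could be the `ping` method of a client connected to the primary, and `trigger_failover` could send `SENTINEL FAILOVER mymaster` to one of the Sentinels; both wirings are assumptions on my part and are not tested here. As far as I understand, a failover forced this way still lets Sentinel choose the replica by its own rules (replica priority and replication offset), not by measured latency, so this only covers the detection half of what I am asking for.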

Is this possible?
