Connection to AWS MemoryDB cluster sometimes fails


We have an application that uses AWS MemoryDB for Redis. We have set up a cluster with one shard and two nodes: one node (named 0001-001) is the primary read/write node, and the other (named 0001-002) is a read replica.

After deploying the application, connecting to MemoryDB sometimes fails when we use the cluster endpoint in the connection string. If we restart the application a few times, it suddenly starts working. Whether a given start succeeds seems to be random. The error we get is the following:

Endpoint Unspecified/ourapp-memorydb-cluster-0001-001.ourapp-memorydb-cluster.xxxxx.memorydb.eu-west-1.amazonaws.com:6379 serving hashslot 6024 is not reachable at this point of time. Please check connectTimeout value. If it is low, try increasing it to give the ConnectionMultiplexer a chance to recover from the network disconnect. IOCP: (Busy=0,Free=1000,Min=2,Max=1000), WORKER: (Busy=0,Free=32767,Min=2,Max=32767), Local-CPU: n/a

  • If we connect directly to the primary read/write node we get no such errors.
  • If we connect directly to the read replica it always fails. It even gets the error above, complaining about the "0001-001" node.
  • We use .NET Core 6
  • We use Microsoft.Extensions.Caching.StackExchangeRedis 6.0.4 which depends on StackExchange.Redis 2.2.4
  • The application is hosted in AWS ECS

StackExchangeRedisCache is added to the service collection in a startup file:

services.AddStackExchangeRedisCache(o =>
{
   o.InstanceName = redisConfiguration.Instance;
   o.ConfigurationOptions = ToRedisConfigurationOptions(redisConfiguration);
});

...where ToRedisConfigurationOptions returns a basic ConfigurationOptions object:

new ConfigurationOptions()
{
    EndPoints =
    {
        { "clustercfg.ourapp-memorydb-cluster.xxxxx.memorydb.eu-west-1.amazonaws.com", 6379 } // Cluster endpoint
    },
    User = "username",
    Password = "password",
    Ssl = true,
    AbortOnConnectFail = false,
    ConnectTimeout = 60000
};

We also tried multiple shards with multiple nodes, and the application still sometimes fails to connect to the cluster. We even tried updating the StackExchange.Redis dependency to 2.5.43, but with no luck.

We could "solve" it by connecting directly to the primary node, but if a failover occurs and 0001-002 becomes the primary node, we would have to change our connection string manually, which is not acceptable in a production environment.

Any help or advice is appreciated, thanks!
