Docker Swarm failover behavior seems a bit underwhelming

619 views Asked by At

I am currently trying to use Docker Swarm to set up our application (consisting of both stateless and stateful services) in a highly available fashion on a three node cluster. With "highly available" I mean "can survice the failure of one out of the three nodes".

We have been doing such installations (using other means, not Docker, let alone Docker Swarm) for quite a while now with good success, including acceptable failover behavior, so our application itself (resp. the services that constitute it) has/have proven that in such a three node setup it/they can be made highly available.

With Swarm, I get the application up and running successfully (with all three nodes up) and have taken care that I have each service configured redundantly, i.e., more than one instance exists for each of them, they are properly configured for HA, and not all instances of a service are located on the same Swarm node. Of course, I also took care that all my Swarm nodes joined the Swarm as manager nodes, so that anyone of them can be leader of the swarm if the original leader node fails.

In this "good" state, I can reach the services on their exposed ports on any of the nodes, thanks to Swarm's Ingress networking. Very cool. In a production environment, we could now put a highly-available loadbalancer in front of our swarm worker nodes, so that clients have a single IP address to connect to and would not even notice if one of the nodes goes down.

So now it is time to test failover behavior... I would expect that killing one Swarm node (i.e., hard shutdown of the VM) would leave my application running, albeit in "degraded" mode, of course. Alas, after doing the shutdown, I cannot reach ANY of my services via their exposed (via Ingress) ports anymore for a considerable time. Some do become reachable again and indeed have recovered successfully (e.g., a three node Elasticsearch cluster can be accessed again, of course now lacking one node, but back in "green" state). But others (alas, this includes our internal LB...) remain unreachable via their published ports.

"docker node ls" shows one node as unreachable

$ docker node ls
ID                           HOSTNAME  STATUS  AVAILABILITY  MANAGER 
STATUS
kma44tewzpya80a58boxn9k4s *  manager1  Ready   Active        Reachable
uhz0y2xkd7fkztfuofq3uqufp    manager2  Ready   Active        Leader
x4bggf8cu371qhi0fva5ucpxo    manager3  Down    Active        Unreachable

as expected.

What could I be doing wrong regarding my Swarm setup that causes these effects? Am I just expecting too much here?

0

There are 0 answers