I am currently trying to use Docker Swarm to set up our application (consisting of both stateless and stateful services) in a highly available fashion on a three node cluster. With "highly available" I mean "can survice the failure of one out of the three nodes".
We have been doing such installations (using other means, not Docker, let alone Docker Swarm) for quite a while now with good success, including acceptable failover behavior, so our application itself (resp. the services that constitute it) has/have proven that in such a three node setup it/they can be made highly available.
With Swarm, I get the application up and running successfully (with all three nodes up) and have taken care that I have each service configured redundantly, i.e., more than one instance exists for each of them, they are properly configured for HA, and not all instances of a service are located on the same Swarm node. Of course, I also took care that all my Swarm nodes joined the Swarm as manager nodes, so that anyone of them can be leader of the swarm if the original leader node fails.
In this "good" state, I can reach the services on their exposed ports on any of the nodes, thanks to Swarm's Ingress networking. Very cool. In a production environment, we could now put a highly-available loadbalancer in front of our swarm worker nodes, so that clients have a single IP address to connect to and would not even notice if one of the nodes goes down.
So now it is time to test failover behavior... I would expect that killing one Swarm node (i.e., hard shutdown of the VM) would leave my application running, albeit in "degraded" mode, of course. Alas, after doing the shutdown, I cannot reach ANY of my services via their exposed (via Ingress) ports anymore for a considerable time. Some do become reachable again and indeed have recovered successfully (e.g., a three node Elasticsearch cluster can be accessed again, of course now lacking one node, but back in "green" state). But others (alas, this includes our internal LB...) remain unreachable via their published ports.
"docker node ls" shows one node as unreachable
$ docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER
STATUS
kma44tewzpya80a58boxn9k4s * manager1 Ready Active Reachable
uhz0y2xkd7fkztfuofq3uqufp manager2 Ready Active Leader
x4bggf8cu371qhi0fva5ucpxo manager3 Down Active Unreachable
as expected.
What could I be doing wrong regarding my Swarm setup that causes these effects? Am I just expecting too much here?