First, the background:
Yesterday our AWS-based service in us-west-2, consisting of two Auto Scaling groups (and various other components such as RDS further back) behind an ALB, went offline for six hours. Service was only restored by building an entirely new ALB and migrating the rules and target groups over to it.
At 4:15 am our local time (GMT+10) the ALB stopped receiving inbound traffic and would not respond to web requests. We use it to terminate both port 80 and port 443 (with an SSL certificate). At the same time, every instance in its target groups was marked "unhealthy" (although they were most certainly operable) and no traffic was forwarded to them. DNS still resolved correctly to the ALB; it simply stopped responding. The symptoms were equivalent to a network router or switch being switched off or firewalled out of existence.
Our other EC2 servers that were not behind the ALB continued to operate.
Initial thoughts were:
a) Deliberate isolation by AWS? An unpaid bill, or offence taken at an abuse report? Unlikely, and AWS had not notified us of any transgression or other reason to take action.
b) A network configuration mistake on our part? No changes had been made to the NACLs or security groups for days, and we were sound asleep when it happened, so nobody was fiddling with settings. When we built the replacement ALB we reused the same NACLs and security groups without any problem.
c) Maintenance activity gone wrong? This seems most likely, but AWS appeared not to detect the failure, and we didn't pick it up ourselves because we had considered a complete, inexplicable, and undetected failure of an ALB to be "unlikely". We will need to put some external health checks of our own in place; we already have Nagios-based monitoring, so we can enable alerting from there (see the sketch below). But detection doesn't help if the ALB itself is unstable: it is not practical to keep building a new one every time this recurs.
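For what it's worth, the external check I have in mind is nothing more elaborate than something like this, run from a box outside AWS. This is only a rough sketch: the hostname is a placeholder for a record that points at the ALB, and the exit codes follow the standard Nagios plugin convention (0 = OK, 2 = CRITICAL) so it can slot into our existing alerting.

```python
#!/usr/bin/env python3
# Rough sketch of an external check run from outside AWS.
# SITE is a placeholder for a DNS record that points at the ALB.
import sys
import urllib.request

SITE = "www.example.com"   # placeholder
TIMEOUT = 10               # seconds

def probe(url):
    """Return True if the URL answers with a 2xx/3xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT) as resp:
            return 200 <= resp.status < 400
    except Exception:
        return False

ok_http = probe(f"http://{SITE}/")
ok_https = probe(f"https://{SITE}/")

if ok_http and ok_https:
    print(f"OK - {SITE} responding on 80 and 443")
    sys.exit(0)   # Nagios: OK

print(f"CRITICAL - http ok={ok_http}, https ok={ok_https} for {SITE}")
sys.exit(2)       # Nagios: CRITICAL
```

That would at least have woken us up at 4:15 am instead of letting the outage run for hours.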
The biggest concern is that this happened suddenly and unexpectedly and that AWS did not detect it. Normally we never worry about AWS network infrastructure because "it just works". Until now. And there are no user-serviceable options on an ALB (e.g. restart or refresh).
And now my actual question:
Has anyone else ever seen something like this? If so, what can be done to get service back faster, or to prevent it in the first place? If this happened to you, what did you do?
I'm going to close this off.
It happened again the following Sunday, and again this evening, with exactly the same symptoms. Restoration was initially achieved by creating a new ALB and migrating the rules and target groups over (roughly the steps sketched below). Curiously, the previous ALB was later observed to be operational again, but when we tried to reinstate it, it failed once more.
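For anyone who ends up doing the same migration in a hurry, it amounts to roughly the following. This is a minimal boto3 sketch of the equivalent API calls, not exactly what we ran; every name, ARN, subnet and security-group ID is a placeholder, and the existing target groups are simply attached to the new listeners since they are separate resources from the ALB itself.

```python
import boto3

elbv2 = boto3.client("elbv2", region_name="us-west-2")

# 1. Create a replacement ALB in the same subnets / security groups
#    as the failed one (placeholder IDs).
new_lb = elbv2.create_load_balancer(
    Name="replacement-alb",
    Subnets=["subnet-aaaa1111", "subnet-bbbb2222"],
    SecurityGroups=["sg-0123456789abcdef0"],
    Scheme="internet-facing",
    Type="application",
)["LoadBalancers"][0]

# 2. Recreate the port 80 and port 443 listeners, forwarding to the
#    EXISTING target groups (placeholder ARNs).
target_group_arn = "arn:aws:elasticloadbalancing:us-west-2:111122223333:targetgroup/web/PLACEHOLDER"

elbv2.create_listener(
    LoadBalancerArn=new_lb["LoadBalancerArn"],
    Protocol="HTTP",
    Port=80,
    DefaultActions=[{"Type": "forward", "TargetGroupArn": target_group_arn}],
)
elbv2.create_listener(
    LoadBalancerArn=new_lb["LoadBalancerArn"],
    Protocol="HTTPS",
    Port=443,
    Certificates=[{"CertificateArn": "arn:aws:acm:us-west-2:111122223333:certificate/PLACEHOLDER"}],
    DefaultActions=[{"Type": "forward", "TargetGroupArn": target_group_arn}],
)

# 3. Copy any additional listener rules across, then repoint DNS
#    (CNAME or alias record) at the new ALB's DNS name.
print("New ALB DNS name:", new_lb["DNSName"])
```

The slow part is not the API calls but waiting for the new ALB to provision and for DNS to move over, which is why this is not something we want to keep doing.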
Creating new ALBs is no longer a viable workaround, so we have switched to AWS Business Support to get direct help from AWS.
Our best hypothesis is this: AWS has changed something in their maintenance process, and the ALB (which under the hood is really just a collection of EC2 instances running AWS proprietary code) is failing as a result. But that is really just wild speculation.