I have an Elastic Beanstalk application that is intermittently not responding, and I'm unable to find out why.

What happens:
- The app will periodically respond with 200s to my health checks. Then it will just stop, and later come back on its own.
- Subsequent API calls return 200 when the app is in a good mood. Then suddenly all calls fail (until they don't anymore).
- In the logs I don't see any indication of a crash, but I'm new to this. I do see this peculiarity, which shows up many times and lines up with the API calls I make to the app:
Mar 31 05:15:47 ip-172-31-28-174 systemd[1]: Starting [email protected] - Refresh policy routes for ens5...
Mar 31 05:15:47 ip-172-31-28-174 ec2net[2485]: Starting configuration for ens5
Mar 31 05:15:48 ip-172-31-28-174 systemd[1]: [email protected]: Deactivated successfully.
Mar 31 05:15:48 ip-172-31-28-174 systemd[1]: Finished [email protected] - Refresh policy routes for ens5.
Mar 31 05:15:48 ip-172-31-28-174 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=refresh-policy-routes@ens5 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Mar 31 05:15:48 ip-172-31-28-174 audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=refresh-policy-routes@ens5 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Also, here's the setup:
- FastAPI Python app, deployed originally through the EB CLI with a Classic Load Balancer. The load balancer was later migrated.
- 2 minimum instances, 4 maximum instances (all t3.micro)
- All instances are healthy.
- EB environment is healthy
- An https:// listener is set up in the EB configuration, using a cert from AWS.
- CNAME configuration for the SSL cert on a subdomain.
- Default VPC with two subnets in two separate zones.
- Subnets are mapped to a route table that routes to an IGW

- Procfile:
web: gunicorn main:app --workers=4 --worker-class=uvicorn.workers.UvicornWorker
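As an aside on the Procfile above: gunicorn's rule of thumb is (2 x CPU cores) + 1 workers, and a t3.micro exposes 2 vCPUs, so --workers=4 is in the right ballpark. A quick sketch of that calculation (the helper name is my own, not from gunicorn):

```python
def suggested_workers(cores: int) -> int:
    # Gunicorn's rule of thumb: (2 x cores) + 1
    return 2 * cores + 1

# A t3.micro exposes 2 vCPUs, so the suggested count would be 5.
print(suggested_workers(2))
```

This is only a starting point; memory pressure on a t3.micro may argue for fewer workers.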
What could it be? Is it a networking configuration issue? The load balancer? Or something with the application environment? I was also able to deploy my code to a single-instance Elastic Beanstalk application and had no downtime issues. I was not able to easily get https on that instance, so I can't tell whether the issue was at the load balancer level or not.
I was able to figure out what was going on here.
Essentially, traffic was being routed to a private subnet whose route table sent traffic to a NAT gateway instead of an Internet gateway. Because there were two instances running, requests were only sometimes sent to an instance attached to the troubled subnet. To solve the problem, I updated the route table for the default subnet to send 0.0.0.0/0 to the Internet gateway (0.0.0.0/0 -> IGW). I did this because I was not able to easily change how Elastic Beanstalk picks the VPC and default subnets when launching from the command line.
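For anyone hitting the same thing, the fix can also be done from the AWS CLI. This is a sketch with placeholder VPC/route-table/gateway IDs (substitute your own): first inspect which gateway each route table points at, then swap the 0.0.0.0/0 route from the NAT gateway to the IGW.

```shell
# Show each route table's routes (gateway vs. NAT) for the VPC.
# vpc-0123456789abcdef0 is a placeholder ID.
aws ec2 describe-route-tables \
  --filters "Name=vpc-id,Values=vpc-0123456789abcdef0" \
  --query "RouteTables[].{Id:RouteTableId,Routes:Routes[].{Dest:DestinationCidrBlock,Gw:GatewayId,Nat:NatGatewayId}}"

# Replace the default route on the troubled route table so
# 0.0.0.0/0 goes to the Internet gateway instead of the NAT gateway.
# rtb-/igw- IDs are placeholders.
aws ec2 replace-route \
  --route-table-id rtb-0123456789abcdef0 \
  --destination-cidr-block 0.0.0.0/0 \
  --gateway-id igw-0123456789abcdef0
```

Note that changing a subnet's route table to an IGW makes it a public subnet, which is what the load-balanced instances needed here.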
There were a lot of things that led to this problem, which made it hard to troubleshoot. To be clear: