I'm a noob with Azure deployment, Kubernetes and HA implementation. When I implement health probes as part of my app deployment, the health probes fail and I end up with either a 503 (service unavailable) or 502 (bad gateway) error when I try accessing the app via its URL. When I remove the health probes, I can successfully access the app using its URL.
I use the following YAML deployment configuration when implementing the health probes, which is utilised by an Azure DevOps pipeline. The app takes under 5 mins to become available, so I set the initialDelaySeconds for the health probes to 300s.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myApp
spec:
  ...
  template:
    metadata:
      labels:
        app: myApp
    spec:
      ...
      containers:
      - name: myApp
        ...
        ports:
        - containerPort: 5000
        ...
        readinessProbe:
          tcpSocket:
            port: 5000
          initialDelaySeconds: 300
          periodSeconds: 5
          successThreshold: 1
          failureThreshold: 3
        livenessProbe:
          tcpSocket:
            port: 5000
          periodSeconds: 30
          initialDelaySeconds: 300
          successThreshold: 1
          failureThreshold: 3
        ...
When I perform the deployment and describe the pod, I see the following listed under 'Events' at the bottom of the output:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 2m1s (x288 over 86m) kubelet, aks-vm-id-appears-here Readiness probe failed: dial tcp 10.123.1.23:5000: connect: connection refused
(This is confusing, as it states the age as 2m1s, but the initialDelaySeconds is greater than this, so I'm not sure why it reports this as the age.)
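For what it's worth, the raw event object can be pulled to see the counters behind that line; its count field plus the first/last timestamps are what the "(x288 over 86m)" notation summarises, and the age shown is the time since the most recent occurrence. A sketch (the pod name is a placeholder):

kubectl get events --field-selector involvedObject.name=myApp-6d4cf56db6-abcde \
  -o custom-columns=REASON:.reason,COUNT:.count,FIRST:.firstTimestamp,LAST:.lastTimestamp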
The liveness probe subsequently fails with the same error. The IP matches the IP of my pod, and I see this under Containers in the pod description:
Containers:
  ....
    Port: 5000/TCP
The failure of the liveness and readiness probes results in the pod being continually terminated and restarted.
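To confirm the restarts are probe-driven rather than crashes, I can check the restart count and look for a Killing event that references the liveness probe (the pod name is a placeholder):

kubectl get pods -l app=myApp                        # RESTARTS column climbs on each liveness failure
kubectl describe pod myApp-6d4cf56db6-abcde | grep -A2 Killing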
The app has a default index.html page, so I believe the health probe should receive a 200 response if it's able to connect.
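Note that a tcpSocket probe only checks that the TCP handshake succeeds and never sees an HTTP status, so the 200 would only matter with an httpGet probe. A minimal sketch of that variant, assuming the app serves index.html at / on port 5000:

readinessProbe:
  httpGet:
    path: /
    port: 5000
  initialDelaySeconds: 300
  periodSeconds: 5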
Because the readiness probe is failing, the pod IP doesn't get added to the endpoints object and therefore isn't associated with the service.
If I comment out the readinessProbe and livenessProbe from the deployment, the app runs successfully when I use the URL via the browser, and the pod IP gets successfully assigned as an endpoint that the service can communicate with. The endpoint address is in the form 10.123.1.23:5000, i.e. port 5000 seems to be the correct port for the pod.
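For reference, this is how I check the endpoints (assuming the service is also named myApp):

kubectl get endpoints myApp -o wide       # lists 10.123.1.23:5000 only once the pod is Ready
kubectl describe service myApp            # the Endpoints: field is empty while the readiness probe fails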
I don't understand why the health probe is failing to connect. It looks correct to me that it should be trying to connect on an IP like 10.123.1.23:5000.
It's possible that the port is taking longer than 300s to become open, but I don't know a way to check that. If I enter a bash session on the pod, watch isn't available (I read that watch ss -lnt can be used to examine port availability).
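Since bash is available in the container, one workaround is bash's built-in /dev/tcp, which needs no extra tools. A sketch that polls until the port accepts connections and reports how long that took:

# Run inside a bash session on the pod; swap 127.0.0.1 for the pod IP
# to match exactly what the kubelet dials.
start=$(date +%s)
until (echo > /dev/tcp/127.0.0.1/5000) 2>/dev/null; do sleep 5; done
echo "port 5000 opened after $(( $(date +%s) - start ))s"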
The following answer suggests increasing initialDelaySeconds, but I already tried that: https://stackoverflow.com/a/51932875/1549918
I also saw the question "Liveness and readiness probe connection refused", but resource utilisation (e.g. CPU/RAM) is not the issue here.
UPDATE
If I curl from a replica of the pod to https://10.123.1.23:5000, I get a similar error (Failed to connect to ...the IP.. port 5000: Connection refused). Why could this be failing? I read that attempting this connection from another pod can indicate whether the health probes would be able to reach it too.
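Connection refused happens at the TCP connect stage, before TLS or HTTP are involved, so the https:// scheme shouldn't be the cause. A sketch of how I could narrow it down further (pod names are placeholders):

# From a replica: refused on both schemes means nothing is accepting connections.
kubectl exec -it myApp-replica -- curl -v http://10.123.1.23:5000/
kubectl exec -it myApp-replica -- curl -vk https://10.123.1.23:5000/
# From the failing pod itself: distinguishes "not listening at all" from
# "listening on 127.0.0.1 only" (the kubelet dials the pod IP, not loopback).
kubectl exec -it myApp-pod -- curl -v http://127.0.0.1:5000/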
ANSWER
If you are unsure whether your application is starting correctly, replace it with a known-good image, e.g. httpd: change the port to 80 and the image to httpd.
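A minimal sketch of that swap (the image tag and probe values are illustrative):

containers:
- name: myApp
  image: httpd:2.4             # known-good image that listens on port 80 immediately
  ports:
  - containerPort: 80
  readinessProbe:
    tcpSocket:
      port: 80
    periodSeconds: 5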
You might also want to increase the timeout for the health check, since timeoutSeconds defaults to 1 second; try timeoutSeconds: 5.
In addition, if your image is a web application, it would be better to use an HTTP probe.
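Putting both suggestions together, a liveness probe for a web app might look like this (the path and values are illustrative, assuming the app serves index.html on port 5000):

livenessProbe:
  httpGet:
    path: /index.html
    port: 5000
  timeoutSeconds: 5            # default is 1s, which a slow app can miss
  periodSeconds: 30
  failureThreshold: 3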