Kubernetes on Azure - liveness and readiness probes failing - Liveness probe failed with connect: connection refused


I'm a noob with Azure deployment, Kubernetes and HA implementation. When I implement health probes as part of my app deployment, the health probes fail and I end up with either a 503 (Service Unavailable) or 502 (Bad Gateway) error when I try accessing the app via its URL. When I remove the health probes, I can access the app successfully using its URL.

I use the following YAML deployment configuration when implementing the health probes, which is applied by an Azure DevOps pipeline. The app takes under 5 minutes to become available, so I set initialDelaySeconds for both probes to 300.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myApp
spec:
  ...
  template:
    metadata:
      labels:
        app: myApp
    spec:
      ...
      containers:
        - name: myApp
          ...
          ports:
            - containerPort: 5000
          ...
          readinessProbe:
            tcpSocket:
              port: 5000
            initialDelaySeconds: 300
            periodSeconds: 5
            successThreshold: 1
            failureThreshold: 3
          livenessProbe:
            tcpSocket:
              port: 5000
            initialDelaySeconds: 300
            periodSeconds: 30
            successThreshold: 1
            failureThreshold: 3

...

When I perform the deployment and describe the pod, I see the following listed under 'Events' at the bottom of the output:

  Type     Reason     Age                   From                             Message
  ----     ------     ----                  ----                             -------
  Warning  Unhealthy  2m1s (x288 over 86m)  kubelet, aks-vm-id-appears-here  Readiness probe failed: dial tcp 10.123.1.23:5000: connect: connection refused

(This is confusing, as it states the age as 2m1s, but initialDelaySeconds is greater than this, so I'm not sure why it reports that as the age.)
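
(In case it matters, I'm getting that output with something like the following, using the app label from the deployment above:)

  kubectl describe pod -l app=myApp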

The liveness probe subsequently fails with the same error. The IP address matches my pod's IP, and I see this under Containers in the pod description:

Containers:
....
Port:           5000/TCP

The failure of the liveness and readiness probes results in the pod being continually terminated and restarted.

The app has a default index.html page, so I believe the health probe should receive a 200 response if it's able to connect.

Because the readiness probe is failing, the pod IP doesn't get added to the endpoints object and therefore isn't associated with the service.

If I comment out the readinessProbe and livenessProbe from the deployment, the app runs successfully when I use the URL in the browser, and the pod IP gets added as an endpoint that the service can communicate with. The endpoint address is in the form 10.123.1.23:5000 - i.e. port 5000 seems to be the correct port for the pod.
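
(For reference, I'm checking that with the following - the service name is assumed to match the app name here:)

  kubectl get endpoints myApp
  # with the probes removed, ENDPOINTS lists 10.123.1.23:5000
  # with the probes in place, no address is listed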

Why would the health probe be failing to connect? It looks correct to me that it's trying to connect to 10.123.1.23:5000.

It's possible that the port is taking longer than 300s to become open, but I don't know of a way to check that. If I enter a bash session on the pod, watch isn't available (I read that watch ss -lnt can be used to examine port availability).
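
(The closest I can get is a one-off check via kubectl exec - the pod name below is a placeholder, and it assumes ss is present in the image:)

  # one-off listing of listening TCP sockets inside the container
  kubectl exec -it myApp-xxxx -- ss -lnt

  # crude substitute for watch, if the image has a shell
  kubectl exec -it myApp-xxxx -- sh -c 'while true; do ss -lnt; sleep 5; done'

My understanding is that if port 5000 never shows up in a LISTEN state, the app isn't binding to it (or is binding only to 127.0.0.1, which I gather would also cause connection refused).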

The following answer suggests increasing initialDelaySeconds, but I already tried that: https://stackoverflow.com/a/51932875/1549918

I also saw the question Liveness and readiness probe connection refused, but resource utilisation (e.g. CPU/RAM) is not the issue here.

UPDATE

If I curl from a replica of the pod to https://10.123.1.23:5000, I get a similar error (Failed to connect to ...the IP.. port 5000: Connection refused). Why could this be failing? I read that attempting the connection from another pod is a reasonable proxy for what the health probes see.
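
(For reference, the check looks roughly like this - the pod name below is a placeholder, and since the tcpSocket probe only tests whether the port accepts a connection, the plain http:// form is probably the closer test if the app isn't serving TLS:)

  # from an existing pod that has curl available
  kubectl exec -it some-other-pod -- curl -v http://10.123.1.23:5000/

  # or from a throwaway debug pod
  kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl \
    --command -- curl -v http://10.123.1.23:5000/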


There are 2 answers

James

If you are unsure whether your application is starting correctly, replace it with a known-good image, e.g. httpd.

Change the port to 80 and the image to httpd.

You might also want to increase the timeout for the health check, as it defaults to 1 second (e.g. timeoutSeconds: 5).

In addition, if your image is a web application then it would be better to use an HTTP probe.
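
A minimal test along those lines might look like this (a sketch only; the container name and probe timings are just illustrative values):

  containers:
    - name: probe-test
      image: httpd          # known-good image that serves a page on port 80
      ports:
        - containerPort: 80
      readinessProbe:
        httpGet:
          path: /
          port: 80
        periodSeconds: 5
        timeoutSeconds: 5   # default is 1
      livenessProbe:
        httpGet:
          path: /
          port: 80
        periodSeconds: 30
        timeoutSeconds: 5

If the probes pass with httpd but fail with your image, the problem is in the application (e.g. it isn't yet listening on port 5000 on all interfaces), not in the probe configuration.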

Bobby Donchev

Your statement

The app has a default index.html page, so I believe the health probe should receive a 200 response if it's able to connect.

is incorrect: a tcpSocket probe only checks that the port accepts a TCP connection; it never makes an HTTP request, so no 200 response is involved.

You are doing a tcpSocket check. Try switching to:

  livenessProbe:
    failureThreshold: 3
    httpGet:
      path: /
      port: 5000
      scheme: HTTP
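
The readinessProbe can be switched in the same way (the path and scheme here assume your app serves plain HTTP on port 5000):

  readinessProbe:
    failureThreshold: 3
    httpGet:
      path: /
      port: 5000
      scheme: HTTP
    periodSeconds: 5
    successThreshold: 1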