Why does Ingress fail when LoadBalancer works on GKE?


I can't get Ingress to work on GKE, owing to health check failures. I've tried all of the debugging steps I can think of, including:

  • Verified that I'm not running low on any quotas
  • Verified that my service is accessible from within the cluster
  • Verified that my service works behind a k8s/GKE Load Balancer
  • Verified that healthz checks are passing in the Stackdriver logs

... I'd love any advice on how to debug or fix this. Details below!


I have set up a Service of type LoadBalancer on GKE. It works great via the external IP:

apiVersion: v1
kind: Service
metadata:
  name: echoserver
  namespace: es
spec:
  ports:
  - port: 80
    targetPort: 8080
    protocol: TCP
  type: LoadBalancer
  selector:
    app: echoserver
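
(For reference, a quick way to double-check this end to end; <EXTERNAL_IP> is just a placeholder for whatever kubectl reports:)

$ kubectl --namespace es get service echoserver
$ curl -vvv http://<EXTERNAL_IP>/  2>&1 | grep "< HTTP"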

Then I try setting up an Ingress on top of this same service:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: echoserver-ingress
  namespace: es
  annotations:
    kubernetes.io/ingress.class: "gce"
    kubernetes.io/ingress.global-static-ip-name: "echoserver-global-ip"
spec:
  backend:
    serviceName: echoserver
    servicePort: 80
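
(For reference, the echoserver-global-ip named in the annotation has to be a reserved global static IP; reserving and checking one typically looks something like this:)

$ gcloud compute addresses create echoserver-global-ip --global
$ gcloud compute addresses describe echoserver-global-ip --global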

The Ingress gets created, but it thinks the backend nodes are unhealthy:

$ kubectl --namespace es describe ingress echoserver-ingress | grep backends
  backends:     {"k8s-be-31102--<snipped>":"UNHEALTHY"}
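
(For anyone following along: the "k8s-be-31102--…" name corresponds to a GCE backend service, so the same health state can also be inspected from the gcloud side, roughly like this, substituting the full backend name from the list output:)

$ gcloud compute backend-services list
$ gcloud compute backend-services get-health k8s-be-31102--<snipped> --global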

Inspecting the state of the Ingress backend in the GKE web console, I see the same thing:

0 of 3 healthy

The health check details appear as expected:

[screenshot: health check details]

... and from within a pod in my cluster I can call the service successfully:

# curl  -vvv echoserver  2>&1 | grep "< HTTP"
< HTTP/1.0 200 OK

# curl  -vvv echoserver/healthz  2>&1 | grep "< HTTP"
< HTTP/1.0 200 OK

And I can address the service by NodePort:

# curl  -vvv 10.0.1.1:31102  2>&1 | grep "< HTTP" 
< HTTP/1.0 200 OK

(Which is no surprise, since the LoadBalancer service I set up earlier is serving the site just fine.)

I also see healthz checks passing in Stackdriver logs:

[screenshot: Stackdriver log entries for healthz checks]

Regarding quotas, I checked and I'm only using 3 of 30 backend services:

$ gcloud compute project-info describe | grep -A 1 -B 1  BACKEND_SERVICES
- limit: 30.0
  metric: BACKEND_SERVICES
  usage: 3.0

There are 3 answers

Answer from Jayson Chacko (score: 1)

You have the health check timeout configured to 1 second. Increasing it to 5 seconds may solve the issue.
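
(As a sketch of how that change could be made, assuming the ingress created a standard HTTP health check; the actual name has to come from the list output, and note that the GCE ingress controller may later overwrite manual edits:)

$ gcloud compute health-checks list
$ gcloud compute health-checks update http <HEALTH_CHECK_NAME> --timeout=5s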

Answer from Vincent (score: 4)

I had a similar issue a few weeks ago. What fixed it for me was declaring an explicit nodePort in the service spec (and making the service type NodePort) so that the Google Cloud load balancer can probe it. The config that worked for me:

apiVersion: v1
kind: Service
metadata: 
  name: some-service
spec: 
  selector: 
    name: some-app
  type: NodePort
  ports: 
    - port: 80
      targetPort: 8080
      nodePort: 32000
      protocol: TCP

It might take some time for the ingress to pick this up. You can re-create the ingress to speed things up.
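
(A rough sketch of re-creating it, assuming the asker's ingress from the question and a local ingress.yaml manifest:)

$ kubectl --namespace es delete ingress echoserver-ingress
$ kubectl --namespace es apply -f ingress.yaml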

Answer from Chris (score: 0)

I had this issue, and eventually came across https://stackoverflow.com/a/50645953/9276, which made me look at my firewall settings. Sure enough, the last several NodePort services I had added were not covered by the firewall rule, so the health checks from the ingress pointing at them all failed. Manually adding the new node ports to the firewall rule fixed the issue for me.
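
(As a sketch of that check: the GCP health checkers probe the node ports from the documented source ranges 130.211.0.0/22 and 35.191.0.0/16, so the relevant rule has to allow TCP to those ports from those ranges. The rule name below is a placeholder, and --allow replaces the rule's existing allow list, so include any entries already present:)

$ gcloud compute firewall-rules list --filter="name~k8s"
$ gcloud compute firewall-rules update <RULE_NAME> --allow tcp:30000-32767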

However, unlike the linked answer, I do not have an invalid certificate in use. I'm guessing there are other errors or weird states that can cause this behavior, but I have not yet found out why the rules stopped being managed automatically.

Possibly unrelated: I did not have this problem in our QA environment, only in production, so there may be GCP project-level settings at play.