Cert-manager in Kubernetes: Client.Timeout exceeded while awaiting headers

Describing the issue:

I have a bare-metal Kubernetes cluster with a single node. MetalLB and the Nginx Ingress Controller are installed to route traffic into the cluster.

  • Kubernetes v1.28.3
  • MetalLB v0.13.12
  • Nginx Ingress Controller v1.8.0
  • Cert-manager v1.13.2

I created an Ingress resource to route requests to the ArgoCD instance deployed in the cluster, and installed the cert-manager Helm chart for TLS certificate management.

The ClusterIssuer looks to be installed correctly.

$ kubectl get clusterissuer -o wide

NAME                     READY   STATUS                                                 AGE
letsencrypt-production   True    The ACME account was registered with the ACME server   4d1h
letsencrypt-staging      True    The ACME account was registered with the ACME server   4d1h
selfsigned               True                                                           4d1h

This is the content of the Ingress resource declared for the ArgoCD deployment.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-production
    kubernetes.io/tls-acme: "true"
    nginx.ingress.kubernetes.io/backend-protocol: HTTPS
    nginx.ingress.kubernetes.io/ssl-passthrough: "true"
  labels:
    app.kubernetes.io/component: server
    app.kubernetes.io/environment: develop
    app.kubernetes.io/instance: argocd
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: argocd-server
    app.kubernetes.io/part-of: argocd
    app.kubernetes.io/version: v2.8.6
    helm.sh/chart: argo-cd-5.50.1
  name: argocd-server
  namespace: develop
spec:
  ingressClassName: nginx
  rules:
  - host: argocd.mydomain.com
    http:
      paths:
      - backend:
          service:
            name: argocd-server
            port:
              number: 443
        path: /
        pathType: Prefix
  tls:
  - hosts:
    - argocd.mydomain.com
    secretName: argocd-secret

But the ACME challenge can't be completed, and the Challenge, Order and Certificate resources are stuck in Pending status.

Waiting for HTTP-01 challenge propagation: failed to perform self check GET request 'http://argocd.mydomain.com/.well-known/acme-challenge/Scy7Eh4E8LvN6yM1rT3y4qcCYKfEVZ6MHJdQNqKJN7M'.

Get "http://argocd.mydomain.com/.well-known/acme-challenge/Scy7Eh4E8LvN6yM1rT3y4qcCYKfEVZ6MHJdQNqKJN7M": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Here is where it gets interesting: if I run curl against that URL from a machine outside the cluster, it works and I get the response. But if I run it from the node or from another pod inside the cluster, it times out.

It does not look like a DNS issue in the network, because running nslookup from the node or from a container inside the cluster works fine:

$ nslookup argocd.mydomain.com

Server:         127.0.0.53
Address:        127.0.0.53#53

Non-authoritative answer:
argocd.mydomain.com  canonical name = mydomain.com.
Name:   mydomain.com
Address: xx.xx.xx.xx

Executing the curl request from the node, outside any container, using the IP of the node in the network, the response is OK.

curl http://192.168.1.1/.well-known/acme-challenge/Scy7Eh4E8LvN6yM1rT3y4qcCYKfEVZ6MHJdQNqKJN7M -H "Host: argocd.mydomain.com"

But when I replace the local IP address with the FQDN, I get the timeout.

This is the specification of the Ingress Nginx controller service:

spec:
  allocateLoadBalancerNodePorts: true
  clusterIP: 10.96.108.180
  clusterIPs:
  - 10.96.108.180
  externalTrafficPolicy: Local
  healthCheckNodePort: 30708
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - appProtocol: http
    name: http
    nodePort: 31629
    port: 80
    protocol: TCP
    targetPort: http
  - appProtocol: https
    name: https
    nodePort: 30634
    port: 443
    protocol: TCP
    targetPort: https

The cm-acme-http-solver is created in the same namespace as the Ingress, with log entries like:

I1113 18:50:06.845744       1 solver.go:87] "cert-manager/acmesolver: got successful challenge request, writing key" host="mydomain.com" path="/.well-known/acme-challenge/1yJeWYLrQ_EP5MjWH_0Ztq8NodV81kaFkPHQ6Kz41CM" base_path="/.well-known/acme-challenge" token="1yJeWYLrQ_EP5MjWH_0Ztq8NodV81kaFkPHQ6Kz41CM"

I think I have ruled out the CNI: after replacing weave-net with Calico, I see the same behavior. It also doesn't look like a DNS problem, because the DNS entry resolves both inside and outside the cluster. However, running the curl command against the public IP fails:

curl -v -D- http://public.ip/.well-known/acme-challenge/1yJeWYLrQ_EP5MjWH_0Ztq8NodV81kaFkPHQ6Kz41CM

But if I replace the public IP with the private IP of the LoadBalancer service, it works fine.

Can anyone guide or help me to fix this problem?

There is 1 answer

Answer by VonC

The cm-acme-http-solver is created in the same namespace of the ingress

The fact that the cm-acme-http-solver pod is being created and is logging successful challenge requests is a good sign. It indicates that cert-manager is functioning and able to respond to ACME challenge requests. However, the issue seems to be in the routing of these challenge requests when they come from outside the cluster.

The flow of traffic from the internet to your Kubernetes cluster looks like this:

[ Internet ] --- [ MetalLB (LoadBalancer) ] --- [ Nginx Ingress Controller ]
                     |                              |
                     |                              `-- [ Ingress: ArgoCD ]
                     |                                 (Routes traffic to ArgoCD service)
                     |
                     `-- [ Public IP vs. Private IP Routing Issue ]
                           |
                           `-- [ cm-acme-http-solver Pod ]
                                (Handles ACME challenge requests)
                                |
                                `-- [ Logs: Successful Challenge Requests ]
                                     (But external requests timing out)

The MetalLB LoadBalancer is responsible for routing external traffic to the correct services inside the cluster. The Nginx Ingress Controller then routes this traffic to the appropriate Ingress resources, in this case, the ArgoCD service.

The cm-acme-http-solver pod, which handles the ACME challenge requests, is logging successful challenge requests, indicating it is functioning correctly within the cluster.
However, there is an issue with the routing of external requests to the cm-acme-http-solver, particularly when using the public IP.

The successful curl test with the private IP of the LoadBalancer service, but failure with the public IP, suggests a possible issue with how external traffic is being routed to the cluster. That could be a configuration issue with MetalLB or the way your network is handling traffic routing to the cluster.

Since MetalLB is used as a LoadBalancer, make sure it is correctly configured to handle both public and private IP addresses. MetalLB should correctly route incoming traffic on the public IP to the appropriate services within the cluster.

kubectl logs -n metallb-system -l component=speaker
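Beyond the speaker logs, it may help to look at which addresses MetalLB is configured to hand out and which one it assigned to the ingress Service. The following is a minimal sketch, assuming MetalLB is configured through its CRDs (as in v0.13.x) in the metallb-system namespace; the helper names metallb_pools and lb_ip are hypothetical, and the namespace and Service name you pass in depend on your installation.

```shell
# Sketch only: helper names are hypothetical; namespaces and Service
# names are assumptions that depend on how the charts were installed.

# List the address pools and L2 advertisements MetalLB knows about.
# (L2Advertisement objects only exist if layer 2 mode is in use.)
metallb_pools() {
  kubectl get ipaddresspools.metallb.io,l2advertisements.metallb.io \
    -n "${1:-metallb-system}" -o wide
}

# Print the IP that was assigned to a LoadBalancer Service.
# Usage: lb_ip <namespace> <service-name>
lb_ip() {
  kubectl get svc "$2" -n "$1" \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
}
```

For example, `lb_ip ingress-nginx ingress-nginx-controller` (both names assumed) should print the private address that the successful curl test used. If that address is in a private range, the public IP is presumably translated by a router in front of the cluster rather than handled by MetalLB itself.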

Check if the Nginx Ingress Controller is properly configured to handle traffic coming from the LoadBalancer. That includes ensuring that the SSL passthrough is working as intended.

Also double-check that no network restrictions (such as firewall rules or network security group settings) are blocking incoming traffic on the public IP.

You may need to trace the network packets to see where the routing fails. Tools like tcpdump or traceroute should help.

tcpdump -i <network-interface> 'port 80 and host <public-ip>'

Also, review the cert-manager challenges and logs in more detail to confirm that the challenges are indeed reaching the cluster and being responded to correctly.

kubectl describe challenge -n <namespace>
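To see what cert-manager sees, the self check can be repeated by hand from a throwaway pod. This is a rough sketch, not cert-manager's own logic: the helper names challenge_url and self_check_from_pod and the pod name acme-self-check are hypothetical, and it assumes the curlimages/curl image can be pulled by the cluster.

```shell
# Sketch only: helper and pod names are hypothetical.

# Build the HTTP-01 URL for a given host and challenge token.
# Usage: challenge_url <host> <token>
challenge_url() {
  printf 'http://%s/.well-known/acme-challenge/%s\n' "$1" "$2"
}

# Request that URL from a temporary pod inside the cluster and print
# the HTTP status code; curl gives up after 10 seconds.
# Usage: self_check_from_pod <host> <token>
self_check_from_pod() {
  kubectl run acme-self-check --rm -i --restart=Never \
    --image=curlimages/curl --command -- \
    curl -sS -m 10 -o /dev/null -w '%{http_code}\n' \
    "$(challenge_url "$1" "$2")"
}
```

Calling `self_check_from_pod argocd.mydomain.com <token>` with the token shown by `kubectl describe challenge` should reproduce the timeout from inside the cluster, which makes it easier to compare against the same request sent from an external machine.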

In the cluster I changed the CNI to Calico and the ingress to HAProxy, and the behavior is the same. The only difference is that I can now get the ACME response from a pod in the cluster over HTTPS, but over HTTP it still times out.

You would have to verify the SSL passthrough configuration. I also recall that MetalLB documents some compatibility issues with Weave and Calico here: metallb.universe.tf/installation/network-addons.

With the shift to HAProxy, make sure SSL passthrough is configured correctly. HAProxy handles SSL passthrough differently than Nginx, and it is crucial that the SSL traffic is forwarded to the backend without being terminated at the Ingress level.
Examine the HAProxy configuration to make sure it is set up to handle both HTTP and HTTPS traffic. Also, check the logs for any errors or warnings.

kubectl logs -n <haproxy-namespace> -l app=<haproxy-label>
kubectl describe svc <haproxy-service-name> -n <haproxy-namespace>

Calico provides powerful network policy enforcement. Verify that no policies are inadvertently blocking or misrouting HTTP traffic, and check both ingress and egress policies.

kubectl get networkpolicies --all-namespaces
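The command above lists only standard Kubernetes NetworkPolicy objects. Calico also has its own policy resources, which can be listed separately. A small sketch, assuming Calico was installed from its manifests so that its CRDs live in the crd.projectcalico.org group; the helper name calico_policies is hypothetical, and installations that use the Calico API server or calicoctl may expose these resources differently.

```shell
# Sketch only: assumes Calico's CRDs are registered under
# crd.projectcalico.org; the helper name is hypothetical.
calico_policies() {
  # Namespaced Calico policies, across all namespaces.
  kubectl get networkpolicies.crd.projectcalico.org --all-namespaces
  # Cluster-wide Calico policies (not namespaced).
  kubectl get globalnetworkpolicies.crd.projectcalico.org
}
```

If both lists are empty and no Kubernetes NetworkPolicy exists either, policy enforcement is unlikely to be what blocks the HTTP self check.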