Describing the issue:
I have a bare-metal Kubernetes cluster with a single node, with MetalLB and the Nginx Ingress Controller installed to route traffic into the cluster.
- Kubernetes v1.28.3
- MetalLB v0.13.12
- Nginx Ingress Controller v1.8.0
- Cert-manager v1.13.2
I created an Ingress resource to route requests to the ArgoCD instance deployed in the cluster, and installed the cert-manager Helm chart for TLS certificate management.
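For reference, cert-manager was installed from the official Helm chart, roughly like this (release name and namespace may differ):
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace \
  --version v1.13.2 \
  --set installCRDs=true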
The ClusterIssuers look to be installed correctly:
$ kubectl get clusterissuer -o wide
NAME                     READY   STATUS                                                  AGE
letsencrypt-production   True    The ACME account was registered with the ACME server    4d1h
letsencrypt-staging      True    The ACME account was registered with the ACME server    4d1h
selfsigned               True                                                            4d1h
This is the content of the Ingress resource declared for the ArgoCD deployment.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-production
    kubernetes.io/tls-acme: "true"
    nginx.ingress.kubernetes.io/backend-protocol: HTTPS
    nginx.ingress.kubernetes.io/ssl-passthrough: "true"
  labels:
    app.kubernetes.io/component: server
    app.kubernetes.io/environment: develop
    app.kubernetes.io/instance: argocd
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: argocd-server
    app.kubernetes.io/part-of: argocd
    app.kubernetes.io/version: v2.8.6
    helm.sh/chart: argo-cd-5.50.1
  name: argocd-server
  namespace: develop
spec:
  ingressClassName: nginx
  rules:
  - host: argocd.mydomain.com
    http:
      paths:
      - backend:
          service:
            name: argocd-server
            port:
              number: 443
        path: /
        pathType: Prefix
  tls:
  - hosts:
    - argocd.mydomain.com
    secretName: argocd-secret
But the ACME challenge can't be completed, and the Challenge, Order, and Certificate resources are stuck in a Pending status:
Waiting for HTTP-01 challenge propagation: failed to perform self check GET request 'http://argocd.mydomain.com/.well-known/acme-challenge/Scy7Eh4E8LvN6yM1rT3y4qcCYKfEVZ6MHJdQNqKJN7M'.
Get "http://argocd.mydomain.com/.well-known/acme-challenge/Scy7Eh4E8LvN6yM1rT3y4qcCYKfEVZ6MHJdQNqKJN7M": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
The fun part begins now: if I execute a curl command against that URL from a machine outside the cluster, it works and I get the response. But if I execute it from the node or from another pod inside the cluster, I get a timeout.
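For reference, the two tests look roughly like this (curl-test is just a throwaway pod name):
# run from a throwaway pod inside the cluster: this times out
$ kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl --command -- \
    curl -v --max-time 10 "http://argocd.mydomain.com/.well-known/acme-challenge/Scy7Eh4E8LvN6yM1rT3y4qcCYKfEVZ6MHJdQNqKJN7M"

# run from a machine outside the cluster: this works
$ curl -v "http://argocd.mydomain.com/.well-known/acme-challenge/Scy7Eh4E8LvN6yM1rT3y4qcCYKfEVZ6MHJdQNqKJN7M"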
It doesn't look like a DNS issue in the network, because executing nslookup from the node or from a container inside the cluster works fine:
$ nslookup argocd.mydomain.com
Server: 127.0.0.53
Address: 127.0.0.53#53
Non-authoritative answer:
argocd.mydomain.com canonical name = mydomain.com.
Name: mydomain.com
Address: xx.xx.xx.xx
Executing the curl request from the node, outside any container, using the node's IP on the network, the response is OK:
curl http://192.168.1.1/.well-known/acme-challenge/Scy7Eh4E8LvN6yM1rT3y4qcCYKfEVZ6MHJdQNqKJN7M -H "Host: argocd.mydomain.com"
But replacing the local IP address with the FQDN, I get the timeout.
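curl's --resolve option gives the same comparison without touching DNS, forcing the hostname to resolve to the node IP:
# works: hostname forced to resolve to the node's LAN IP
curl -v --resolve argocd.mydomain.com:80:192.168.1.1 \
  "http://argocd.mydomain.com/.well-known/acme-challenge/Scy7Eh4E8LvN6yM1rT3y4qcCYKfEVZ6MHJdQNqKJN7M"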
This is the specification of the Ingress Nginx controller service:
spec:
  allocateLoadBalancerNodePorts: true
  clusterIP: 10.96.108.180
  clusterIPs:
  - 10.96.108.180
  externalTrafficPolicy: Local
  healthCheckNodePort: 30708
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - appProtocol: http
    name: http
    nodePort: 31629
    port: 80
    protocol: TCP
    targetPort: http
  - appProtocol: https
    name: https
    nodePort: 30634
    port: 443
    protocol: TCP
    targetPort: https
The cm-acme-http-solver pod is created in the same namespace as the Ingress, with log entries like:
I1113 18:50:06.845744 1 solver.go:87] cert-manager/acmesolver: "got successful challenge request, writing key" host="mydomain.com" path="/.well-known/acme-challenge/1yJeWYLrQ_EP5MjWH_0Ztq8NodV81kaFkPHQ6Kz41CM" base_path="/.well-known/acme-challenge" token="1yJeWYLrQ_EP5MjWH_0Ztq8NodV81kaFkPHQ6Kz41CM"
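The solver resources created for the challenge can be listed via the label cert-manager puts on them, something like:
$ kubectl get pods,services,ingresses -n develop -l acme.cert-manager.io/http01-solver=true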
I have probably ruled out unexpected behavior in the CNI: after replacing weave-net with Calico, the behavior is the same. It also doesn't look like a DNS problem, because the DNS entry can be resolved both inside and outside the cluster. If I execute the curl command against the public IP, it fails:
curl -v -D- http://public.ip/.well-known/acme-challenge/1yJeWYLrQ_EP5MjWH_0Ztq8NodV81kaFkPHQ6Kz41CM
But replacing the public IP with the private IP of the LoadBalancer service on the network, it works fine.
Can anyone guide or help me to fix this problem?
The fact that the cm-acme-http-solver pod is being created and is logging successful challenge requests is a good sign. It indicates that cert-manager is functioning and able to respond to ACME challenge requests. However, the issue seems to be in the routing of these challenge requests when they come from outside the cluster.
The flow of traffic from the internet to your Kubernetes cluster looks like this: the MetalLB LoadBalancer routes external traffic to the correct services inside the cluster, and the Nginx Ingress Controller then routes that traffic to the appropriate Ingress resources, in this case the ArgoCD service.
The cm-acme-http-solver pod, which handles the ACME challenge requests, is logging successful challenge requests, indicating it is functioning correctly within the cluster. However, there is an issue with the routing of external requests to the cm-acme-http-solver, particularly when using the public IP. The successful curl test with the private IP of the LoadBalancer service, but failure with the public IP, suggests a possible issue with how external traffic is being routed to the cluster. That could be a configuration issue with MetalLB or with the way your network routes traffic to the cluster.
Since MetalLB is used as the LoadBalancer, make sure it is correctly configured to handle both public and private IP addresses. MetalLB should route incoming traffic on the public IP to the appropriate services within the cluster.
Check if the Nginx Ingress Controller is properly configured to handle traffic coming from the LoadBalancer. That includes ensuring that the SSL passthrough is working as intended.
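Note that the nginx.ingress.kubernetes.io/ssl-passthrough annotation only takes effect if the controller itself was started with the --enable-ssl-passthrough flag. Assuming the default deployment name from the ingress-nginx manifests, the controller arguments can be checked with:
$ kubectl -n ingress-nginx get deployment ingress-nginx-controller \
    -o jsonpath='{.spec.template.spec.containers[0].args}'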
Also double-check whether there are any network restrictions (like firewall rules or network security group settings) that might be blocking incoming traffic on the public IP.
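On a bare-metal node, a quick sketch of those checks (assuming a ufw/iptables based host; adjust for your distro) is:
# host firewall rules that could affect ports 80/443
$ sudo ufw status verbose
$ sudo iptables -S | grep -E '(80|443)'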
You may need to trace the network packets to see where the routing fails; tools like tcpdump or traceroute should help. Also, review the cert-manager challenges and logs in more detail to confirm that the challenges are indeed reaching the cluster and being responded to correctly.
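A minimal packet-level check on the node, plus the cert-manager controller logs (assuming the default cert-manager deployment name), could look like:
# watch whether the HTTP-01 self-check traffic ever reaches port 80 on the node
$ sudo tcpdump -i any -nn 'tcp port 80'

# cert-manager controller logs showing the self-check attempts
$ kubectl logs -n cert-manager deploy/cert-manager --tail=100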
With the shift to HAProxy, make sure SSL passthrough is configured correctly. HAProxy handles SSL passthrough differently from Nginx, and it is crucial that the SSL traffic is forwarded to the backend without being terminated at the Ingress level. Examine the HAProxy configuration to confirm it is correctly set up to handle both HTTP and HTTPS traffic, and check its logs for any errors or warnings.
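For SSL passthrough, the relevant HAProxy frontend and backend need to run in mode tcp rather than mode http, so the TLS stream reaches the cluster unterminated. Assuming a standard package install, the configuration can be validated and the listeners confirmed with:
# syntax-check the HAProxy configuration
$ sudo haproxy -c -f /etc/haproxy/haproxy.cfg

# confirm HAProxy is actually bound to ports 80 and 443
$ sudo ss -tlnp | grep haproxy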
Calico provides powerful network policy enforcement. Verify that there are no policies inadvertently blocking or misrouting HTTP traffic. Do check both ingress and egress policies.
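A quick way to rule that out is to list both the standard Kubernetes policies and, if Calico was installed with the Kubernetes datastore, its own policy CRDs:
$ kubectl get networkpolicies -A
$ kubectl get networkpolicies.crd.projectcalico.org -A
$ kubectl get globalnetworkpolicies.crd.projectcalico.org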