rancher rke up errors on etcd host health checks remote error: tls: bad certificate


rke --debug up --config cluster.yml

fails the health checks on the etcd hosts with the error:

DEBU[0281] [etcd] failed to check health for etcd host [x.x.x.x]: failed to get /health for host [x.x.x.x]: Get "https://x.x.x.x:2379/health": remote error: tls: bad certificate

Checking the etcd health checks:

for endpoint in $(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f5"); do
  echo "Validating connection to ${endpoint}/health";
  curl -w "\n" --cacert $(docker exec etcd printenv ETCDCTL_CACERT) --cert $(docker exec etcd printenv ETCDCTL_CERT) --key $(docker exec etcd printenv ETCDCTL_KEY) "${endpoint}/health";
done

Running this on that master node gives:
Validating connection to https://x.x.x.x:2379/health
{"health":"true"}
Validating connection to https://x.x.x.x:2379/health
{"health":"true"}
Validating connection to https://x.x.x.x:2379/health
{"health":"true"}
Validating connection to https://x.x.x.x:2379/health
{"health":"true"}
You can also run the check manually and see if etcd responds correctly:
curl -w "\n" --cacert /etc/kubernetes/ssl/kube-ca.pem --cert /etc/kubernetes/ssl/kube-etcd-x-x-x-x.pem --key /etc/kubernetes/ssl/kube-etcd-x-x-x-x-key.pem https://x.x.x.x:2379/health

Checking my self-signed certificate hashes:

# md5sum /etc/kubernetes/ssl/kube-ca.pem
f5b358e771f8ae8495c703d09578eb3b  /etc/kubernetes/ssl/kube-ca.pem

# for key in $(cat /home/kube/cluster.rkestate | jq -r '.desiredState.certificatesBundle | keys[]'); do echo $(cat /home/kube/cluster.rkestate | jq -r --arg key $key '.desiredState.certificatesBundle[$key].certificatePEM' | sed '$ d' | md5sum) $key; done | grep kube-ca
f5b358e771f8ae8495c703d09578eb3b - kube-ca
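
Since the bad-certificate error comes from the etcd serving certificates rather than the CA, the same comparison can be done for the per-node etcd certificate (a sketch reusing the commands above; kube-etcd-x-x-x-x stands for the node's certificate file as in the curl example):

# md5sum /etc/kubernetes/ssl/kube-etcd-x-x-x-x.pem
# for key in $(cat /home/kube/cluster.rkestate | jq -r '.desiredState.certificatesBundle | keys[]'); do echo $(cat /home/kube/cluster.rkestate | jq -r --arg key $key '.desiredState.certificatesBundle[$key].certificatePEM' | sed '$ d' | md5sum) $key; done | grep kube-etcd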
Versions on my master node:
Debian GNU/Linux 10
rke version v1.3.1
Docker version 20.10.8
kubectl v1.21.5
Kubernetes v1.21.5-rancher1-1

I think my cluster.rkestate has gone bad. Are there any other locations where the rke tool checks for certificates? Currently I cannot do anything with this production cluster, and I want to avoid downtime. I experimented with different scenarios on a testing cluster; as a last resort I could recreate the cluster from scratch with rke remove && rke up, but maybe I can still fix it...
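
One more place I would check is the copy of the state that RKE keeps inside the cluster itself; if I understand correctly it lives in a full-cluster-state configmap in the kube-system namespace (the configmap name, its data key, and the state-from-cluster.json filename below are assumptions). A sketch for comparing its kube-ca hash against the local file, assuming kube_config_cluster.yml still works:

kubectl --kubeconfig kube_config_cluster.yml -n kube-system get configmap full-cluster-state -o jsonpath='{.data.full-cluster-state}' > state-from-cluster.json
jq -r '.desiredState.certificatesBundle["kube-ca"].certificatePEM' state-from-cluster.json | sed '$ d' | md5sum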


There are 2 answers

arnittocrab

rke util get-state-file helped me to reconstruct the bad cluster.rkestate file, and I was able to successfully run rke up and add a new master node to fix the whole situation.
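
A minimal sketch of that recovery flow (assuming cluster.yml sits in the working directory; --config is the same flag used with rke up above):

rke util get-state-file --config cluster.yml
rke up --config cluster.yml

The first command rebuilds cluster.rkestate from the state stored in the cluster; the second reconciles the cluster, including the new master node declared in cluster.yml.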

Mostafa Ghadimi

The problem can be solved with the following steps (sketched as commands after the list):

  1. Remove the kube_config_cluster.yml file from the directory where you run the rke up command (since some data is missing from your K8s nodes).

  2. Remove the cluster.rkestate file.

  3. Re-run the rke up command.
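
As shell commands, the steps above would look roughly like this (a sketch; run it from the directory that contains cluster.yml, and keep backups of both files first, since cluster.rkestate holds the certificate bundle):

rm kube_config_cluster.yml
rm cluster.rkestate
rke up --config cluster.yml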