Starting this Monday, after an upgrade to 1.17.2 from within Rancher, each new node (DigitalOcean droplets, all running Ubuntu 18.04.3) gets its InternalIP incorrectly set to its ExternalIP, i.e. the public one! This is in turn my primary suspect for the lack of intra-cluster DNS resolution we've been experiencing since Monday: I've just found that the unresponsive services were running on the new node where InternalIP=ExternalIP.
kubectl describe node exacto-devel-mail-01
...
Addresses:
  InternalIP:  37.139.20.177
  Hostname:    exacto-devel-mail-01
An "old" node (created before the 1.17.2 upgrade, so presumably while we were running 1.16.6):
kubectl describe node exacto-devel-06
...
Addresses:
  InternalIP:  10.129.254.119
  Hostname:    exacto-devel-06
I've tried editing the node object and assigning the correct InternalIP value, but nothing happened: it just keeps showing the wrong address!
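For context on why the edit doesn't stick: node addresses are reported by the kubelet on every status update, so any value written into the Node object gets overwritten. This is a sketch of how I checked what the kubelet is advertising on an affected node (in Rancher/RKE clusters the kubelet runs as a Docker container named "kubelet"; the eth1 interface for DigitalOcean private networking is an assumption based on our droplets):

```shell
# On the affected node: check whether the kubelet was started with a
# --node-ip flag (if absent, it picks an address itself).
docker inspect kubelet | grep -- --node-ip

# Compare with the droplet's private address (DigitalOcean private
# networking is usually on eth1 -- assumption, verify on your droplets).
ip -4 addr show eth1
```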
This failure to resolve cluster-DNS names for containers running on the "bad" nodes with broken InternalIPs also appeared on another cluster after upgrading it to v1.16.6. So I can say the issue affects at least Kubernetes 1.16.6 and 1.17.2 in Rancher-managed k8s clusters.
To further clarify the issue, here's the current list of nodes in my development environment cluster:
kubectl get nodes -o wide
NAME                  STATUS   ROLES               AGE    VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
exacto-devel-01       Ready    controlplane,etcd   11d    v1.17.2   37.139.0.68      <none>        RancherOS v1.5.5     4.14.138-rancher    docker://19.3.5
exacto-devel-02       Ready    controlplane,etcd   11d    v1.17.2   37.139.0.231     <none>        RancherOS v1.5.5     4.14.138-rancher    docker://19.3.5
exacto-devel-03       Ready    controlplane,etcd   11d    v1.17.2   37.139.4.139     <none>        RancherOS v1.5.5     4.14.138-rancher    docker://19.3.5
exacto-devel-04       Ready    worker              11d    v1.17.2   10.129.254.158   <none>        Ubuntu 18.04.3 LTS   4.15.0-74-generic   docker://19.3.5
exacto-devel-05       Ready    worker              11d    v1.17.2   10.129.254.200   <none>        Ubuntu 18.04.3 LTS   4.15.0-74-generic   docker://19.3.5
exacto-devel-06       Ready    worker              10d    v1.17.2   10.129.254.119   <none>        Ubuntu 18.04.3 LTS   4.15.0-74-generic   docker://19.3.2
exacto-devel-elk-01   Ready    worker              25h    v1.17.2   185.14.186.204   <none>        Ubuntu 18.04.4 LTS   4.15.0-76-generic   docker://19.3.5
exacto-devel-elk-02   Ready    worker              7h8m   v1.17.2   198.211.118.87   <none>        Ubuntu 18.04.4 LTS   4.15.0-76-generic   docker://19.3.5
As you can see, nodes 01, 02, 03 and elk-01, elk-02 are all affected: under the INTERNAL-IP column header there's a clearly identifiable public IP! While it doesn't seem to matter on nodes 01, 02 and 03, since they carry the etcd and controlplane roles, it is actually blocking the expansion of the cluster with new functionality (ELK in this example), since any workload deployed on those new nodes faces intra-cluster DNS resolution issues.
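The DNS symptom can be reproduced by scheduling a throwaway pod onto an affected node and resolving an in-cluster name (the node name is one of mine; the pod name and image tag are just illustrative):

```shell
# Run a temporary busybox pod pinned to an affected node and try to
# resolve the API service name via cluster DNS.
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.31 \
  --overrides='{"spec":{"nodeName":"exacto-devel-elk-01"}}' \
  -- nslookup kubernetes.default.svc.cluster.local
# On the "bad" nodes this times out; on the old nodes it resolves.
```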
Please advise on what to do next! Thank you
When you create a custom cluster from Rancher, you can specify the IP addresses on the Rancher agent registration command, and the agent will register the node with those addresses.
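Concretely, the node registration command that Rancher generates for a custom cluster accepts `--address` and `--internal-address` flags; appending the droplet's private IP makes the node register with the correct InternalIP. A sketch, with the server URL, token, checksum, agent version and IPs as placeholders for your cluster's values:

```shell
# Rancher agent registration with explicit public and private addresses.
# <TOKEN>, <CA_CHECKSUM>, the URL, the agent tag and the IPs are placeholders.
sudo docker run -d --privileged --restart=unless-stopped --net=host \
  -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run \
  rancher/rancher-agent:v2.3.5 \
  --server https://rancher.example.com \
  --token <TOKEN> --ca-checksum <CA_CHECKSUM> \
  --worker \
  --address 198.211.118.87 \
  --internal-address 10.129.254.x
```

With `--internal-address` set, the kubelet advertises that address as the node's InternalIP, so already-registered nodes with the wrong address have to be removed and re-registered with the corrected command.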