On 1.17.2, each new node has InternalIP equal to ExternalIP


Starting this Monday, after an upgrade to 1.17.2 from within Rancher, each new node (DigitalOcean droplets, all with Ubuntu 18.04.3) gets its InternalIP incorrectly set equal to its ExternalIP, the public one! This is in turn my primary suspect for the lack of intra-cluster DNS resolution we've been experiencing since Monday, as I've just found that the unresponsive services were running on a new node with InternalIP=ExternalIP.

kubectl describe node exacto-devel-mail-01
...
Addresses:
InternalIP:  37.139.20.177
Hostname:    exacto-devel-mail-01

An "old" node (pre-1.17.2 upgrade, so presumably we were running on 1.16.6):

kubectl describe node exacto-devel-06
...
Addresses:
InternalIP:  10.129.254.119
Hostname:    exacto-devel-06

I've tried to edit the node, assigning the correct InternalIP value, but nothing happened: it just keeps showing the wrong address!
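If I understand the mechanism correctly, .status.addresses is owned by the kubelet and re-reported on every status sync, so a manual kubectl edit gets overwritten almost immediately. On an RKE-provisioned node the kubelet runs as a Docker container named "kubelet", so one way to check which address it was told to use (a sketch; assumes that container name) is:

docker inspect kubelet | grep -- --node-ip

If --node-ip is absent or set to the public address, the kubelet will keep re-reporting that address no matter how the Node object is edited.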

This failure to resolve cluster-DNS names of containers running on the "bad" nodes with broken InternalIPs also appeared on another cluster after upgrading it to v1.16.6. So I can say the issue affects Kubernetes 1.16.6 and 1.17.2, at least in Rancher-managed k8s clusters.

To further clarify the issue, here's the current list of nodes in my development environment cluster:

kubectl get nodes -o wide
NAME                  STATUS   ROLES               AGE    VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
exacto-devel-01       Ready    controlplane,etcd   11d    v1.17.2   37.139.0.68      <none>        RancherOS v1.5.5     4.14.138-rancher    docker://19.3.5
exacto-devel-02       Ready    controlplane,etcd   11d    v1.17.2   37.139.0.231     <none>        RancherOS v1.5.5     4.14.138-rancher    docker://19.3.5
exacto-devel-03       Ready    controlplane,etcd   11d    v1.17.2   37.139.4.139     <none>        RancherOS v1.5.5     4.14.138-rancher    docker://19.3.5
exacto-devel-04       Ready    worker              11d    v1.17.2   10.129.254.158   <none>        Ubuntu 18.04.3 LTS   4.15.0-74-generic   docker://19.3.5
exacto-devel-05       Ready    worker              11d    v1.17.2   10.129.254.200   <none>        Ubuntu 18.04.3 LTS   4.15.0-74-generic   docker://19.3.5
exacto-devel-06       Ready    worker              10d    v1.17.2   10.129.254.119   <none>        Ubuntu 18.04.3 LTS   4.15.0-74-generic   docker://19.3.2
exacto-devel-elk-01   Ready    worker              25h    v1.17.2   185.14.186.204   <none>        Ubuntu 18.04.4 LTS   4.15.0-76-generic   docker://19.3.5
exacto-devel-elk-02   Ready    worker              7h8m   v1.17.2   198.211.118.87   <none>        Ubuntu 18.04.4 LTS   4.15.0-76-generic   docker://19.3.5

As you can see, 01, 02, 03 and elk-01, elk-02 are all affected by this issue: the INTERNAL-IP column clearly shows a public IP. While it doesn't seem to matter on nodes 01, 02 and 03, since they carry the etcd and controlplane roles, it is actually blocking the expansion of the cluster with new functionality (ELK in this example), since any workload deployed on those new nodes faces intra-cluster DNS resolution issues.
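A quick way to spot the affected nodes is to print just each node's name and its kubelet-reported InternalIP (a sketch using kubectl's JSONPath output):

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}'

Every address outside the droplets' private network (10.129.x.x here) belongs to a node that registered with its public IP.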

Please advise on what to do next. Thank you!

1 Answer

Answered by Arghya Sadhu (accepted):

When you create a custom cluster from Rancher, you can specify the IP addresses to pass to the Rancher agent, and the agent will register the node with those addresses.
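For a custom cluster, this means extending the docker run registration command that Rancher displays with the --address and --internal-address flags (a sketch with placeholder values; match the agent image tag to your Rancher version):

sudo docker run -d --privileged --restart=unless-stopped --net=host \
  -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run \
  rancher/rancher-agent:v2.3.5 \
  --server https://<your-rancher-server> \
  --token <registration-token> \
  --ca-checksum <ca-checksum> \
  --worker \
  --address <droplet-public-ip> \
  --internal-address <droplet-private-ip>

Since the addresses are fixed at registration time, nodes that already registered with the wrong InternalIP most likely have to be removed from the cluster and re-registered with these flags.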