Errors in Kubernetes cluster setup (kube-scheduler and kube-controller-manager restart continuously, roughly every 10 minutes)

I installed a Kubernetes (v1.26) cluster on Ubuntu 22.04.

  • I am able to launch nginx on the master node and curl it, but hello-world-deployment doesn't work; its pods stay Pending forever
  • The apiserver, controller-manager, and scheduler appear to have trouble talking to each other; every 10 minutes or so kube-scheduler and kube-controller-manager restart
  • I am unable to query etcd using etcdctl (output below)

I'd appreciate any possible fixes for these problems. Let me know if any additional information from the cluster is needed.

Thanks!

$ sudo etcdctl --endpoints=https://127.0.0.1:2379 --cert-file=/etc/kubernetes/pki/etcd/server.crt --key-file=/etc/kubernetes/pki/etcd/server.key --ca-file=/etc/kubernetes/pki/etcd/ca.crt cluster-health
cluster may be unhealthy: failed to list members
Error:  unexpected status code 404
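
The --cert-file/--key-file/--ca-file flags and the cluster-health subcommand belong to etcdctl's v2 API, which recent etcd builds no longer serve, so the 404 may simply be that. The v3-style equivalent, assuming the same kubeadm certificate paths, would look roughly like:

$ sudo ETCDCTL_API=3 etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    endpoint health
$ sudo ETCDCTL_API=3 etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    member list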

netstat output for the etcd ports (2379/2380):

tcp        0      0 127.0.0.1:2379          0.0.0.0:*               LISTEN      -
tcp        0      0 192.168.1.108:2379      0.0.0.0:*               LISTEN      -
tcp        0      0 192.168.1.108:2380      0.0.0.0:*               LISTEN      -
tcp        0      0 127.0.0.1:54950         127.0.0.1:2379          ESTABLISHED -
tcp        0      0 127.0.0.1:2379          127.0.0.1:55226         ESTABLISHED -
tcp        0      0 127.0.0.1:2379          127.0.0.1:54952         ESTABLISHED -
tcp        0      0 127.0.0.1:55272         127.0.0.1:2379          ESTABLISHED -
tcp        0      0 127.0.0.1:55068         127.0.0.1:2379          ESTABLISHED -
tcp        0      0 127.0.0.1:2379          127.0.0.1:54996         ESTABLISHED -
tcp        0      0 127.0.0.1:2379          127.0.0.1:54872         ESTABLISHED -
tcp        0      0 127.0.0.1:2379          127.0.0.1:54898         ESTABLISHED -
tcp        0      0 127.0.0.1:2379          127.0.0.1:54750         ESTABLISHED -
tcp        0      0 192.168.1.108:54406     192.168.1.108:2379      ESTABLISHED -

$ kubelet --version
Kubernetes v1.26.0

$ kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.0", GitCommit:"b46a3f887ca979b1a5d14fd39cb1af43e7e5d12d", GitTreeState:"clean", BuildDate:"2022-12-08T19:57:06Z", GoVersion:"go1.19.4", Compiler:"gc", Platform:"linux/amd64"}

$ k get node -o wide
NAME                         STATUS   ROLES           AGE   VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
master-k8s.sadhanapath.com   Ready    control-plane   29h   v1.26.0   192.168.1.108   <none>        Ubuntu 22.04.1 LTS   5.15.0-58-generic   cri-o://1.25.2

But I am unable to figure out why these errors occur or how to fix them. I'd appreciate some insight into the problem and possible solutions.

$ k get pod -A
NAMESPACE      NAME                                                 READY   STATUS    RESTARTS         AGE
default        hello-world-deployment-8679c476ff-jrc5j              0/1     Pending   0                96m
default        hello-world-deployment-8679c476ff-snftx              0/1     Pending   0                96m
default        my-pod                                               1/1     Running   0                130m
kube-flannel   kube-flannel-ds-rv72q                                1/1     Running   0                29h
kube-system    coredns-787d4945fb-l5g9l                             1/1     Running   0                29h
kube-system    coredns-787d4945fb-zvblc                             1/1     Running   0                29h
kube-system    etcd-master-k8s.sadhanapath.com                      1/1     Running   7                29h
kube-system    kube-apiserver-master-k8s.sadhanapath.com            1/1     Running   2                29h
kube-system    kube-controller-manager-master-k8s.sadhanapath.com   1/1     Running   280 (115s ago)   29h
kube-system    kube-proxy-kd2tl                                     1/1     Running   0                29h
kube-system    kube-scheduler-master-k8s.sadhanapath.com            1/1     Running   360 (112s ago)   29h
kube-system    metrics-server-6bf7778f96-xfrq4                      0/1     Pending   0                96m
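
To dig into why the hello-world (and metrics-server) pods stay Pending, I can pull the pod events and any taints on the node like this; happy to post the output if needed:

$ kubectl describe pod hello-world-deployment-8679c476ff-jrc5j | tail -n 20
$ kubectl describe node master-k8s.sadhanapath.com | grep -A2 -i taint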

my-pod (nginx)
~~~~~~~~~~~~~~~~
$ curl http://192.168.3.96
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
html { color-scheme: light dark; }
body { width: 35em; margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif; }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>

Errors I noticed:


kube-apiserver log:
~~~~~~~~~~~~~~~~~~~~
I0117 16:08:30.322251       1 shared_informer.go:280] Caches are synced for garbage collector
I0117 16:08:30.376084       1 shared_informer.go:280] Caches are synced for garbage collector
I0117 16:08:30.376107       1 garbagecollector.go:163] Garbage collector: all resource monitors have synced. Proceeding to collect garbage
E0117 16:14:32.331089       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get "https://192.168.1.108:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0117 16:14:37.330451       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get "https://192.168.1.108:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": context deadline exceeded
I0117 16:14:37.330505       1 leaderelection.go:283] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
E0117 16:14:37.330543       1 controllermanager.go:294] "leaderelection lost"
...
I0116 10:51:15.888114       1 alloc.go:327] "allocated clusterIPs" service="kube-system/kube-dns" clusterIPs=map[IPv4:192.168.4.10]
I0116 10:51:15.919529       1 controller.go:615] quota admission added evaluator for: daemonsets.apps
I0116 10:51:26.347486       1 controller.go:615] quota admission added evaluator for: replicasets.apps
I0116 10:51:27.150749       1 controller.go:615] quota admission added evaluator for: controllerrevisions.apps
E0116 10:55:10.060857       1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, context canceled]"
E0116 10:55:10.060935       1 writers.go:122] apiserver was unable to write a JSON response: http: Handler timeout
E0116 10:55:10.060965       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
E0116 10:55:10.062138       1 writers.go:135] apiserver was unable to write a fallback JSON response: http: Handler timeout
E0116 10:55:10.063299       1 timeout.go:142] post-timeout activity - time-elapsed: 2.437285ms, GET "/api" result: <nil>
{"level":"warn","ts":"2023-01-16T10:55:10.313Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0004d8540/127.0.0.1:2379","attempt":0,"error":"rpc error: code = Canceled desc = context canceled"}
E0116 10:55:10.313501       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}: context canceled
E0116 10:55:10.313566       1 writers.go:122] apiserver was unable to write a JSON response: http: Handler timeout
{"level":"warn","ts":"2023-01-16T10:55:10.313Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0004d8540/127.0.0.1:2379","attempt":0,"error":"rpc error: code = Canceled desc = context canceled"}
E0116 10:55:10.313616       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}: context canceled
E0116 10:55:10.314562       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
E0116 10:55:10.316213       1 writers.go:135] apiserver was unable to write a fallback JSON response: http: Handler timeout
{"level":"warn","ts":"2023-01-16T10:55:10.317Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0004d8540/127.0.0.1:2379","attempt":0,"error":"rpc error: code = Canceled desc = context canceled"}
E0116 10:55:10.317381       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}: context canceled
{"level":"warn","ts":"2023-01-16T10:55:10.317Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0004d8540/127.0.0.1:2379","attempt":0,"error":"rpc error: code = Canceled desc = context canceled"}
E0116 10:55:10.317708       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}: context canceled
E0116 10:55:10.317970       1 timeout.go:142] post-timeout activity - time-elapsed: 5.19228ms, GET "/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler" result: <nil>
E0116 10:55:10.318756       1 writers.go:122] apiserver was unable to write a JSON response: http: Handler timeout
E0116 10:55:10.319894       1 writers.go:122] apiserver was unable to write a JSON response: http: Handler timeout


etcd log:
~~~~~~~~~
{"level":"info","ts":"2023-01-17T16:16:26.767Z","caller":"mvcc/hash.go:137","msg":"storing new hash","hash":181612138,"revision":119833,"compact-revision":119531}
WARNING: 2023/01/17 16:19:23 [core] grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
{"level":"info","ts":"2023-01-17T16:19:52.743Z","caller":"traceutil/trace.go:171","msg":"trace[2082787816] transaction","detail":"{read_only:false; response_revision:120259; number_of_response:1; }","duration":"112.693292ms","start":"2023-01-17T16:19:52.630Z","end":"2023-01-17T16:19:52.743Z","steps":["trace[2082787816] 'process raft request'  (duration: 112.557096ms)"],"step_count":1}
{"level":"info","ts":"2023-01-17T16:19:54.134Z","caller":"traceutil/trace.go:171","msg":"trace[708529753] transaction","detail":"{read_only:false; response_revision:120260; number_of_response:1; }","duration":"118.409797ms","start":"2023-01-17T16:19:54.016Z","end":"2023-01-17T16:19:54.134Z","steps":["trace[708529753] 'process raft request'  (duration: 118.295288ms)"],"step_count":1}
{"level":"info","ts":"2023-01-
17T16:19:56.267Z","caller":"traceutil/trace.go:171","msg":"trace[539414652] transaction","detail":"{read_only:false; response_revision:120261; number_of_response:1; }","duration":"126.094165ms","start":"2023-
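
The 110-125 ms "process raft request" durations above look slow, so etcd disk/CPU latency may be part of the problem. If useful, I can grep for slow-apply warnings and run etcdctl's built-in performance check (a sketch, using crictl since the runtime is cri-o; I believe check perf puts load on etcd for about a minute):

$ sudo crictl logs $(sudo crictl ps -q --name etcd) 2>&1 | grep -iE "took too long|slow" | tail
$ sudo ETCDCTL_API=3 etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    check perf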

kube-controller-manager:
~~~~~~~~~~~~~~~~~~~~~~~~~
I0117 16:25:29.040164       1 shared_informer.go:280] Caches are synced for HPA
I0117 16:25:29.377578       1 shared_informer.go:280] Caches are synced for garbage collector
I0117 16:25:29.379762       1 shared_informer.go:280] Caches are synced for garbage collector
I0117 16:25:29.379789       1 garbagecollector.go:163] Garbage collector: all resource monitors have synced. Proceeding to collect garbage
E0117 16:26:30.231228       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get "https://192.168.1.108:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0117 16:26:35.231206       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get "https://192.168.1.108:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": context deadline exceeded
I0117 16:26:35.231277       1 leaderelection.go:283] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
E0117 16:26:35.231330       1 controllermanager.go:294] "leaderelection lost"


kube-scheduler:
~~~~~~~~~~~~~~~
I0117 16:28:09.110528       1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0117 16:28:09.110542       1 shared_informer.go:273] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0117 16:28:09.211149       1 shared_informer.go:280] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0117 16:28:09.211151       1 shared_informer.go:280] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0117 16:28:09.211261       1 leaderelection.go:248] attempting to acquire leader lease kube-system/kube-scheduler...
I0117 16:28:09.211927       1 shared_informer.go:280] Caches are synced for RequestHeaderAuthRequestController
I0117 16:28:09.223818       1 leaderelection.go:258] successfully acquired lease kube-system/kube-scheduler
E0117 16:28:56.592893       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-scheduler: Get "https://192.168.1.108:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0117 16:29:01.592295       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-scheduler: Get "https://192.168.1.108:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": context deadline exceeded
I0117 16:29:01.592378       1 leaderelection.go:283] failed to renew lease kube-system/kube-scheduler: timed out waiting for the condition
E0117 16:29:02.745104       1 server.go:224] "Leaderelection lost"
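
Both logs show the same pattern: the lease-renewal GET against the apiserver times out, leader election is lost, and the component exits, so kubelet restarts it; that is presumably what drives the restart counts up. If it helps, these are checks I can run to see whether the apiserver itself or its etcd backend is the slow link:

$ kubectl get --raw='/readyz?verbose'
$ kubectl get --raw='/livez?verbose'
$ kubectl -n kube-system get lease kube-scheduler kube-controller-manager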

I have ufw enabled, with ports 6443, 2379-2380, 10250-10255, and 30000-32767 opened.
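
The rules look roughly like this (a sketch, assuming everything is TCP):

$ sudo ufw allow 6443/tcp
$ sudo ufw allow 2379:2380/tcp
$ sudo ufw allow 10250:10255/tcp
$ sudo ufw allow 30000:32767/tcp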

I disabled ufw completely, yet the scheduler and controller-manager restart counts keep incrementing, and the apiserver HTTP timeouts and write failures continue.

Yet I am still able to create another nginx instance and curl it:

$ k get pod -o wide
NAME                                      READY   STATUS    RESTARTS   AGE     IP             NODE                         NOMINATED NODE   READINESS GATES
hello-world-deployment-8679c476ff-jrc5j   0/1     Pending   0          162m    <none>         <none>                       <none>           <none>
hello-world-deployment-8679c476ff-snftx   0/1     Pending   0          162m    <none>         <none>                       <none>           <none>
my-pod                                    1/1     Running   0          3h16m   192.168.3.96   master-k8s.sadhanapath.com   <none>           <none>
my-pod2                                   1/1     Running   0          15s     192.168.3.97   master-k8s.sadhanapath.com   <none>           <none>
$ curl http://192.168.3.97
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
html { color-scheme: light dark; }
body { width: 35em; margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif; }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>

I'd appreciate any suggestions on how to solve this problem.

1 Answer (from Seeking.that):

I uninstalled Kubernetes completely as shown here: How to completely uninstall kubernetes
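
The teardown from that post boils down to roughly the following (a sketch for a kubeadm-built cluster; CNI and runtime paths may differ):

$ sudo kubeadm reset -f
$ sudo apt-get purge -y kubeadm kubectl kubelet kubernetes-cni
$ sudo apt-get autoremove -y
$ sudo rm -rf /etc/kubernetes /var/lib/etcd /etc/cni/net.d ~/.kube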

and reinstalled Kubernetes from scratch following https://www.linuxtechi.com/install-kubernetes-on-ubuntu-22-04/

I don't intend to run a multi-node etcd cluster for now (https://learnk8s.io/etcd-kubernetes), just a simple cluster with a couple of nodes, so I suppose I may not need etcdctl.

Thanks