I have microk8s running on two nodes. Recently it got into a state where the master node fails to reach Ready status because the `microk8s.daemon-containerd` service fails to start. This started happening after I tried to get a cert-manager configuration running in the k8s cluster. As far as I can see, the `cert-manager-webhook` pod is running fine on the second node.
I have tried `microk8s stop` / `microk8s start`. At this point I have even tried `microk8s reset`, but containerd always shows the same error.
Outputs:
$ kubectl get node
NAME        STATUS     ROLES    AGE   VERSION
pi-k8s-00   NotReady   <none>   77d   v1.18.6-1+b4f4cb0b7fe3c1
pi-k8s-01   Ready      <none>   77d   v1.19.2-34+37bbd8cebecb60
$ kubectl get pod -n cert-manager
NAME                                       READY   STATUS    RESTARTS   AGE
cert-manager-676b755d5f-6bjxv              1/1     Running   0          12m
cert-manager-cainjector-795f67b984-tsmw9   1/1     Running   3          12m
cert-manager-webhook-86c4dcd4b5-bgrmb      1/1     Running   0          12m
$ sudo journalctl -u snap.microk8s.daemon-containerd
...
Oct 17 10:42:33 pi-k8s-00 microk8s.daemon-containerd[44363]: time="2020-10-17T10:42:33.848409047Z" level=fatal msg="Failed to run CRI service" error="failed to recover state: failed to reserve sandbox name \"cert-manager-webhook>
Oct 17 10:42:33 pi-k8s-00 systemd[1]: snap.microk8s.daemon-containerd.service: Main process exited, code=exited, status=1/FAILURE
Oct 17 10:42:33 pi-k8s-00 systemd[1]: snap.microk8s.daemon-containerd.service: Failed with result 'exit-code'.
Oct 17 10:42:34 pi-k8s-00 systemd[1]: snap.microk8s.daemon-containerd.service: Scheduled restart job, restart counter is at 5.
Oct 17 10:42:34 pi-k8s-00 systemd[1]: Stopped Service for snap application microk8s.daemon-containerd.
Oct 17 10:42:34 pi-k8s-00 systemd[1]: snap.microk8s.daemon-containerd.service: Start request repeated too quickly.
Oct 17 10:42:34 pi-k8s-00 systemd[1]: snap.microk8s.daemon-containerd.service: Failed with result 'exit-code'.
Oct 17 10:42:34 pi-k8s-00 systemd[1]: Failed to start Service for snap application microk8s.daemon-containerd.
$ uname -a
Linux pi-k8s-00 5.4.0-1021-raspi #24-Ubuntu SMP PREEMPT Mon Oct 5 09:59:23 UTC 2020 aarch64 aarch64 aarch64 GNU/Linux
How can I get the master node back in a good running/ready state?
--- UPDATE ---
Output:
$ less /var/snap/microk8s/current/inspection-report/snap.microk8s.daemon-containerd/journal.log
Oct 18 14:48:03 pi-k8s-00 microk8s.daemon-containerd[239043]: time="2020-10-18T14:48:03.936439781Z" level=fatal msg="Failed to run CRI service" error="failed to recover state: failed to reserve sandbox name \"cert-manager-webhook-64b9b4fdfd-9d6tm_cert-manager_81fb08ac-7e87-42bd-9123-b0b8b098fe50_3\": name \"cert-manager-webhook-64b9b4fdfd-9d6tm_cert-manager_81fb08ac-7e87-42bd-9123-b0b8b098fe50_3\" is reserved for \"149b0aa92e3eb042f87353ead44a7247e756c8071f804bfbec3b781a5565e52c\""
This last log shows that the sandbox name is reserved for a given id.
Which id would that be? And where do I go, and what do I do, to free things up?
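For what it's worth, the 64-character hex string at the end of the error is a containerd container ID, which the sandbox name is still pinned to. A quick way to pull it out of a log line (a sketch using standard grep, with the log line inlined as a variable for illustration):

```shell
# The fatal log line, inlined here for illustration.
log='failed to reserve sandbox name "cert-manager-webhook-64b9b4fdfd-9d6tm_cert-manager_81fb08ac-7e87-42bd-9123-b0b8b098fe50_3": name is reserved for "149b0aa92e3eb042f87353ead44a7247e756c8071f804bfbec3b781a5565e52c"'

# Only the container ID is a run of 64 consecutive hex characters
# (the pod UID contains hyphens, so it does not match).
id=$(printf '%s\n' "$log" | grep -oE '[0-9a-f]{64}' | tail -n1)
echo "$id"
```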
Looking through the comments in "'failed to reserve sandbox name' error after hard reboot" #1014, I tried:
$ sudo ctr -n=k8s.io containers info 149b0aa92e3eb042f87353ead44a7247e756c8071f804bfbec3b781a5565e52c
ctr: container "149b0aa92e3eb042f87353ead44a7247e756c8071f804bfbec3b781a5565e52c" in namespace "k8s.io": not found
But as can be seen from the output, no container with that id exists.
It seems the containerd data had become corrupted, so the way to resolve this issue was to recreate the containerd data. The Kubernetes master node is once again showing status Ready. See my post on the microk8s GitHub issues page, Failed to Reserve Sandbox Name, for more details.
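For reference, the general shape of "recreate the containerd data" is: stop microk8s, move containerd's on-disk state aside so it is rebuilt cleanly on the next start. The exact commands and paths are in the linked GitHub post; the path in the comment below is an assumption. A sketch using a throwaway directory standing in for the real state dir:

```shell
set -eu

# Stand-in for containerd's state directory. The real path is an assumption,
# something like /var/snap/microk8s/common/run/containerd -- check the linked
# GitHub post before touching anything on a live node.
state=$(mktemp -d)/containerd
mkdir -p "$state/io.containerd.grpc.v1.cri/sandboxes"

# Equivalent of: microk8s stop; sudo mv "$state" "$state.bak"; microk8s start
mv "$state" "$state.bak"   # keep the corrupted state around as a backup
mkdir -p "$state"          # containerd repopulates an empty state on restart
```

On restart, containerd no longer finds the stale sandbox reservation, so the "name is reserved for" recovery failure goes away.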