I have a Jenkins deployment pipeline that uses the Kubernetes plugin. Using the plugin, I create a slave pod for building a Node application with yarn. Requests and limits for CPU and memory are set.
When the Jenkins master schedules the slave, sometimes (I haven't seen a pattern so far) the pod makes the entire node unreachable and the node's status changes to Unknown. On careful inspection in Grafana, CPU and memory usage appear to be well within range, with no visible spike. The only spike is in disk I/O, which peaks at ~4 MiB.
I am not sure whether that is why the node can no longer report itself as a cluster member. I need help with a few things here:
a) How can I diagnose in depth why the node leaves the cluster?
b) If the cause is disk IOPS, are there any default requests or limits for IOPS at the Kubernetes level?
PS: I am using EBS (gp2)
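As per the docs, the node's 'Ready' condition is 'True' if the node is healthy and ready to accept pods, 'False' if the node is not healthy and is not accepting pods, and 'Unknown' if the node controller has not heard from the node in the last node-monitor-grace-period (default is 40 seconds).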
It would seem that when you run your workloads, your kube-apiserver doesn't hear from your node (kubelet) within 40 seconds. There could be multiple reasons; here are some things you can try:
To see the 'Events' for your node, run:
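A minimal sketch of that check, assuming `kubectl` is already configured against the cluster. The helper name `node_events` is hypothetical and `<node-name>` is a placeholder for the affected node:

```shell
# Hypothetical helper: show conditions and events for one node.
# Usage: node_events <node-name>
node_events() {
  local node="$1"
  # "describe" prints the Conditions table (Ready, MemoryPressure,
  # DiskPressure, ...) followed by the Events recorded for the node.
  kubectl describe node "$node"
}
```

In the output, the Conditions table and the Events section at the bottom are the parts worth reading around the time the node went 'Unknown'.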
To check for anything unusual on your kube-apiserver, on your active master run:
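One way to do this, as a sketch only: it assumes Docker is the container runtime on the master and that the kubelet named the container with its usual `k8s_<container>_<pod>_...` scheme. The helper name `apiserver_logs` is hypothetical:

```shell
# Hypothetical helper: print recent kube-apiserver logs on a master.
# Assumption: Docker runtime, kubelet-style container names.
apiserver_logs() {
  local id
  # The k8s_ prefix skips the pod's pause container (named k8s_POD_...).
  id="$(docker ps -q --filter "name=k8s_kube-apiserver" | head -n 1)"
  if [ -z "$id" ]; then
    echo "kube-apiserver container not found" >&2
    return 1
  fi
  # Merge stderr into stdout so the output can be piped or paged.
  docker logs --tail 200 "$id" 2>&1
}
```

On a master that uses a different runtime, the same idea applies with that runtime's own tooling.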
To check for anything unusual on your kube-controller-manager when your node goes into the 'Unknown' state, on your active master run:
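A possible sketch, again assuming Docker as the container runtime and kubelet-style container names. The helper name `controller_manager_node_log` is hypothetical. It filters the log down to lines that mention the affected node:

```shell
# Hypothetical helper: show kube-controller-manager log lines that
# mention a given node. Assumption: Docker runtime on the master.
# Usage: controller_manager_node_log <node-name>
controller_manager_node_log() {
  local node="$1" id
  id="$(docker ps -q --filter "name=k8s_kube-controller-manager" | head -n 1)"
  if [ -z "$id" ]; then
    echo "kube-controller-manager container not found" >&2
    return 1
  fi
  # Merge stderr into stdout before filtering, then match the node
  # name as a fixed string rather than a regular expression.
  docker logs --since 30m "$id" 2>&1 | grep -F -- "$node"
}
```

The 30 minute window is an arbitrary choice; widen it if the node went 'Unknown' earlier.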
Increase the `--node-monitor-grace-period` option in your kube-controller-manager. You can add it to the command line in `/etc/kubernetes/manifests/kube-controller-manager.yaml` and restart the kube-controller-manager container.
When the node is in the 'Unknown' state, can you `ssh` into it and see if you can reach the kube-apiserver? Check both the `<master-ip>:6443` and the `kubernetes.default.svc.cluster.local:443` endpoints.
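The reachability check from the node could look roughly like this. It is a sketch that assumes `curl` is installed on the node; the helper name `check_apiserver` is hypothetical, and `/healthz` is used only as a convenient path to request:

```shell
# Hypothetical helper: probe both API server endpoints from the node.
# Usage: check_apiserver <master-ip>
check_apiserver() {
  local master_ip="$1" url code
  for url in \
    "https://${master_ip}:6443/healthz" \
    "https://kubernetes.default.svc.cluster.local:443/healthz"; do
    # -k skips TLS verification because only reachability matters here.
    # -m 5 gives up after 5 seconds. curl reports 000 when it could not
    # connect; any real HTTP status means the endpoint answered.
    code="$(curl -k -sS -m 5 -o /dev/null -w '%{http_code}' "$url")"
    echo "${url} -> ${code}"
  done
}
```

Note that the service name normally resolves only through the cluster DNS, so from the host itself the second probe may fail on name resolution even when the API server is healthy.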