Jenkins slave pods on Kubernetes disappear when there is an influx of running pods

I have a Kubernetes cluster running the Jenkins master in a single pod, with each build running in a separate slave pod. When many builds are running, many pods are spun up and torn down, and I often see an error in a job like this:

Cannot contact slave-jenkins-0g9p0: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@197b6a38:JNLP4-connect connection from 10.10.3.90/10.10.3.90:54418": Remote call on JNLP4-connect connection from 10.10.3.90/10.10.3.90:54418 failed. The channel is closing down or has closed down
Could not connect to slave-jenkins-0g9p0 to send interrupt signal to process

The pod, for example slave-jenkins-0g9p0, just disappears; there is no trace that it ever existed. While watching it with `kubectl describe pod slave-jenkins-0g9p0`, there is no error message; the pod simply stops existing.

I have a feeling that, because there are multiple pods spinning up and down, Kubernetes attempts to balance the load across the nodes and reschedules the pod, but after killing it, it cannot spin the pod up on another node. I cannot be sure, though. Maybe there is a way to tell K8s to tie a pod to a node until it exits on its own? I'm not really sure what/how to debug this case.
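
The closest I have come to debugging it is watching cluster events rather than the pod itself, since the pod disappears too quickly for `kubectl describe` to show anything. A rough sketch (the pod name is just the example from above):

```sh
# Stream events cluster-wide while builds are running, watching for
# eviction or node scale-down messages around the time a slave dies
kubectl get events --all-namespaces --watch

# Or pull the events for one specific pod after the fact
kubectl get events --field-selector involvedObject.name=slave-jenkins-0g9p0
```

Note that events are only kept for a short window (an hour by default, as far as I know), so the watch has to be running while the problem actually happens.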

  • Kubernetes version: v1.16.13-eks-2ba888 on AWS EKS
  • Jenkins version: 2.257
  • Kubernetes plugin version: 1.27.2

Any advice would be appreciated.

Thanks

UPDATE:

I have uploaded three slave pod manifest examples here where you can see the resources allocated. The above issue occurs in each of these running pods.

The node pool is controlled by the Kubernetes cluster-autoscaler (v1.14.6) and uses AWS t3a.large (2 vCPU, 8 GB memory) instances.
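
In case the uploaded manifests go away, the resource section of the slave pods is shaped roughly like this. The numbers below are placeholders for illustration, not the exact values from my manifests:

```yaml
# Placeholder figures only; the real values are in the uploaded manifests
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "1"
    memory: "1Gi"
```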

UPDATE 2:

I believe that I have found the cause of the problem. I disabled the [cluster-autoscaler](https://github.com/kubernetes/autoscaler) (v1.14.6) and the problem stopped.

So what seems to be happening is that the autoscaler is removing the node that the slave pod is running on. I know that taints can be used to tell the autoscaler not to remove a node, but is there a way to do this dynamically, so that it won't remove a node while a certain pod is running on it, without having to develop something new?
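
If I am reading the cluster-autoscaler FAQ correctly, pods can be annotated with `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"`, which should stop the autoscaler from draining the node they are running on. A minimal sketch of what I think the slave pod manifest would need (the pod name and image are placeholders, not my actual values):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: slave-jenkins-example  # placeholder name
  annotations:
    # Tells the cluster-autoscaler it must not remove the node
    # while this pod is running on it
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
  containers:
    - name: jnlp
      image: jenkins/inbound-agent  # placeholder agent image
```

As far as I can tell, the Jenkins Kubernetes plugin's pod templates accept raw YAML, so the annotation could be added there instead of hand-editing each manifest.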
