Kubernetes: limit the number of retries


For some context, I'm creating an API in Python that creates K8s Jobs with user input passed in as environment variables.
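For reference, here is a minimal sketch of how such a Job might be created with the official kubernetes Python client (the image, Job name, and values are hypothetical placeholders):

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster
batch = client.BatchV1Api()

container = client.V1Container(
    name="user-task",
    image="example/user-image:latest",  # hypothetical, user-selected image
    env=[client.V1EnvVar(name="USER_INPUT", value="whatever the user sent")],
    resources=client.V1ResourceRequirements(
        requests={"cpu": "100m", "memory": "128Mi"},
        limits={"cpu": "500m", "memory": "256Mi"},
    ),
)
job = client.V1Job(
    metadata=client.V1ObjectMeta(name="user-job-123"),  # hypothetical name
    spec=client.V1JobSpec(
        backoff_limit=0,  # see below: this only limits retries of *failed* Pods
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(restart_policy="Never", containers=[container]),
        ),
    ),
)
batch.create_namespaced_job(namespace="default", body=job)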

Sometimes the selected image does not exist or has been deleted, a Secret does not exist, or a Volume isn't created. This leaves the Job's Pod in a CrashLoopBackOff or ImagePullBackOff state.

First, I'm wondering whether resources are allocated to the Job while it is in this state?

If yes, I don't want the Job to loop forever and lock resources for a Job that will never start.

I've set backoffLimit to 0, but that only applies when the Job detects a failed Pod and launches another Pod to retry. In my case, I know that if a Pod fails for a Job, it's mostly due to OOM or code that will always fail because of the user input, so retrying will always fail.

But it doesn't limit the number of retries while a Pod is in CrashLoopBackOff or ImagePullBackOff. Is there a way to terminate or fail the Job in that case? I don't want to kill it, just free the resources and keep the events in (status.container.state.waiting.reason + status.container.state.waiting.message) or (status.container.state.terminated.reason + status.container.state.terminated.exit_code).
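For illustration, a sketch of reading those fields with the Python client (the label selector relies on the job-name label the Job controller adds automatically; the Job name is hypothetical):

from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

pods = core.list_namespaced_pod(
    namespace="default",
    label_selector="job-name=user-job-123",  # label added by the Job controller
)
for pod in pods.items:
    for cs in pod.status.container_statuses or []:
        if cs.state.waiting:       # e.g. ImagePullBackOff, CrashLoopBackOff
            print(cs.name, cs.state.waiting.reason, cs.state.waiting.message)
        elif cs.state.terminated:  # e.g. OOMKilled, or a non-zero exit code
            print(cs.name, cs.state.terminated.reason, cs.state.terminated.exit_code)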

Is there an option I can set at creation time to limit the number of retries, so that resources are freed, but without removing the Job, so the logs and events are kept?


There are 2 answers

Bguess

I have tested your first question, and YES, even if a Pod is in the CrashLoopBackOff state, the resources are still allocated to it! Here is my test: Are the Kubernetes requested resources by a pod still allocated to it when it is in CrashLoopBackOff state?
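One rough way to check this yourself with the Python client (the node name is hypothetical): list the Pods scheduled on a node together with their resource requests. A CrashLoopBackOff Pod still appears with its requests, which stay reserved on the node as long as the Pod remains scheduled.

from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

node_name = "worker-1"  # hypothetical node name
pods = core.list_pod_for_all_namespaces(field_selector=f"spec.nodeName={node_name}")
for pod in pods.items:
    for c in pod.spec.containers:
        requests = c.resources.requests if c.resources and c.resources.requests else {}
        print(pod.metadata.namespace, pod.metadata.name, pod.status.phase, requests)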

Thanks for your question!

Mostafa Wael

Long story short: unfortunately, there is no such option in Kubernetes.

However, you can do this manually by checking whether the Pod is in CrashLoopBackOff and then freeing its resources or simply deleting the Pod itself.

The following script deletes any Pod in the CrashLoopBackOff state in the specified namespace:

#!/bin/bash
# This script checks the passed namespace and deletes pods in the 'CrashLoopBackOff' state.

NAMESPACE="test"
delpods=$(sudo kubectl get pods -n ${NAMESPACE} |
  grep -i 'CrashLoopBackOff' |
  awk '{print $1}')

for i in $delpods; do

  sudo kubectl delete pod $i --force=true --wait=false \
    --grace-period=0 -n ${NAMESPACE}

done
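Since the question's API is in Python, roughly the same idea can be sketched with the official kubernetes client instead of kubectl (a sketch, not a drop-in replacement): find Pods whose containers are waiting in CrashLoopBackOff and delete them.

from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()
NAMESPACE = "test"

for pod in core.list_namespaced_pod(namespace=NAMESPACE).items:
    waiting_reasons = [
        cs.state.waiting.reason
        for cs in (pod.status.container_statuses or [])
        if cs.state and cs.state.waiting
    ]
    if "CrashLoopBackOff" in waiting_reasons:
        core.delete_namespaced_pod(
            name=pod.metadata.name,
            namespace=NAMESPACE,
            grace_period_seconds=0,  # same effect as --grace-period=0 above
        )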

Since we passed the option --grace-period=0, the Pod won't automatically restart again. But if, after using this script or running it as a job, you notice that the Pod keeps restarting and falling into the CrashLoopBackOff state again for some reason, there is a workaround: change the restart policy of the Pod:

A PodSpec has a restartPolicy field with possible values Always, OnFailure, and Never. The default value is Always. restartPolicy applies to all Containers in the Pod. restartPolicy only refers to restarts of the Containers by the kubelet on the same node. Exited Containers that are restarted by the kubelet are restarted with an exponential back-off delay (10s, 20s, 40s …) capped at five minutes, and is reset after ten minutes of successful execution. As discussed in the Pods document, once bound to a node, a Pod will never be rebound to another node.

See more details in the documentation or from here.
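In the Python client this is the restart_policy field on the Job's pod template, for example (names are hypothetical):

from kubernetes import client

pod_spec = client.V1PodSpec(
    restart_policy="Never",  # Jobs accept "Never" or "OnFailure"; "Always" is the default only for bare Pods
    containers=[client.V1Container(name="user-task", image="example/user-image:latest")],
)
job_spec = client.V1JobSpec(
    backoff_limit=0,
    template=client.V1PodTemplateSpec(spec=pod_spec),
)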

And that is it! Happy hacking.

Regarding the first question, it is already answered by Bguess here.