Exiting the app in a DaemonSet's pod does not create a new pod, but restarts the existing one


I'm facing a problem with a Kubernetes DaemonSet that may mean I don't correctly understand how pods work.

I have a DaemonSet which runs an app that requires NVIDIA's NVML library. I'm running this DaemonSet in a containerd cluster using NVIDIA's GPU Operator.

When I join a new node with a GPU to the cluster, NVIDIA's GPU Operator takes some time to install the drivers and the container runtime. Meanwhile, my DaemonSet starts its pod on the node, and my app fails to initialize NVML. I then force my app to exit after a 60 period, hoping that the DaemonSet will start a new pod. But in fact, the same pod (same name) is restarted with the same context, and my app is never able to initialize NVML.
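To make the exit behavior concrete, here is a minimal sketch of what my app does, not the real code. The names `wait_for_init` and `hypothetical_nvml_init` are placeholders made up for illustration. The sketch also assumes the app retries initialization at a fixed interval and treats the 60 as seconds.

```python
import sys
import time


def wait_for_init(try_init, timeout_s=60.0, interval_s=5.0,
                  clock=time.monotonic, sleep=time.sleep):
    """Call try_init() until it returns True or timeout_s has elapsed.

    Returns True on success and False if the deadline passes first.
    clock and sleep are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if try_init():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def hypothetical_nvml_init():
    # Placeholder (assumption): stands in for the app's real NVML
    # initialization and always fails here, as it does on a fresh node.
    return False


def main():
    # Exit with a non-zero status when NVML never becomes available.
    if not wait_for_init(hypothetical_nvml_init):
        sys.exit(1)
```

Calling `main()` from the container's entry point ends the process with exit code 1 once the timeout passes, which is the point at which I see the same pod being restarted rather than replaced.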

If I delete the pod manually, a new pod (new name) is created and my app can initialize NVML right away. If I rollout restart my DaemonSet, a new pod (new name) is also created and my app can initialize NVML right away.

How can I get the same behavior automatically? I thought that exiting the app would stop the pod and force the DaemonSet to create a new one, but that does not seem to be the case.

My goal is to get my app up and running automatically when I join a new node, once NVIDIA's GPU Operator has finished all its work (driver install, container runtime install, node labeling, etc.).

Thanks for any clue


There are 0 answers