I have custom Sidekiq metrics coming from prometheus-adapter, and I have set up an HPA using those queue metrics. When the number of jobs in the Sidekiq queue goes above, say, 1000, the HPA scales up to 10 new pods, and each pod then works through roughly 100 queued jobs. When the queue drops to, say, 400, the HPA scales down. The problem is that during scale-down the HPA kills pods, say 4 of them, that are still running jobs (each of those pods was running 30-50 jobs). When the HPA deletes these 4 pods, the jobs running on them are terminated as well and marked as failed in Sidekiq.
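For context, the HPA is wired up roughly like this (the metric name, targets, and replica counts are illustrative, not my exact config):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sidekiq-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sidekiq-worker             # hypothetical deployment name
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          name: sidekiq_queue_size   # hypothetical metric exposed via prometheus-adapter
        target:
          type: AverageValue
          averageValue: "100"        # roughly 100 queued jobs per pod
```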
What I want to achieve is to stop the HPA from deleting pods that are still executing jobs. Moreover, I want the HPA not to scale down even after the load has dropped to a minimum, and instead delete pods only when the Sidekiq queue metric reports 0 jobs.
Is there any way to achieve this?
As per the discussion and the work done by @Hb_1993, it can be done with a pre-stop hook to delay the eviction, where the delay is based on the operation time or on some logic that determines whether processing has finished.
A pre-stop hook is a container lifecycle hook that is invoked before a pod is terminated. We can attach to this event and perform some logic, such as a ping check, to make sure the pod has finished processing its current work before it is actually shut down.
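As a rough sketch (assuming Sidekiq 5+ signal handling, where TSTP tells the worker to stop fetching new jobs, and assuming the Sidekiq process runs as PID 1 in the container), the hook could look something like this:

```yaml
spec:
  template:
    spec:
      # Must be longer than the preStop delay plus Sidekiq's own shutdown timeout,
      # otherwise the kubelet sends SIGKILL before in-flight jobs can finish.
      terminationGracePeriodSeconds: 400
      containers:
        - name: sidekiq-worker
          image: my-app:latest        # hypothetical image
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - -c
                  # Quiet Sidekiq so it stops picking up new jobs, then give
                  # the in-flight jobs time to drain before SIGTERM arrives.
                  - kill -TSTP 1 && sleep 300
```

Instead of a fixed sleep, the pre-stop command could poll Sidekiq/Redis in a loop until the busy count drops to zero, which is closer to what the article linked below describes.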
PS: Use this solution with a pinch of salt, as it might not work in all cases or might produce unintended results.
More details can be found in this article:
https://blog.gruntwork.io/delaying-shutdown-to-wait-for-pod-deletion-propagation-445f779a8304