Our microservice [a Spring Boot Java application] is hosted in a clustered environment with HPA (Horizontal Pod Autoscaler) enabled.
We have a few jobs that run daily. They are Quartz jobs integrated into the Java application. We are using version 2.3.0 of the Quartz dependencies and version 2.1.1 of spring-boot-starter-quartz.
On some days [roughly once every 1-2 weeks] a job does not execute. The issue is intermittent.
So far, when this happens, we re-trigger the job manually via an API. However, we want to resolve this permanently.
What we observe in the logs: the job executes on pod X. The next day, pod X is scaled down [because of HPA]. The job is still picked up on pod X, which has already been killed. It prints logs saying that the scheduler is starting and picking up the job, but there is no error and the job simply does not execute any further. Then, on the following day, the job executes successfully on another pod.
As there are no errors, we have no clue about the cause.
Does anybody know anything about this and how it can be resolved?
UPDATES==>>
[09 Nov 2022] After further analysis on the infra side, we found that the pod was killed with an OOM error. This happened because of a memory crunch on the node at that time.
So the question now is how we can resume the job on another working pod if the pod is killed for reasons like this, which are outside the control of the application/service. Does anybody have an idea?
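
To make the question concrete, here is a minimal sketch of the kind of behaviour we are after. It is not our actual code. It assumes a clustered JDBC job store (for example `spring.quartz.job-store-type=jdbc` together with `spring.quartz.properties.org.quartz.jobStore.isClustered=true`), because as far as I understand Quartz can only hand an interrupted job to another instance when all instances share the same job store. The class name `RecoverableJobSketch`, the job class `DailyReportJob`, the identities and the 02:00 schedule are all hypothetical placeholders.

```java
import org.quartz.CronScheduleBuilder;
import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;

public class RecoverableJobSketch {

    // Hypothetical job; the name and body are placeholders, not our real job.
    public static class DailyReportJob implements Job {
        @Override
        public void execute(JobExecutionContext context) {
            // isRecovering() reports whether Quartz is re-firing this execution
            // because the scheduler instance that originally ran it failed.
            if (context.isRecovering()) {
                System.out.println("Re-running job after a scheduler instance failure");
            }
            // ... actual work goes here; it should be safe to run more than once ...
        }
    }

    // Job definition that asks Quartz to re-execute the job if the instance
    // running it dies mid-execution (assumes a clustered JDBC job store).
    public static JobDetail dailyJobDetail() {
        return JobBuilder.newJob(DailyReportJob.class)
                .withIdentity("dailyReportJob", "daily") // hypothetical identity
                .requestRecovery(true)
                .storeDurably()
                .build();
    }

    // Daily trigger; the misfire instruction asks Quartz to fire once as soon
    // as possible if the scheduled fire time was missed.
    public static Trigger dailyTrigger(JobDetail job) {
        return TriggerBuilder.newTrigger()
                .forJob(job)
                .withIdentity("dailyReportTrigger", "daily") // hypothetical identity
                .withSchedule(CronScheduleBuilder.dailyAtHourAndMinute(2, 0) // assumed schedule
                        .withMisfireHandlingInstructionFireAndProceed())
                .build();
    }
}
```

With something like this, the hope is that when a pod is OOM-killed in the middle of a run, a surviving instance notices the missed cluster check-in and re-fires the job with `isRecovering()` returning true. Whether this actually covers the scale-down/OOM scenario described above is exactly what I would like to confirm.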