In my project, GKE runs many jobs daily. Sometimes I see that a job runs twice: the first time partially and the second time fully, even though "restartPolicy: Never" is defined. It happens very rarely (about once per 200-300 runs).
This is an example:
I 2020-12-03T00:12:45Z Started container mot-test-deleteoldvalidations-container
I 2020-12-03T00:12:45Z Created container mot-test-deleteoldvalidations-container
I 2020-12-03T00:12:45Z Successfully pulled image "gcr.io/xxxxx/mot-del-old-validations:v16"
I 2020-12-03T00:12:40Z Pulling image "gcr.io/xxxxx/mot-del-old-validations:v16"
I 2020-12-03T00:12:39Z Stopping container mot-test-deleteoldvalidations-container
I 2020-12-03T00:01:59Z Started container mot-test-deleteoldvalidations-container
I 2020-12-03T00:01:59Z Created container mot-test-deleteoldvalidations-container
I 2020-12-03T00:01:59Z Successfully pulled image "gcr.io/xxxx/mot-del-old-validations:v16"
I 2020-12-03T00:01:40Z Pulling image "gcr.io/xxxxx/mot-del-old-validations:v16"
From the job's YAML:
spec:
  backoffLimit: 0
  completions: 1
  parallelism: 1
  resources:
    limits:
      cpu: "1"
      memory: 2500Mi
    requests:
      cpu: 500m
      memory: 2Gi
  dnsPolicy: ClusterFirst
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  terminationGracePeriodSeconds: 30
  volumes:
The reason for stopping the container is "Killing". How can I avoid this behavior?
As you mention in the comment section, your restartPolicy is set to Never. You have also set spec.backoffLimit, spec.completions and spec.parallelism, which should work. However, the documentation (Handling Pod and container failures) mentions that this behavior is possible and is not considered a problem:

Note that even if you specify .spec.parallelism = 1 and .spec.completions = 1 and .spec.template.spec.restartPolicy = "Never", the same program may sometimes be started twice.
In addition, the CronJob documentation recommends making jobs idempotent as a best practice.
As your whole job manifest is still a mystery, two workarounds come to mind. Depending on the scenario, one of them might help.

First workaround
Use podAntiAffinity, which won't allow a second pod with the same label/selector to be deployed alongside the first one; see the sketch below.
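A minimal sketch of how that could look in the Job's pod template, assuming a hypothetical app: mot-test-deleteoldvalidations label is added for the rule to match; note that with topologyKey: kubernetes.io/hostname the rule only blocks a second such pod on the same node:

apiVersion: batch/v1
kind: Job
metadata:
  name: mot-test-deleteoldvalidations
spec:
  backoffLimit: 0
  completions: 1
  parallelism: 1
  template:
    metadata:
      labels:
        app: mot-test-deleteoldvalidations        # hypothetical label matched by the rule below
    spec:
      restartPolicy: Never
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: mot-test-deleteoldvalidations
            topologyKey: kubernetes.io/hostname   # scope of the rule: one such pod per node
      containers:
      - name: mot-test-deleteoldvalidations-container
        image: gcr.io/xxxxx/mot-del-old-validations:v16

The label/selector pair is what the scheduler uses to detect the duplicate, so it should be unique to this job.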
Second workaround
Use an initContainer as a lock: the first pod takes the lock, and if a second pod detects that the lock is already held, it waits 3-5 seconds and exits. A sketch is shown below.
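A minimal sketch of that idea, assuming a hypothetical ConfigMap named mot-del-old-validations-lock is used as the lock and the Job's service account is allowed to create ConfigMaps (kubectl create is atomic, so only one pod can win the lock):

  template:
    spec:
      restartPolicy: Never
      initContainers:
      - name: acquire-lock                 # hypothetical lock based on ConfigMap creation
        image: bitnami/kubectl:latest      # any image that ships kubectl
        command:
        - sh
        - -c
        - |
          # Creating the ConfigMap is atomic: the call fails if it already exists,
          # so only the first pod acquires the lock.
          if kubectl create configmap mot-del-old-validations-lock; then
            echo "Lock acquired, handing over to the main container"
          else
            echo "Lock already held by another pod, backing off"
            sleep 5
            exit 1   # non-zero exit keeps the main container from running in the duplicate pod
          fi
      containers:
      - name: mot-test-deleteoldvalidations-container
        image: gcr.io/xxxxx/mot-del-old-validations:v16

The lock ConfigMap has to be removed after a successful run (for example as the last step of the main container or by a cleanup step), otherwise the next scheduled run will never acquire it.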