gke job container restarted

245 views Asked by At

In my project, GKE runs many jobs daily. Sometimes I see that a job runs twice: the first time partially and the second time fully, although "restartPolicy: Never" is defined. It happens very seldom (about one time per 200 - 300 runs).

This is an example:

I 2020-12-03T00:12:45Z Started container mot-test-deleteoldvalidations-container 
I 2020-12-03T00:12:45Z Created container mot-test-deleteoldvalidations-container 
I 2020-12-03T00:12:45Z Successfully pulled image "gcr.io/xxxxx/mot-del-old-validations:v16" 
I 2020-12-03T00:12:40Z Pulling image "gcr.io/xxxxx/mot-del-old-validations:v16" 
I 2020-12-03T00:12:39Z Stopping container mot-test-deleteoldvalidations-container 
I 2020-12-03T00:01:59Z Started container mot-test-deleteoldvalidations-container 
I 2020-12-03T00:01:59Z Created container mot-test-deleteoldvalidations-container 
I 2020-12-03T00:01:59Z Successfully pulled image "gcr.io/xxxx/mot-del-old-validations:v16" 
I 2020-12-03T00:01:40Z Pulling image "gcr.io/xxxxx/mot-del-old-validations:v16" 

From job's YAML:

spec:
  backoffLimit: 0
  completions: 1
  parallelism: 1
resources:
          limits:
            cpu: "1"
            memory: 2500Mi
          requests:
            cpu: 500m
            memory: 2Gi
        nsPolicy: ClusterFirst
      restartPolicy: Never
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes: 

The reason for stopping container is "Killing". How can I avoid this behavior?

1

There are 1 answers

0
PjoterS On

As you mention in comment section, your NetworkPolicy is set to Never. You have also set spec.backoffLimit, spec.complementions and spec.parallelism which should work. However, the Documentation - Handling Pod and container failures mentioned that this behavior is possible and it's not considered as a problem.

Note that even if you specify .spec.parallelism = 1 and .spec.completions = 1 and .spec.template.spec.restartPolicy = "Never", the same program may sometimes be started twice.

As addition, in CronJob documentation, the best practise is to make jobs Idempotent.

A cron job creates a job object about once per execution time of its schedule. We say "about" because there are certain circumstances where two jobs might be created, or no job might be created. We attempt to make these rare, but do not completely prevent them. Therefore, jobs should be idempotent.

In computing, an idempotent operation is one that has no additional effect if it is called more than once with the same input parameters. For example, removing an item from a set can be considered an idempotent operation on the set.

As your whole job manifest is still a mystery, two workarounds come to my mind. Depends on the scenario it might help.

First workaround

Use PodAntiAffinity which won't allow deploy the second pod with the same label/selector.

Second workaround

Use initContainer lock, so the first pod puts a lock, and the second pod, if lock is detected wait for 3-5 seconds and exit.

Because init containers run to completion before any app containers start, init containers offer a mechanism to block or delay app container startup until a set of preconditions are met.