Can you prevent Google AI platform from terminating an evaluator before it's complete?

I'm running a training job on Google AI Platform, just training a simple tf.Estimator. Is there a way to prevent the whole job from completing while an evaluation task is still running?

[Screenshot: evaluator replica being killed by gcloud]

There is 1 answer

Answer by rodvictor

I remember someone using Kubeflow on GCP who needed to pass the '--stream-logs' flag when submitting an AI Platform training job with the gcloud command. Otherwise, the job would be stopped before completion.

According to the documentation,

'with the --stream-logs flag, the job will continue to run after this command exits and must be cancelled with gcloud ai-platform jobs cancel JOB_ID'

It is worth trying this to see whether, in your case, the flag also keeps the job running instead of terminating it prematurely.
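As a rough sketch of what trying the flag could look like, the Python snippet below only assembles the gcloud command line and prints it; it does not submit anything. The job name, package path, module name, region and bucket are all hypothetical placeholders, and your job may well need further flags (for example a runtime version or a config file) that are not shown here.

```python
import shlex


def build_submit_command(job_id, package_path, module_name, region, job_dir,
                         stream_logs=True):
    """Return the argv list for `gcloud ai-platform jobs submit training`."""
    cmd = [
        "gcloud", "ai-platform", "jobs", "submit", "training", job_id,
        "--package-path", package_path,
        "--module-name", module_name,
        "--region", region,
        "--job-dir", job_dir,
    ]
    if stream_logs:
        # Keep the command attached to the job and stream its logs.
        cmd.append("--stream-logs")
    return cmd


# All values below are placeholders (assumptions), not taken from the question.
cmd = build_submit_command(
    job_id="my_estimator_job",
    package_path="trainer/",
    module_name="trainer.task",
    region="us-central1",
    job_dir="gs://my-bucket/my_estimator_job",
)
print(shlex.join(cmd))
```

To actually submit the job, you could paste the printed line into a shell, or pass `cmd` to `subprocess.run(cmd, check=True)`.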

If the issue persists with the flag enabled, inspect the job's logs to better understand the root cause of this behaviour.
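For the log inspection, a minimal sketch (again only building and printing the commands, with a hypothetical job name) might use `gcloud ai-platform jobs describe` to check the job's state and `gcloud ai-platform jobs stream-logs` to read its logs:

```python
import shlex


def build_inspection_commands(job_id):
    """Return gcloud argv lists for checking a job's state and its logs."""
    return {
        # Shows the job's current state and any error message.
        "describe": ["gcloud", "ai-platform", "jobs", "describe", job_id],
        # Prints the job's logs, including those of its replicas.
        "stream_logs": ["gcloud", "ai-platform", "jobs", "stream-logs", job_id],
    }


# "my_estimator_job" is a placeholder job name (an assumption).
commands = build_inspection_commands("my_estimator_job")
for name, argv in commands.items():
    print(f"{name}: {shlex.join(argv)}")
```

The log lines from the evaluator replica around the time the job finished are probably the most useful place to start.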