AI Platform Training Job exited with a non-zero status of 1. Termination reason: Error


My TensorFlow training job is exiting with a non-zero status of 1 and isn't giving any helpful error messages. The traceback appears to be hidden [...], and the link provided is similarly uninformative. Here's what the logs are outputting:

[Screenshot of the job's log output]

I have checked the service account: it has the Cloud ML Service Agent role, which includes the logging.logEntries.create permission. The description of the Cloud ML Service Agent role also states:

Cloud ML service agent can act as log writer, Cloud Storage admin, Artifact Registry Reader, BigQuery writer, and service account access token creator.

So I'm assuming it has permission to write logs to the logger. My question is: how do I troubleshoot why my job is failing with this error?


There is 1 answer

Answer by Rajith Thennakoon

This could be because your training VM instance doesn't have sufficient permissions to write logs. Find the service account name of the VM, go to the IAM page, and assign the Logs Writer role (roles/logging.logWriter) to that service account.
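
If you'd rather check this from a script than click through the console, here is a minimal Python sketch. The helper names (has_log_writer, grant_command, fetch_policy) are my own and not part of any Google library, and the project ID and service account you pass in are placeholders for your own values. It assumes the gcloud CLI is installed and authenticated, and it only looks for a direct, unconditional binding of roles/logging.logWriter, so a broader role that also grants logging.logEntries.create will not be detected.

```python
import json
import subprocess
from typing import List

LOG_WRITER_ROLE = "roles/logging.logWriter"


def has_log_writer(policy: dict, service_account: str) -> bool:
    """Return True if the policy binds the Logs Writer role directly to the account.

    `policy` is the JSON structure printed by
    `gcloud projects get-iam-policy PROJECT_ID --format=json`.
    """
    member = f"serviceAccount:{service_account}"
    return any(
        binding.get("role") == LOG_WRITER_ROLE
        and member in binding.get("members", [])
        for binding in policy.get("bindings", [])
    )


def grant_command(project_id: str, service_account: str) -> List[str]:
    """Build the gcloud command that grants Logs Writer to the service account."""
    return [
        "gcloud", "projects", "add-iam-policy-binding", project_id,
        f"--member=serviceAccount:{service_account}",
        f"--role={LOG_WRITER_ROLE}",
    ]


def fetch_policy(project_id: str) -> dict:
    """Fetch the project's IAM policy via the gcloud CLI (requires gcloud auth)."""
    result = subprocess.run(
        ["gcloud", "projects", "get-iam-policy", project_id, "--format=json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

To use it, call fetch_policy with your project ID, pass the result to has_log_writer along with the VM's service account email, and if it returns False, run the list returned by grant_command with subprocess.run(..., check=True). Then resubmit the training job and see whether the full traceback shows up in the logs.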