My TensorFlow training job exits with a non-zero status of 1 and gives no helpful error message. The traceback looks like it's hidden [...] and the link provided is similar. Here's what the logs are outputting:
I have checked the service account: it has the Cloud ML Service Agent role, which includes the logging.logEntries.create permission. The description of the Cloud ML Service Agent role also states:
Cloud ML service agent can act as log writer, Cloud Storage admin, Artifact Registry Reader, BigQuery writer, and service account access token creator.
So I'm assuming it has permission to write logs to the logger... My question is: how do I troubleshoot why my job is failing like this?
This could be because your training VM instance doesn't have sufficient permission to write logs. Get the service account name of the VM, go to IAM roles, and assign the Logs Writer role (roles/logging.logWriter) to the service account.
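
If you would rather script the role assignment than click through the console, something like the following might work. It is only a sketch: it assumes the `gcloud` CLI is installed and authenticated, and the project ID and service account email shown are hypothetical placeholders that you would replace with your own values. The helper names (`build_grant_command`, `grant_log_writer`) are made up for illustration.

```python
import subprocess
from typing import List

# IAM identifier of the "Logs Writer" role.
LOG_WRITER_ROLE = "roles/logging.logWriter"


def build_grant_command(project_id: str, service_account_email: str) -> List[str]:
    """Build the gcloud argv that binds the Logs Writer role to a service account."""
    return [
        "gcloud", "projects", "add-iam-policy-binding", project_id,
        f"--member=serviceAccount:{service_account_email}",
        f"--role={LOG_WRITER_ROLE}",
    ]


def grant_log_writer(project_id: str, service_account_email: str,
                     dry_run: bool = True) -> List[str]:
    """Print the command (dry run) or actually execute it with gcloud."""
    cmd = build_grant_command(project_id, service_account_email)
    if dry_run:
        # Only show what would be run; nothing is changed in the project.
        print(" ".join(cmd))
    else:
        # Raises CalledProcessError if gcloud exits with a non-zero status.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Hypothetical placeholders: substitute your own project and VM service account.
    grant_log_writer(
        "my-project-id",
        "my-training-sa@my-project-id.iam.gserviceaccount.com",
        dry_run=True,
    )
```

Running it with `dry_run=True` only prints the command so you can check it first. Switch to `dry_run=False` once the project and service account look right, then re-run the training job and see whether the full traceback shows up in the logs.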