How to submit a GCP AI Platform training job from inside a GCP Cloud Build pipeline?

I have a pretty standard CI pipeline in Cloud Build for my container-based Machine Learning training model (a rough sketch follows the list below):

  • check Python errors using flake8
  • check syntax and style issues using pylint, pydocstyle, ...
  • build a base container (CPU/GPU)
  • build a specialized ML container for my model
  • check the installed packages for vulnerabilities
  • run unit tests
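
A minimal sketch of what such a cloudbuild.yaml could look like; the image names, Dockerfiles and test paths are assumptions, the vulnerability scan step is omitted, and $SHORT_SHA is only populated for triggered builds:

# Hypothetical cloudbuild.yaml sketch of the steps above; names and paths are assumptions
steps:
  # check Python errors, syntax and style
  - name: 'python:3.8'
    entrypoint: 'bash'
    args: ['-c', 'pip install flake8 pylint pydocstyle && flake8 src && pylint src && pydocstyle src']
  # build the base container, then the specialized ML container
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', 'gcr.io/$PROJECT_ID/ml-base:$SHORT_SHA', '-f', 'Dockerfile.base', '.']
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', 'gcr.io/$PROJECT_ID/ml-model:$SHORT_SHA', '-f', 'Dockerfile.model', '.']
  # run the unit tests inside the freshly built model image
  - name: 'gcr.io/$PROJECT_ID/ml-model:$SHORT_SHA'
    entrypoint: 'python'
    args: ['-m', 'pytest', 'tests/']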

Now, in Machine Learning it is impossible to validate a model without testing it on real data. Normally we add 2 extra checks (a sketch follows the list):

  • Fix all random seeds and run on a test dataset to check that we get exactly the same results
  • Train the model on a single batch and check that we can overfit it and drive the loss to zero
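
For illustration, a sketch of how such a check could be submitted from the pipeline as a short AI Platform training job running the model's custom container; the region placeholder, the image tag and the trainer flags after -- (--overfit-single-batch, --seed) are assumptions, not part of the original setup:

# Hypothetical step: run the overfit check as a short AI Platform training job
# The image tag and the trainer flags after -- are assumptions
- name: 'gcr.io/cloud-builders/gcloud'
  entrypoint: 'bash'
  args:
    - -c
    - |
        gcloud ai-platform jobs submit training <UNIQUE_JOB_NAME> \
          --region <REGION> \
          --master-image-uri gcr.io/$PROJECT_ID/ml-model:$SHORT_SHA \
          --stream-logs \
          -- --overfit-single-batch --seed 42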

This allows catching issues inside the model code. In my setup, Cloud Build runs in a build GCP project while the data sits in another GCP project.

Q1: has anybody managed to use the AI Platform training service from Cloud Build to train on data sitting in another GCP project?

Q2: how do I tell Cloud Build to wait until the AI Platform training job has finished, and check its status (succeeded/failed)? Looking at the documentation, the only option seems to be --stream-logs, but that seems suboptimal (with this option I saw some huge delays).

guillaume blaquiere (accepted answer)

When you submit an AI Platform training job, you can specify a service account email to use.
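
As a sketch, the submission could then look something like this; the service account name and the placeholders are illustrative, and depending on your gcloud version the --service-account flag may only be available on the beta track (gcloud beta ai-platform jobs submit training):

# Submit the job under a dedicated service account (names are placeholders)
gcloud ai-platform jobs submit training <UNIQUE_JOB_NAME> \
  --region <REGION> \
  --service-account training-sa@<BUILD_PROJECT>.iam.gserviceaccount.com \
  <your params>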

Make sure that the service account has sufficient permissions in the other project to access the data there.
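
For example, if the training data sits in a Cloud Storage bucket of the data project, a grant along these lines could be enough (the project IDs, the account name and the exact role are assumptions to adapt to your case):

# Give the training service account read access to the data project (placeholders)
gcloud projects add-iam-policy-binding <DATA_PROJECT> \
  --member="serviceAccount:training-sa@<BUILD_PROJECT>.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer"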

For your second question, you have 2 solutions:

  • Use --stream-logs as you mentioned. If you don't want the logs in your Cloud Build, you can redirect stdout and/or stderr to /dev/null:
- name: 'gcr.io/cloud-builders/gcloud'
  entrypoint: 'bash'
  args:
    - -c
    - |
         gcloud ai-platform jobs submit training <your params> --stream-logs >/dev/null 2>/dev/null

  • Or you can create a loop that checks the job status:

- name: 'gcr.io/cloud-builders/gcloud'
  entrypoint: 'bash'
  args:
    - -c
    - |
        JOB_NAME=<UNIQUE Job NAME>
        gcloud ai-platform jobs submit training $${JOB_NAME} <your params> 
        # test the job status every 60 seconds
        while [ -z "$$(gcloud ai-platform jobs describe $${JOB_NAME} | grep SUCCEEDED)" ]; do sleep 60; done

My test here is simple, but you can customize the status checks as you want to match your requirements, as in the sketch below.
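
For example, a variant along these lines would also detect a failed or cancelled job and make the build step fail; it is only a sketch, relying on the state field of the AI Platform job resource:

- name: 'gcr.io/cloud-builders/gcloud'
  entrypoint: 'bash'
  args:
    - -c
    - |
        JOB_NAME=<UNIQUE Job NAME>
        gcloud ai-platform jobs submit training $${JOB_NAME} <your params>
        # poll the job state every 60 seconds until it reaches a terminal state
        while true; do
          STATE=$$(gcloud ai-platform jobs describe $${JOB_NAME} --format='value(state)')
          case "$${STATE}" in
            SUCCEEDED) echo "Job succeeded"; break;;
            FAILED|CANCELLED) echo "Job ended in state $${STATE}"; exit 1;;
            *) sleep 60;;
          esac
        done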

Don't forget to set the Cloud Build timeout accordingly, for example:
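
# cloudbuild.yaml: allow up to 2 hours for the whole build (the default is 10 minutes);
# adjust the value to your expected training time
timeout: '7200s'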