I have a fairly standard CI pipeline in Cloud Build for my container-based machine learning training model:
- check Python errors with flake8
- check syntax and style issues with pylint, pydocstyle, ...
- build a base container (CPU/GPU)
- build a specialized ML container for my model
- check the installed packages for vulnerabilities
- run unit tests
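For reference, such a pipeline might look roughly like this in a `cloudbuild.yaml` (a hedged sketch — image names, tool choices and step order are illustrative, not my exact config):

```yaml
steps:
  # Lint and style checks
  - name: 'python:3.9'
    entrypoint: 'bash'
    args: ['-c', 'pip install flake8 pylint pydocstyle && flake8 . && pylint src/ && pydocstyle src/']
  # Build the base container (CPU variant shown)
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', 'gcr.io/$PROJECT_ID/ml-base:cpu', '-f', 'Dockerfile.base', '.']
  # Build the specialized ML container on top of the base
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', 'gcr.io/$PROJECT_ID/ml-model:$SHORT_SHA', '.']
  # Vulnerability check of the installed packages (safety is one option)
  - name: 'python:3.9'
    entrypoint: 'bash'
    args: ['-c', 'pip install safety && safety check -r requirements.txt']
  # Unit tests, run inside the freshly built model container
  - name: 'gcr.io/$PROJECT_ID/ml-model:$SHORT_SHA'
    entrypoint: 'pytest'
    args: ['tests/']
```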
Now, in machine learning it is impossible to validate a model without testing it on real data. Normally we add 2 extra checks:
- Fix all random seeds and run on a test dataset, to verify that we get exactly the same results
- Train the model on a single batch, to verify that it can overfit and drive the loss to zero

This allows catching issues inside the model code. In my setup, Cloud Build runs in one GCP project while the data lives in another GCP project.
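The two extra checks can be sketched like this (a minimal toy example with a NumPy linear model, not my actual training code; the loss threshold is illustrative):

```python
import numpy as np

def train_one_batch(seed, steps=1000, lr=0.1):
    """Train a tiny linear model on one fixed batch, with all randomness seeded."""
    rng = np.random.default_rng(seed)      # every random seed fixed here
    X = rng.normal(size=(16, 3))           # a single small batch
    y = X @ np.array([1.5, -2.0, 0.5])     # targets from known weights
    w = rng.normal(size=3)                 # seeded random initialization
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)   # MSE gradient
        w -= lr * grad
    loss = float(np.mean((X @ w - y) ** 2))
    return w, loss

# Check 1: running twice with the same seed must give bit-identical results.
w1, loss1 = train_one_batch(seed=42)
w2, loss2 = train_one_batch(seed=42)
assert np.array_equal(w1, w2) and loss1 == loss2

# Check 2: the model must be able to overfit a single batch (loss -> 0).
assert loss1 < 1e-6, f"model failed to overfit one batch, loss={loss1}"
```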
Q1: Has anybody managed to use the AI Platform training service from Cloud Build to train on data sitting in another GCP project?

Q2: How can I tell Cloud Build to wait until the AI Platform training job has finished, and then check its status (succeeded/failed)? Looking at the documentation, the only option seems to be --stream-logs, but it looks suboptimal (with that option I saw some huge delays).
When you submit an AI Platform training job, you can specify a service account email to use.
Be sure that this service account has enough permissions in the other project to read the data from there.
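For example (project, bucket and account names are hypothetical; depending on your gcloud version the service account is passed with a `--service-account` flag or via the `serviceAccount` field of a training config file):

```shell
# Submit the training job under a dedicated service account.
# That service account must be granted e.g. roles/storage.objectViewer
# on the bucket that lives in the data project.
gcloud ai-platform jobs submit training "my_job_$(date +%s)" \
  --region=europe-west1 \
  --master-image-uri=gcr.io/MY_BUILD_PROJECT/ml-model:latest \
  --service-account=training-sa@MY_BUILD_PROJECT.iam.gserviceaccount.com \
  -- \
  --data-path=gs://my-data-bucket-in-other-project/train/
```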
For your second question, you have 2 solutions:
`--stream-logs`, as you mentioned. If you don't want the logs in your Cloud Build output, you can redirect the stdout and/or the stderr to /dev/null.
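For example (job name and flags are illustrative):

```shell
# --stream-logs makes the command block until the job finishes;
# the streamed logs are discarded instead of cluttering the build output.
gcloud ai-platform jobs submit training my_job \
  --region=europe-west1 \
  --master-image-uri=gcr.io/MY_PROJECT/ml-model:latest \
  --stream-logs > /dev/null 2>&1
```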
Or you can create a loop that polls the job status until it reaches a terminal state.
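A sketch of such a loop (so it runs standalone, the gcloud call is stubbed out by a fake function; in Cloud Build you would use the real `gcloud ai-platform jobs describe` call shown in the comments):

```shell
#!/usr/bin/env bash
# Poll an AI Platform job until it reaches a terminal state.
# The real call inside Cloud Build would be:
#   STATE=$(gcloud ai-platform jobs describe "$JOB_NAME" --format='value(state)')
# Terminal states are SUCCEEDED, FAILED and CANCELLED.

JOB_NAME="my_training_job"          # hypothetical job id

# --- Stub standing in for gcloud: returns RUNNING twice, then SUCCEEDED ---
COUNT_FILE=$(mktemp)
echo 0 > "$COUNT_FILE"
fake_describe() {
  local n=$(( $(cat "$COUNT_FILE") + 1 ))
  echo "$n" > "$COUNT_FILE"
  if [ "$n" -lt 3 ]; then echo "RUNNING"; else echo "SUCCEEDED"; fi
}
# --------------------------------------------------------------------------

RESULT=1
while true; do
  STATE=$(fake_describe)            # replace with the real gcloud call
  case "$STATE" in
    SUCCEEDED)        RESULT=0; break ;;
    FAILED|CANCELLED) RESULT=1; break ;;
    *)                sleep 1 ;;    # use a longer sleep (e.g. 60s) in a real build
  esac
done
rm -f "$COUNT_FILE"
echo "Job $JOB_NAME finished in state $STATE"
# In a Cloud Build step you would end with: exit $RESULT
```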
The status check here is simple, but you can customize the state tests to match your requirements.
Don't forget to set the Cloud Build timeout accordingly, since training jobs can easily exceed the default.