I'm trying to run a hyperparameter tunning job into GCP's AI platform job service, the Tensorflow Research Cloud program approved to me
- 100 preemptible Cloud TPU v2-8 device(s) in zone us-central1-f
- 20 on-demand Cloud TPU v2-8 device(s) in zone us-central1-f
- 5 on-demand Cloud TPU v3-8 device(s) in zone europe-west4-a
I already built a custom model on Tensorflow 2, and I want to run the job specifying the exact zone to take advantage of the TFRC program plus the AI platform job service; right now I have a YAML config file that looks like:
trainingInput:
scaleTier: basic-tpu
region: us-central1
hyperparameters:
goal: MAXIMIZE
hyperparameterMetricTag: val_accuracy
maxTrials: 100
maxParallelTrials: 16
maxFailedTrials: 30
enableTrialEarlyStopping: True
In theory, if I run 16 parallel jobs each one in a separate TPU instance should work but, instead return an error due to the petition exceed the quota of TPU_V2
ERROR: (gcloud.ai-platform.jobs.submit.training) RESOURCE_EXHAUSTED: Quota failure for project ###################. The request for 128 TPU_V2 accelerators for 16 parallel runs exceeds the allowed maximum of 0 A100, 0 TPU_V2_POD, 0 TPU_V3_POD, 16 TPU_V2, 16 TPU_V3, 2 P4, 2 V100, 30 K80, 30 P100, 6 T4 accelerators.
Then I reduce the maxParallelTrials to only 2 and worked, which confirms given the above error message the quota is counting by TPU chip, not by TPU instance.
Therefore I think, maybe I completely misunderstood the approved quota of the TFRC program then I proceed to check if the job is using the us-central1-f zone but turns out that is using an unwanted zone:
-tpu_node={"project": "p091c8a0a31894754-tp", "zone": "us-central1-c", "tpu_node_name": "cmle-training-1597710560117985038-tpu"}"
That behavior doesn't allow me to use effectively the free approved quota, and if I understand correctly the job running in the us-central1-c is taking credits of my account but does not use the free resources. Hence I wonder if there's some way to set the zone in the AI platform job, and also it is possible to pass some flag to use preemptible TPUs.
Unfortunately the two can't be combined.