How to effectively use the TFRC program with the GCP AI platform Jobs

Question

How to effectively use the TFRC program with the GCP AI platform Jobs

454 views Asked by Miguel Diaz At 24 November 2020 at 00:33

I'm trying to run a hyperparameter tunning job into GCP's AI platform job service, the Tensorflow Research Cloud program approved to me

100 preemptible Cloud TPU v2-8 device(s) in zone us-central1-f
20 on-demand Cloud TPU v2-8 device(s) in zone us-central1-f
5 on-demand Cloud TPU v3-8 device(s) in zone europe-west4-a

I already built a custom model on Tensorflow 2, and I want to run the job specifying the exact zone to take advantage of the TFRC program plus the AI platform job service; right now I have a YAML config file that looks like:

trainingInput:
  scaleTier: basic-tpu
  region: us-central1
  hyperparameters:
    goal: MAXIMIZE
    hyperparameterMetricTag: val_accuracy
    maxTrials: 100
    maxParallelTrials: 16
    maxFailedTrials: 30
    enableTrialEarlyStopping: True

In theory, if I run 16 parallel jobs each one in a separate TPU instance should work but, instead return an error due to the petition exceed the quota of TPU_V2

ERROR: (gcloud.ai-platform.jobs.submit.training) RESOURCE_EXHAUSTED: Quota failure for project ###################. The request for 128 TPU_V2 accelerators for 16 parallel runs exceeds the allowed maximum of 0 A100, 0 TPU_V2_POD, 0 TPU_V3_POD, 16 TPU_V2, 16 TPU_V3, 2 P4, 2 V100, 30 K80, 30 P100, 6 T4 accelerators.

Then I reduce the maxParallelTrials to only 2 and worked, which confirms given the above error message the quota is counting by TPU chip, not by TPU instance.

Therefore I think, maybe I completely misunderstood the approved quota of the TFRC program then I proceed to check if the job is using the us-central1-f zone but turns out that is using an unwanted zone:

-tpu_node={"project": "p091c8a0a31894754-tp", "zone": "us-central1-c", "tpu_node_name": "cmle-training-1597710560117985038-tpu"}"

That behavior doesn't allow me to use effectively the free approved quota, and if I understand correctly the job running in the us-central1-c is taking credits of my account but does not use the free resources. Hence I wonder if there's some way to set the zone in the AI platform job, and also it is possible to pass some flag to use preemptible TPUs.

Original Q&A

There are 1 answers

**chrislarkin** · Accepted Answer · 2020-11-24T09:26:37+00:00

chrislarkin On 24 November 2020 at 09:26 BEST ANSWER

Unfortunately the two can't be combined.

TechQA.

How to effectively use the TFRC program with the GCP AI platform Jobs

There are 1 answers

Related Questions in TENSORFLOW2.0

Related Questions in TPU

Related Questions in GOOGLE-CLOUD-TPU

Related Questions in GCP-AI-PLATFORM-TRAINING

Related Questions in GOOGLE-AI-PLATFORM

Popular Questions

Popular Tags

Trending Questions