Reserved GPU instances not being allocated in Dataflow


I am attempting to get a GPU instance running with Google Dataflow, but I cannot proceed because the job keeps failing with the following error:

Workflow failed. Causes: Project XXX has insufficient resource(s) to execute this workflow with 1 instances in region us-east1. Resource summary (required/available): 1/23 instances, 8/15 CPUs, 1/0 NVIDIA P100 GPUs, 25/3896 disk GB, 0/500 SSD disk GB, 1/100 instance groups, 1/150 managed instance groups, 1/100 instance templates, 1/7 in-use IP addresses.
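As a sanity check on the "1/0 NVIDIA P100 GPUs" part of that summary, the regional GPU quota can be dumped with something like this (just a sketch using standard gcloud output; the exact quota metric name to grep for is my assumption):

# print the quota entries around the P100 metric for us-east1
gcloud compute regions describe us-east1 --format=json \
  | grep -B 1 -A 1 NVIDIA_P100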

I have a custom container using a flex template and have followed all the setup instructions. I have a single reserved VM, as per the screenshot below (an n1-standard-8 with a P100 GPU attached), and it is set to use the reservation automatically.

Reserved VM Settings
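(For context, the reservation in that screenshot is equivalent to something along these lines; the reservation name below is a placeholder, while the machine type, accelerator and zone match my setup:)

gcloud compute reservations create my-p100-reservation \
        --zone=us-east1-b \
        --vm-count=1 \
        --machine-type=n1-standard-8 \
        --accelerator=count=1,type=nvidia-tesla-p100
# no --require-specific-reservation, so matching VMs should consume it automatically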

I have launched the job as follows:

gcloud dataflow flex-template run \
        us-east-p100-run \
        --template-file-gcs-location=gs://my-project-pipeline/dataflow/templates/main.json \
        --worker-zone=us-east1-b \
        --region=us-east1 \
        --worker-machine-type=n1-standard-8 \
        --parameters=sdk_container_image=XXX:latest \
        --additional-experiments="dataflow_service_options=automatically_use_created_reservation,worker_accelerator=type:nvidia-tesla-p100;count:1;install-nvidia-driver"
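(For completeness, the reservation itself can be inspected like this; the name is again a placeholder, and I am assuming the fields worth checking are status and specificReservationRequired:)

# expecting status: READY and specificReservationRequired: false
gcloud compute reservations describe my-p100-reservation \
        --zone=us-east1-b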

On the job dashboard I can see that both the experiment to use the reservation and the P100 accelerator configuration have been passed through:

experiments: ['dataflow_service_options=automatically_use_created_reservation', 'worker_accelerator=type:nvidia-tesla-p100;count:1;install-nvidia-driver']
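Outside the UI, the same thing can be double-checked by describing the job (placeholder job ID; I am assuming the experiments appear in the describe output):

gcloud dataflow jobs describe <JOB_ID> \
        --region=us-east1 \
        --format=json \
  | grep worker_accelerator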

Other info: I am using Dataflow Runner v2 and the Apache Beam Python SDK 2.48.0 on Python 3.10.

I have tried various permutations of zone, region, and GPU instance, but the Dataflow job continually tries and fails to allocate enough resources for its execution. Any help would be greatly appreciated.
