I am attempting to get a GPU instance running with Google Dataflow, but I cannot proceed because I keep hitting the following error:
Workflow failed. Causes: Project XXX has insufficient resource(s) to execute this workflow with 1 instances in region us-east1. Resource summary (required/available): 1/23 instances, 8/15 CPUs, 1/0 NVIDIA P100 GPUs, 25/3896 disk GB, 0/500 SSD disk GB, 1/100 instance groups, 1/150 managed instance groups, 1/100 instance templates, 1/7 in-use IP addresses.
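The "1/0 NVIDIA P100 GPUs" entry stands out: the service reports zero P100s available to the project in us-east1. As far as I understand, the regional accelerator quota can be inspected along these lines (a sketch; adjust the region as needed):

# List regional quotas; NVIDIA_P100_GPUS shows the P100 limit and usage
gcloud compute regions describe us-east1 \
    --format="yaml(quotas)" | grep -B1 -A1 NVIDIA_P100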
I have a custom container using a flex template and have followed all the setup instructions. I have a single reserved VM as per the screenshot (an n1-standard-8 with a P100 GPU attached), and the reservation is set to be consumed automatically by matching VMs; a sketch of the reservation setup follows.
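Roughly, the reservation looks like this (a sketch; the name my-p100-reservation is hypothetical, and omitting --require-specific-reservation keeps it consumable automatically by any matching VM):

# Create a single-VM reservation matching the worker shape
gcloud compute reservations create my-p100-reservation \
    --zone=us-east1-b \
    --vm-count=1 \
    --machine-type=n1-standard-8 \
    --accelerator=count=1,type=nvidia-tesla-p100

# Confirm the reservation exists in the expected zone
gcloud compute reservations describe my-p100-reservation --zone=us-east1-b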
I launched the job as follows:
gcloud dataflow flex-template run \
us-east-p100-run \
--template-file-gcs-location=gs://my-project-pipeline/dataflow/templates/main.json \
--worker-zone=us-east1-b \
--region=us-east1 \
--worker-machine-type=n1-standard-8 \
--parameters=sdk_container_image=XXX:latest \
--additional-experiments="dataflow_service_options=automatically_use_created_reservation,worker_accelerator=type:nvidia-tesla-p100;count:1;install-nvidia-driver"
On the job dashboard I can see that both the experiment to use the reservation and the P100 accelerator setting have been passed through:
experiments: ['dataflow_service_options=automatically_use_created_reservation', 'worker_accelerator=type:nvidia-tesla-p100;count:1;install-nvidia-driver']
Other info: I am using Dataflow Runner v2 with the Apache Beam Python 3.10 SDK 2.48.0.
I have tried various permutations of zone, region, and GPU type, but the Dataflow job repeatedly tries and fails to allocate enough resources for its execution. For completeness, per-zone availability of the accelerator type can be listed like this (a sketch):
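# Check that the P100 accelerator is actually offered in the target zone
gcloud compute accelerator-types list \
    --filter="zone:us-east1-b AND name=nvidia-tesla-p100"

Any help would be greatly appreciated.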