How do you automate installation of NVIDIA drivers with a compute image VM from nvidia-ngc-public on GCP?


I am trying to use the images found here to deploy a GPU-enabled VM on GCP's Compute Engine. I have successfully created a VM from a publicly available NVIDIA image (e.g. nvidia-gpu-cloud-image-2022061 from the nvidia-ngc-public project), but the VM prompts to install drivers when it starts. So I have to SSH into the VM and answer 'y' to the prompt, after which it installs the GPU drivers.

My issue is that I need to automate this GPU driver installation process so that I can cleanly and deterministically (fixed driver version) create these images with drivers installed via CI/CD pipelines. What is the best way to achieve this automation? I would like to avoid creating my own base image and installing all the drivers/dependencies if possible.

I have created a VM with this image using the following command:

gcloud compute instances create $INSTANCE_NAME \
  --project=$PROJECT \
  --zone=$ZONE \
  --machine-type=n1-standard-16 \
  --maintenance-policy=TERMINATE \
  --network-interface=network-tier=PREMIUM,subnet=default \
  --service-account=my-service-account@$PROJECT.iam.gserviceaccount.com \
  --scopes=https://www.googleapis.com/auth/cloud-platform \
  --accelerator=count=1,type=nvidia-tesla-t4 \
  --image=nvidia-gpu-cloud-image-2022061 \
  --image-project=nvidia-ngc-public \
  --boot-disk-size=200 \
  --boot-disk-type=pd-standard \
  --no-shielded-secure-boot \
  --shielded-vtpm \
  --shielded-integrity-monitoring \
  --reservation-affinity=any \
  --no-restart-on-failure

I have then SSH'd into the VM and answered yes to the prompt.

I have then saved the image using gcloud compute images create --source-disk $INSTANCE_NAME for future use.

How can I automate this cleanly?


There are 3 answers

2
devocide On

You can use scripts to automate the installation process. To review these scripts, see the GitHub repository: https://github.com/GoogleCloudPlatform/compute-gpu-installation
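To make that concrete, here is a rough sketch of how a startup script could fetch one of those installers and run it only once per disk. The helper names (run_once, install_from_url) and the marker file path are made up for illustration, and the installer URL is deliberately left as a parameter: take the real path from the repository README, ideally pinned to a tag or commit so the driver version stays fixed. It also assumes the installer is a Python script that can be run with python3.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for a startup script; none of these names come from
# the compute-gpu-installation repository.

# Run COMMAND only if MARKER does not exist yet; create MARKER on success.
# usage: run_once MARKER COMMAND [ARGS...]
run_once() {
  local marker=$1
  shift
  if [[ -e "$marker" ]]; then
    return 0
  fi
  "$@" && touch "$marker"
}

# Download an installer script and execute it with python3.
# URL is an assumption supplied by the caller; any extra arguments are
# passed through to the installer unchanged.
# usage: install_from_url URL [INSTALLER_ARGS...]
install_from_url() {
  local url=$1 dest
  shift
  dest=$(mktemp) || return 1
  curl --fail --silent --show-error --location "$url" --output "$dest" || return 1
  python3 "$dest" "$@"
}
```

In the startup script itself you would then call something like run_once /var/lib/gpu-driver-installed install_from_url "$INSTALLER_URL", so later boots skip the installation.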

0
Manya Agarwal On

1. Choose a GPU-enabled VM image that supports NVIDIA GPUs. NVIDIA NGC provides optimized images for various frameworks, such as TensorFlow, PyTorch, etc.
2. Create a VM instance using the selected NGC image. Make sure to select the appropriate GPU type for your instance.
3. GCP allows you to specify a startup script when creating a VM instance. You can use this script to automate the installation of NVIDIA drivers and other necessary components.
4. Create a startup script that installs the NVIDIA GPU drivers. The exact commands might depend on the Linux distribution used by the NGC image.

Below is an example script for Ubuntu:

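#!/bin/bash

# Startup scripts run on every boot, so exit early (and skip the reboot)
# once the driver is already working. Without this guard the VM reboots in a loop.
if nvidia-smi > /dev/null 2>&1; then
  exit 0
fi

# Add the NVIDIA repository
sudo add-apt-repository ppa:graphics-drivers/ppa -y
sudo apt-get update

# Install the NVIDIA driver
sudo apt-get install -y nvidia-driver-<version>

# Reboot to apply changes
sudo reboot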
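Replace <version> with the appropriate version number.
When creating the VM instance, pass the script file with the --metadata-from-file startup-script=YOUR_SCRIPT_FILE flag, replacing YOUR_SCRIPT_FILE with the path to your startup script. (The --metadata startup-script=... form expects the script contents inline, not a file path.) IMAGE_PROJECT is the project that hosts the image, e.g. nvidia-ngc-public for the NGC images. GPU instances also need the maintenance policy set to TERMINATE.
Example using the gcloud command:
gcloud compute instances create INSTANCE_NAME \
  --image IMAGE_NAME \
  --image-project IMAGE_PROJECT \
  --maintenance-policy TERMINATE \
  --metadata-from-file startup-script=YOUR_SCRIPT_FILE \
  --accelerator type=nvidia-tesla-v100,count=1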

Once the VM is created, connect to it using SSH to monitor the installation progress and troubleshoot any issues. Please check the latest documentation for GCP and NGC, as well as the documentation for the specific NGC image you are using, as procedures might change over time. Additionally, make sure that you comply with any licensing agreements or terms of use for the NGC images and NVIDIA drivers.
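For a CI/CD pipeline, the manual SSH check can be replaced by polling. The following is only a sketch of how the steps from the question (create the VM, wait for the driver, save the image) might be chained together. The function names (wait_until, driver_ready, bake_gpu_image) are hypothetical, the machine type and image are copied from the question, and it assumes the startup script really does install the driver without prompting and that nvidia-smi succeeding over SSH is a good enough readiness signal.

```shell
#!/usr/bin/env bash
# Hypothetical CI helpers; the gcloud flags are standard, the function names are not.

# Retry COMMAND until it succeeds or ATTEMPTS is exhausted.
# usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_until() {
  local attempts=$1 delay=$2 i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Succeeds once nvidia-smi runs cleanly on the VM (assumed readiness signal).
# usage: driver_ready INSTANCE ZONE
driver_ready() {
  gcloud compute ssh "$1" --zone "$2" --command 'nvidia-smi' > /dev/null 2>&1
}

# Create the VM with a startup script, wait for the driver, then save an image.
# usage: bake_gpu_image INSTANCE ZONE IMAGE_NAME STARTUP_SCRIPT_PATH
bake_gpu_image() {
  local instance=$1 zone=$2 image_name=$3 startup_script=$4

  gcloud compute instances create "$instance" \
    --zone "$zone" \
    --machine-type n1-standard-16 \
    --maintenance-policy TERMINATE \
    --accelerator count=1,type=nvidia-tesla-t4 \
    --image nvidia-gpu-cloud-image-2022061 \
    --image-project nvidia-ngc-public \
    --boot-disk-size 200 \
    --metadata-from-file startup-script="$startup_script" || return 1

  # Poll for up to 30 minutes (60 attempts, 30 seconds apart).
  wait_until 60 30 driver_ready "$instance" "$zone" || return 1

  # Stop the VM so the disk is consistent, then create the image from it.
  gcloud compute instances stop "$instance" --zone "$zone" || return 1
  gcloud compute images create "$image_name" \
    --source-disk "$instance" --source-disk-zone "$zone"
}
```

A pipeline step would then call, for example, bake_gpu_image "$INSTANCE_NAME" "$ZONE" "$IMAGE_NAME" ./install-driver.sh, where install-driver.sh is whatever startup script you settled on.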

0
Ayush M Gowda On

gcloud compute instances create $INSTANCE_NAME \
  --project=$PROJECT \
  --zone=$ZONE \
  --machine-type=n1-standard-16 \
  --maintenance-policy=TERMINATE \
  --network-interface=network-tier=PREMIUM,subnet=default \
  --service-account=my-service-account@$PROJECT.iam.gserviceaccount.com \
  --scopes=https://www.googleapis.com/auth/cloud-platform \
  --accelerator=count=1,type=nvidia-tesla-t4 \
  --image=nvidia-gpu-cloud-image-2022061 \
  --image-project=nvidia-ngc-public \
  --boot-disk-size=200 \
  --boot-disk-type=pd-standard \
  --no-shielded-secure-boot \
  --shielded-vtpm \
  --shielded-integrity-monitoring \
  --reservation-affinity=any \
  --no-restart-on-failure