Cannot find NVIDIA driver after stopping and starting a Deep Learning VM


[TL;DR] First, wait a couple of minutes and check whether the NVIDIA driver starts working. If it does not, stop and start the VM instance again.

I created a Deep Learning VM (Google Click to Deploy) with an A100 GPU. After stopping and starting the instance, I ran nvidia-smi and got the following error message:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

But if I run which nvidia-smi, I get

/usr/bin/nvidia-smi

It seems the driver is installed but cannot be used. Can someone suggest how to enable the NVIDIA driver after stopping and starting a Deep Learning VM? When I first created and opened the instance, the driver was installed automatically.

The system information (from uname -m && cat /etc/*release) is:

x86_64
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

I tried the installation script from GCP. First I ran

curl https://raw.githubusercontent.com/GoogleCloudPlatform/compute-gpu-installation/main/linux/install_gpu_driver.py --output install_gpu_driver.py

And then I ran

sudo python3 install_gpu_driver.py

which gave the following message:

Executing: which nvidia-smi
/usr/bin/nvidia-smi
Already installed.


There are 3 answers

1
zudi On

After I posted the question, the NVIDIA driver started working properly once I had waited a couple of minutes.

Over the following days, I stopped and started the VM instance multiple times. Sometimes nvidia-smi works right away; sometimes it still fails after more than 20 minutes of waiting. My current best answer is to wait several minutes first. If nvidia-smi still does not work, stop and start the instance again.
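The wait-then-give-up approach above can be sketched as a small polling function. This is only an illustration: the name wait_for_gpu, the default timeout of 300 seconds, and the 15-second interval are assumptions of this sketch, not values taken from GCP documentation.

```shell
# Poll a probe command (nvidia-smi by default) until it succeeds or a timeout expires.
# Arguments (all optional, all illustrative defaults):
#   $1 = total seconds to wait, $2 = seconds between attempts, $3 = probe command
wait_for_gpu() {
    wfg_timeout="${1:-300}"
    wfg_interval="${2:-15}"
    wfg_probe="${3:-nvidia-smi}"
    wfg_waited=0
    while ! "$wfg_probe" > /dev/null 2>&1; do
        if [ "$wfg_waited" -ge "$wfg_timeout" ]; then
            # Give up; per the answer above, the next step is a stop/start of the instance.
            echo "driver not ready after ${wfg_waited}s; try stopping and starting the instance" >&2
            return 1
        fi
        sleep "$wfg_interval"
        wfg_waited=$((wfg_waited + wfg_interval))
    done
    echo "driver ready after ${wfg_waited}s"
}
```

Calling wait_for_gpu with no arguments after boot polls nvidia-smi for up to five minutes; a non-zero exit status means the driver never came up in that window.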

1
morpheus On

I also ran into this issue. In case it helps someone, running the following command [1] fixed it for us:

$ sudo apt-get install linux-headers-`uname -r`

This was on Debian 11.
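A possible explanation, stated here as an assumption rather than something confirmed in this thread: the NVIDIA kernel module has to be built against the headers of the kernel that is actually running, so a missing headers package can leave nvidia-smi installed but unable to reach the driver. The sketch below checks for the matching headers package before installing it; the function names headers_pkg and ensure_headers are made up for this example.

```shell
# Print the headers package name for a kernel release (defaults to the running kernel).
headers_pkg() {
    echo "linux-headers-${1:-$(uname -r)}"
}

# Install the headers only if dpkg does not already report them as installed.
ensure_headers() {
    eh_pkg="$(headers_pkg "${1:-}")"
    if dpkg -s "$eh_pkg" > /dev/null 2>&1; then
        echo "$eh_pkg already installed"
    else
        sudo apt-get install -y "$eh_pkg"
    fi
}
```

Running ensure_headers and then retrying nvidia-smi (possibly after a reboot) would show whether the headers were the missing piece on a given instance.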

log

1
Rafael Toledo On

What worked for me (I am not sure it will survive the next restarts) was to remove all drivers with sudo apt remove --purge '*nvidia*', and then force the installation with sudo python3 install_gpu_driver.py.

To force the installation, first edit install_gpu_driver.py: change line 230, inside the check_driver_installed function, to return False. Then run the script.
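The line number quoted above may differ in other versions of the script, so it can be safer to locate the function by name before editing. A minimal sketch, assuming the function is still called check_driver_installed; the helper name find_driver_check is hypothetical:

```shell
# Print the line number(s) where check_driver_installed is defined.
# $1 = path to the installer script (defaults to the file downloaded in the question).
find_driver_check() {
    grep -n 'def check_driver_installed' "${1:-install_gpu_driver.py}" | cut -d: -f1
}
```

The printed number is where the function starts; the return statement to change sits somewhere in the lines that follow it.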

Anyone using Docker may then hit the error docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]] and have to reinstall Docker as well. This thread helped me.