nvidia-smi version mismatch error when I try nvidia-smi


When I run nvidia-smi, I get this error:

Failed to initialize NVML: Driver/library version mismatch

But when I run nvcc --version, I get this output:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Mar_21_19:15:46_PDT_2021
Cuda compilation tools, release 11.3, V11.3.58
Build cuda_11.3.r11.3/compiler.29745058_0

Running lsmod | grep nvidia gives this output:

nvidia_uvm           1200128  0
nvidia_drm             65536  14
nvidia_modeset       1200128  11 nvidia_drm
nvidia              35483648  1148 nvidia_uvm,nvidia_modeset
drm_kms_helper        307200  1 nvidia_drm
drm                   618496  18 drm_kms_helper,nvidia,nvidia_drm

which nvidia-smi prints /usr/bin/nvidia-smi.

ps aux | grep nvidia-persistenced shows only the grep process itself:

jasur     814285  0.0  0.0  18980  2772 pts/3    S+   13:04   0:00 grep --color=auto nvidia-persistenced

I am using Ubuntu 20.04.

1 answer

Answer by JJBUP

I also encountered this issue, but I have resolved it.

First, many people resolve this problem simply by rebooting, so try that (a reboot loads the kernel module that matches the currently installed libraries). If that doesn't work, you may need to reinstall the NVIDIA driver. I am using an LXC container. Because the container shares the host's kernel, it also uses the host's NVIDIA kernel module, so if the driver packages inside the container are inadvertently upgraded, the client (user-space) version no longer matches the kernel module version. Running nvidia-smi then fails with:

Failed to initialize NVML: Driver/library version mismatch

We can get detailed information about this error by running:

dmesg | grep NVRM

The output is:

[47275.695113] NVRM: API mismatch: the client has the version 470.223.02, but
[47275.695113] NVRM: this kernel module has the version 470.182.03.  Please
[47275.695113] NVRM: make sure that this kernel module and all NVIDIA driver
[47275.695113] NVRM: components have the same version. 
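
To pull the two version numbers out of that message without reading it by eye, something like the following might help. It is only a sketch: nvrm_versions is a name I made up, and it assumes the NVRM text is worded exactly as in the output above.

```shell
#!/bin/sh
# Hypothetical helper: extract the client and kernel-module versions
# from an NVRM "API mismatch" message supplied on stdin.
# Assumes the message wording matches the dmesg output quoted above.
nvrm_versions() {
    # The message is wrapped over several lines, so join them first,
    # then capture the version after each of the two fixed phrases.
    tr '\n' ' ' | sed -n 's/.*client has the version \([0-9][0-9.]*[0-9]\).*kernel module has the version \([0-9][0-9.]*[0-9]\).*/client=\1 kernel=\2/p'
}

# Usage (reading the kernel log may require root):
#   dmesg | grep NVRM | nvrm_versions
```

The "kernel" value is the version the container has to match; the "client" value is what is currently installed inside the container.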

Therefore, the simplest solution is to install the same driver version in the container as on the host machine.

First, clean up the existing driver. Uninstall the old version:

sudo apt-get purge 'nvidia*'

Run the repair command:

sudo apt-get install -f

apt may then report packages that are no longer needed. You can remove them with:

sudo apt autoremove

I recommend manually downloading the installer for the exact version you need. Installing by package name, as in the following commands, does not guarantee a particular minor version, so you may end up with a different one and hit the same error:

apt list -a nvidia-driver-470
apt install nvidia-driver-470

Therefore, we should manually find driver version 470.182.03 (the kernel module's version) to replace 470.223.02. I downloaded NVIDIA-Linux-x86_64-470.182.03.run.
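
If you would rather not copy the version by hand, a rough sketch like this can read it from the loaded kernel module and build the installer file name. The function names are hypothetical, and I am assuming that /proc/driver/nvidia/version is visible in your container and that its first line looks like "NVRM version: NVIDIA UNIX x86_64 Kernel Module  470.182.03  ...".

```shell
#!/bin/sh
# Hypothetical helper: print the kernel module version found in text
# formatted like /proc/driver/nvidia/version (read from stdin).
# Assumption: the version is the only space-delimited dotted number
# on the "NVRM version:" line.
module_version() {
    sed -n 's/^NVRM version:.* \([0-9][0-9]*\.[0-9][0-9.]*\) .*/\1/p'
}

# Hypothetical helper: build the .run file name for a given version,
# following the naming of the file mentioned in this answer.
runfile_name() {
    echo "NVIDIA-Linux-x86_64-$1.run"
}

# Usage:
#   ver=$(module_version < /proc/driver/nvidia/version)
#   runfile_name "$ver"
```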

Now install the driver. Before doing so, make sure that no processes are using the GPU; either kill them manually or restart the host machine. Then run the installer:

sudo sh ./NVIDIA-Linux-x86_64-470.182.03.run --no-kernel-module

Because the container uses the host's kernel module, the driver inside the container does not need to install one, so we add --no-kernel-module at the end.
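
For the "no processes on the GPU" check before running the installer, nvidia-smi itself is not usable while the versions mismatch. One possible workaround, sketched here under the assumption that the GPU is exposed as /dev/nvidia* device nodes and that /proc is mounted, is to look for processes holding those nodes open. device_users is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: list PIDs that have an open file descriptor whose
# target path starts with the given prefix (default: /dev/nvidia).
# Run as root to see other users' processes; without root you will
# mostly see only your own.
device_users() {
    prefix=${1:-/dev/nvidia}
    for fd in /proc/[0-9]*/fd/*; do
        # Skip descriptors we cannot read or that vanished meanwhile.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            "$prefix"*)
                pid=${fd#/proc/}      # e.g. "1234/fd/5"
                echo "${pid%%/*}"     # keep only "1234"
                ;;
        esac
    done | sort -un
}

# Usage:
#   sudo sh -c '. ./device_users.sh; device_users'
```

If this prints nothing, no visible process has a GPU device node open. If you have fuser or lsof installed, running them on /dev/nvidia* should give similar information.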

After installing the GPU driver in the container, restart, then run nvidia-smi to check that the driver was installed successfully.
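
If nvidia-smi still complains, it can help to compare the two versions directly. This is a sketch under assumptions: the helper names are mine, I am assuming the user-space library is installed as /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.<version>, and that /sys/module/nvidia/version is readable from inside the container.

```shell
#!/bin/sh
# Hypothetical helper: extract <version> from a path such as
# .../libnvidia-ml.so.470.182.03 (prints nothing for libnvidia-ml.so.1).
lib_version() {
    basename "$1" | sed -n 's/^libnvidia-ml\.so\.\([0-9][0-9]*\.[0-9][0-9.]*\)$/\1/p'
}

# Hypothetical helper: succeed only if the user-space version ($1) is
# non-empty and equal to the kernel module version ($2).
check_match() {
    if [ -n "$1" ] && [ "$1" = "$2" ]; then
        echo "OK: user-space and kernel module are both $1"
    else
        echo "MISMATCH: user-space=$1 kernel=$2"
        return 1
    fi
}

# Usage (paths are assumptions, adjust for your system):
#   lib=$(ls /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.*.* 2>/dev/null | head -n 1)
#   check_match "$(lib_version "$lib")" "$(cat /sys/module/nvidia/version)"
```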