Cannot install NVIDIA GPU driver 470.82.01 on Google Kubernetes Engine 1.21


I would like to run GPU nodes in a GKE cluster, which requires a driver installation DaemonSet. According to https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers, NVIDIA driver 470 is supported on the latest GKE version, 1.21.

The default DaemonSet installs driver version 450, and my node works just fine:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

However, since I need 470, I also tried deploying the latest DaemonSet:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml

Unfortunately, after applying this second manifest to the cluster, the GPU node never starts because the DaemonSet pod keeps failing and restarting.
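For anyone trying to reproduce this, the installer output can normally be read straight from the failing pod. The following is only a minimal sketch: the namespace `kube-system`, the label `k8s-app=nvidia-driver-installer`, and the container name `nvidia-driver-installer` are assumptions based on the upstream manifest and may differ in a customized deployment; `installer_logs` is a hypothetical helper name.

```shell
# Hypothetical helper: print the driver installer logs from the first matching pod.
# Namespace, label selector, and container name are assumptions taken from the
# upstream daemonset manifest; adjust them if your manifest differs.
installer_logs() {
  ns="kube-system"
  selector="k8s-app=nvidia-driver-installer"

  # Look up the name of the first pod created by the installer DaemonSet.
  pod="$(kubectl -n "$ns" get pods -l "$selector" \
    -o jsonpath='{.items[0].metadata.name}')" || return 1

  # Read the logs of the installer container in that pod.
  kubectl -n "$ns" logs "$pod" -c nvidia-driver-installer
}
```

Run `installer_logs` against the affected cluster. If the container has already restarted, adding `--previous` to the `kubectl logs` call may show the output of the failed attempt instead.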

Update:

I finally got some logs. We changed the DaemonSet config above to a Deployment and manually ran the driver installation command /cos-gpu-installer install --version=latest, which failed with the following error:

00:04.0 3D controller: NVIDIA Corporation Device 1db1 (rev a1)
I0512 14:08:30.070789    8742 installer.go:401] Getting the latest GPU driver version
I0512 14:08:30.071255    8742 utils.go:88] Downloading gpu_latest_version from https://storage.googleapis.com/cos-tools/16108.604.19/gpu_latest_version
I0512 14:08:30.173334    8742 install.go:132] Installing GPU driver version 470.82.01
I0512 14:08:30.173409    8742 cache.go:72] map[BUILD_ID:16108.604.19 DRIVER_VERSION:450.119.04]
I0512 14:08:30.173490    8742 installer.go:102] Configuring driver installation directories
I0512 14:08:30.467320    8742 signature.go:30] Downloading driver signature for version 470.82.01
I0512 14:08:30.467360    8742 utils.go:88] Downloading 470.82.01.signature.tar.gz from https://storage.googleapis.com/cos-tools/16108.604.19/extensions/gpu/470.82.01.signature.tar.gz
I0512 14:08:30.470927    8742 signature.go:37] Decompressing signature /build/sign-gpu-driver/470.82.01.signature.tar.gz
I0512 14:08:30.476162    8742 installer.go:92] Downloading GPU driver installer version 470.82.01
I0512 14:08:30.477134    8742 utils.go:88] Downloading GPU driver installer from https://storage.googleapis.com/nvidia-drivers-eu-public/nvidia-cos-project/89/tesla/470_00/470.82.01/NVIDIA-Linux-x86_64-470.82.01_89-16108-604-19.cos
I0512 14:08:32.361778    8742 utils.go:88] Downloading toolchain_env from https://storage.googleapis.com/cos-tools/16108.604.19/toolchain_env
I0512 14:08:32.371396    8742 cos.go:71] Installing the toolchain
I0512 14:08:32.371467    8742 cos.go:77] Found existing toolchain. Skipping download and installation
I0512 14:08:32.371506    8742 installer.go:288] Running GPU driver installer
I0512 14:08:40.924340    8742 installer.go:139] Extracting precompiled artifacts...
I0512 14:08:41.090412    8742 installer.go:166] Done extracting precompiled artifacts
I0512 14:08:41.090441    8742 installer.go:171] Linking drivers...
I0512 14:08:41.090512    8742 installer.go:192] Running link command: [/build/cos-tools/bin/ld.lld -T /build/cos-tools/usr/src/linux-headers-5.4.170+/scripts/module-common.lds -r -o /tmp/extract/kernel/nvidia.ko /tmp/extract/kernel/precompiled/nv-linux.o /tmp/extract/kernel/nvidia/nv-kernel.o_binary]
I0512 14:08:41.505273    8742 installer.go:203] Running link command: [/build/cos-tools/bin/ld.lld -T /build/cos-tools/usr/src/linux-headers-5.4.170+/scripts/module-common.lds -r -o /tmp/extract/kernel/nvidia-modeset.ko /tmp/extract/kernel/precompiled/nv-modeset-linux.o /tmp/extract/kernel/nvidia-modeset/nv-modeset-kernel.o_binary]
I0512 14:08:41.519532    8742 installer.go:219] Done linking drivers
I0512 14:08:41.943926    8742 modules.go:69] Loading gpu-key to secondary system keyring
I0512 14:08:41.947421    8742 modules.go:81] Successfully load key gpu-key into secondary system keyring.
I0512 14:08:41.954888    8742 installer.go:265] Installing userspace libraries...
I0512 14:08:41.954921    8742 installer.go:277] Installer arguments:
[/tmp/extract/nvidia-installer --utility-prefix=/usr/local/nvidia --opengl-prefix=/usr/local/nvidia --x-prefix=/usr/local/nvidia --install-libglvnd --no-install-compat32-libs --log-file-name=/usr/local/nvidia/nvidia-installer.log --silent --accept-license --no-kernel-module]
E0512 14:08:41.962811    8742 utils.go:355]
E0512 14:08:41.963442    8742 utils.go:355] WARNING: You specified the '--no-kernel-module' command line option, nvidia-installer will not install a kernel module as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation.  Please ensure that an NVIDIA kernel module matching this driver version is installed separately.
E0512 14:08:41.963493    8742 utils.go:355]
E0512 14:08:41.963592    8742 utils.go:355]
E0512 14:08:41.963613    8742 utils.go:355] WARNING: nvidia-installer was forced to guess the X library path '/usr/local/nvidia/lib64' and X module path '/usr/local/nvidia/lib64/xorg/modules'; these paths were not queryable from the system.  If X fails to find the NVIDIA X driver module, please install the `pkg-config` utility and the X.Org SDK/development package for your distribution and reinstall the driver.
E0512 14:08:41.963629    8742 utils.go:355]
I0512 14:08:46.400148    8742 installer.go:281] Done installing userspace libraries
I0512 14:08:46.400262    8742 cache.go:58] Updated cached version as
I0512 14:08:46.400285    8742 cache.go:60] BUILD_ID=16108.604.19
I0512 14:08:46.400292    8742 cache.go:60] DRIVER_VERSION=470.82.01
I0512 14:08:46.400329    8742 installer.go:45] Verifying GPU driver installation
E0512 14:08:46.432573    8742 install.go:276] failed to verify installation: failed to verify GPU driver installation: exit status 255
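
Most of the E-prefixed lines in this log are nvidia-installer warnings or empty messages, so the real failure is easy to miss. As a minimal sketch, assuming the glog-style prefix seen above (severity letter, then month and day), a filter like this keeps only the error lines that carry a message; `installer_errors` is a hypothetical helper name.

```shell
# Print only error/fatal lines that carry a message, skipping installer warnings.
# Reads the log on stdin; assumes glog-style prefixes such as "E0512 14:08:46.432573".
installer_errors() {
  grep -E '^[EF][0-9]{4} [^]]*\] .+' | grep -v '\] WARNING:'
}

# Example with two lines copied from the log above.
installer_errors <<'EOF'
I0512 14:08:46.400329    8742 installer.go:45] Verifying GPU driver installation
E0512 14:08:46.432573    8742 install.go:276] failed to verify installation: failed to verify GPU driver installation: exit status 255
EOF
```

On the two sample lines this prints only the final "failed to verify installation" line, which is the single real error in the whole run.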