I'm working in Google Cloud and have created a virtual machine with the following specs:
- Machine: a2-highgpu-1g
- CPU platform: Intel Cascade Lake
- GPU: 1 x NVIDIA A100 40GB
I use this machine to train and test different RNN models. It was working fine until last Friday (8 September 2023), but today my models are suddenly unable to use the GPU. If I run
torch.cuda.is_available()
the result is False. Could someone give me some hints about what might have happened since the last use to make the GPU unavailable? Thanks.
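For a bit more context than that single call, a minimal diagnostic sketch could look like the following. The helper name `gpu_report` is made up for illustration. It only assumes that PyTorch may or may not be importable in the current environment.

```python
import importlib.util


def gpu_report():
    """Collect basic facts about how PyTorch sees CUDA on this machine."""
    report = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not report["torch_installed"]:
        return report

    import torch

    report["torch_version"] = torch.__version__
    # None here would mean a CPU-only PyTorch build
    report["built_with_cuda"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

If `built_with_cuda` shows a version but `cuda_available` is False, the PyTorch build itself is probably fine and the problem is more likely on the driver side.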
Edit: I used it up to Friday, then kept the VM on over the weekend without using it. Maybe they restricted my account because I was occupying a machine without using it?
Edit 2: I noticed that the command lshw -class display returns:
*-display UNCLAIMED
description: 3D controller
product: GA100 [A100 SXM4 40GB]
vendor: NVIDIA Corporation
physical id: 4
bus info: pci@0000:00:04.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: msix pm bus_master cap_list
configuration: latency=0
resources: iomemory:200-1ff iomemory:300-2ff memory:80000000-80ffffff memory:2000000000-2fffffffff memory:3000000000-3001ffffff
Searching online, I found that "display UNCLAIMED" means that I do not have the proper driver. Is this right? Should I manually upgrade the driver on a Google Cloud VM?
Thanks again
Yes, you can try manually downloading the CUDA toolkit and following the pre-installation steps in the instructions provided. Attaching the documentation for your reference.
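Before and after reinstalling, it may help to confirm whether the driver is actually loaded. Below is a rough sketch, not an official tool. The function names `has_unclaimed_device` and `driver_status` are hypothetical. It assumes a Linux VM where the NVIDIA kernel module exposes /proc/driver/nvidia/version once loaded, and where nvidia-smi is on the PATH if the driver package is installed.

```python
import os
import shutil
import subprocess


def has_unclaimed_device(lshw_text):
    """Return True if any device header line in lshw output is marked UNCLAIMED."""
    return any(
        line.strip().startswith("*-") and "UNCLAIMED" in line
        for line in lshw_text.splitlines()
    )


def driver_status():
    """Report whether the NVIDIA kernel module and nvidia-smi look usable."""
    status = {
        # Assumption: this file exists only while the NVIDIA kernel module is loaded
        "kernel_module_loaded": os.path.exists("/proc/driver/nvidia/version"),
        "nvidia_smi_path": shutil.which("nvidia-smi"),
        "nvidia_smi_ok": False,
    }
    if status["nvidia_smi_path"]:
        try:
            result = subprocess.run(
                [status["nvidia_smi_path"]],
                capture_output=True,
                text=True,
                timeout=30,
            )
            # nvidia-smi exits non-zero when it cannot talk to the driver
            status["nvidia_smi_ok"] = result.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            pass
    return status


if __name__ == "__main__":
    print(driver_status())
```

Passing the text output of lshw -class display to `has_unclaimed_device` should return True for the output shown in the question. If `kernel_module_loaded` is still False after the reinstall and a reboot, the module most likely did not build or load for the running kernel.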