rapids cannot import cudf: Error at driver init: Call to cuInit results in CUDA_ERROR_NO_DEVICE (100)


To install RAPIDS, I have already installed WSL2.

But I still get the following error when importing cudf:

/home/zy-wsl/miniconda3/envs/rapids-23.12/lib/python3.10/site-packages/cudf/utils/_ptxcompiler.py:61: UserWarning: Error getting driver and runtime versions:

stdout:



stderr:

Traceback (most recent call last):
  File "/home/zy-wsl/miniconda3/envs/rapids-23.12/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 258, in ensure_initialized
    self.cuInit(0)
  File "/home/zy-wsl/miniconda3/envs/rapids-23.12/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 331, in safe_cuda_api_call
    self._check_ctypes_error(fname, retcode)
  File "/home/zy-wsl/miniconda3/envs/rapids-23.12/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 399, in _check_ctypes_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [100] Call to cuInit results in CUDA_ERROR_NO_DEVICE

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 4, in <module>
  File "/home/zy-wsl/miniconda3/envs/rapids-23.12/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 296, in __getattr__
    self.ensure_initialized()
  File "/home/zy-wsl/miniconda3/envs/rapids-23.12/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 262, in ensure_initialized
    raise CudaSupportError(f"Error at driver init: {description}")
...


Not patching Numba
  warnings.warn(msg, UserWarning)
---------------------------------------------------------------------------
CudaSupportError                          Traceback (most recent call last)
/mnt/d/learn-rapids/Untitled.ipynb Cell 4 line 1
----> 1 import cudf

File ~/miniconda3/envs/rapids-23.12/lib/python3.10/site-packages/cudf/__init__.py:26
     20 from cudf.api.extensions import (
     21     register_dataframe_accessor,
     22     register_index_accessor,
     23     register_series_accessor,
     24 )
     25 from cudf.api.types import dtype
---> 26 from cudf.core.algorithms import factorize
     27 from cudf.core.cut import cut
     28 from cudf.core.dataframe import DataFrame, from_dataframe, from_pandas, merge

File ~/miniconda3/envs/rapids-23.12/lib/python3.10/site-packages/cudf/core/algorithms.py:10
      8 from cudf.core.copy_types import BooleanMask
      9 from cudf.core.index import RangeIndex, as_index
---> 10 from cudf.core.indexed_frame import IndexedFrame
     11 from cudf.core.scalar import Scalar
     12 from cudf.options import get_option

File ~/miniconda3/envs/rapids-23.12/lib/python3.10/site-packages/cudf/core/indexed_frame.py:59
     57 from cudf.core.dtypes import ListDtype
...
    302 if USE_NV_BINDING:
    303     return self._cuda_python_wrap_fn(fname)

CudaSupportError: Error at driver init: 
Call to cuInit results in CUDA_ERROR_NO_DEVICE (100):

I tried the latest install command below:

conda create --solver=libmamba -n rapids-23.12 -c rapidsai-nightly -c conda-forge -c nvidia  \
    cudf=23.12 cuml=23.12 python=3.10 cuda-version=12.0 \
    jupyterlab
nvidia-smi reports:

+-----------------------------------------+----------------------+----------------------+
| NVIDIA-SMI 545.23.05              Driver Version: 545.84       CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               On  | 00000000:01:00.0  On |                  Off |
| 30%   53C    P3              54W / 300W |   1783MiB / 49140MiB |     10%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

cudf is also present in the conda env:

cudf                      23.12.00a       cuda12_py310_231028_g2a923dfff8_124    rapidsai-nightly
cuml                      23.12.00a       cuda12_py310_231028_gff635fc25_31    rapidsai-nightly

I also tried running numba -s in the WSL env, and found the following:

__CUDA Information__
CUDA Device Initialized                       : False
CUDA Driver Version                           : ?
CUDA Runtime Version                          : ?
CUDA NVIDIA Bindings Available                : ?
CUDA NVIDIA Bindings In Use                   : ?
CUDA Minor Version Compatibility Available    : ?
CUDA Minor Version Compatibility Needed       : ?
CUDA Minor Version Compatibility In Use       : ?
CUDA Detect Output:
None
CUDA Libraries Test Output:
None

__Warning log__
Warning (cuda): CUDA device initialisation problem. Message:Error at driver init: Call to cuInit results in CUDA_ERROR_NO_DEVICE (100)
Exception class: <class 'numba.cuda.cudadrv.error.CudaSupportError'>
Warning (no file): /sys/fs/cgroup/cpuacct/cpu.cfs_quota_us
Warning (no file): /sys/fs/cgroup/cpuacct/cpu.cfs_period_us

It seems that CUDA is not initialized in WSL, but when I run the same command in a Windows prompt, it returns:

__CUDA Information__
CUDA Device Initialized                       : True
CUDA Driver Version                           : ?
CUDA Runtime Version                          : ?
CUDA NVIDIA Bindings Available                : ?
CUDA NVIDIA Bindings In Use                   : ?
CUDA Minor Version Compatibility Available    : ?
CUDA Minor Version Compatibility Needed       : ?
CUDA Minor Version Compatibility In Use       : ?
CUDA Detect Output:
Found 1 CUDA devices
id 0     b'NVIDIA RTX A6000'                              [SUPPORTED]
                      Compute Capability: 8.6
                           PCI Device ID: 0
                              PCI Bus ID: 1
                                    UUID: GPU-17e7be94-251e-a2d9-3924-d167c0e59a56
                                Watchdog: Enabled
                            Compute Mode: WDDM
             FP32/FP64 Performance Ratio: 32
Summary:
        1/1 devices are supported

CUDA Libraries Test Output:
None
__Warning log__
Warning (cuda): Probing CUDA failed (device and driver present, runtime problem?)
(cuda) <class 'FileNotFoundError'>: Could not find module 'cudart.dll' (or one of its dependencies). Try using the full path with constructor syntax.
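
For reference, a direct way to check whether the WSL driver stub can even be loaded and initialized from Python is a small ctypes probe (a sketch; it assumes the usual WSL location /usr/lib/wsl/lib/libcuda.so.1, adjust the path if yours differs):

import ctypes

# Load the WSL driver stub directly (the path is an assumption for a standard WSL2 setup)
libcuda = ctypes.CDLL("/usr/lib/wsl/lib/libcuda.so.1")

ret = libcuda.cuInit(0)          # 0 = CUDA_SUCCESS, 100 = CUDA_ERROR_NO_DEVICE
count = ctypes.c_int(0)
if ret == 0:
    libcuda.cuDeviceGetCount(ctypes.byref(count))
print("cuInit returned", ret, "- device count:", count.value)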

There are 2 answers

ZKK (best answer)

The problem has been solved. Edit .bashrc inside the WSL instance:

sudo nano .bashrc

Insert the following:

export LD_LIBRARY_PATH="/usr/lib/wsl/lib/"  
export NUMBA_CUDA_DRIVER="/usr/lib/wsl/lib/libcuda.so.1"

And then:

source .bashrc
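
After opening a fresh shell (so the exported variables take effect), a quick way to confirm the fix is to check the driver from numba, for example:

# Quick sanity check after the .bashrc change (run in a new shell)
from numba import cuda

print(cuda.is_available())   # should now print True
cuda.detect()                # should report the RTX A6000 as a supported device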
jtromans

In case this helps anyone else, I received a similar error, numba.cuda.cudadrv.error.CudaSupportError: Error at driver init: Call to cuInit results in CUDA_ERROR_OUT_OF_MEMORY (2), with the following system configuration:

  • Host OS: Microsoft Windows 11 Pro Version 10.0.22621 Build 22621
  • Running latest NVIDIA drivers (546.33) on host
  • WSL2 with fresh install of Ubuntu 22.04.3 LTS
  • Installed Miniconda3-py310_23.11.0-2-Linux-x86_64.sh
  • Installed RAPIDS via WSL2 Conda Install (Preferred Method)
  • Specific command executed in WSL2: conda create --solver=libmamba -n rapids-23.12 -c rapidsai -c conda-forge -c nvidia rapids=23.12 python=3.10 cuda-version=12.0
  • Activated the newly created rapids-23.12 Conda environment

In my case, because I have 4 discrete GPUs, this was confusing things inside WSL.

This issue is limited to folks using WSL2 who have more than one GPU present in their set-up. I recall reading that WSL2 only supports one GPU (https://docs.rapids.ai/install#wsl2-conda : "Only single GPU is supported" and "GPU Direct Storage is not supported"), but it is not well documented that you need to help Python target the specific GPU that is supported.

To overcome this bug you need to set the CUDA_VISIBLE_DEVICES environment variable explicitly, and I would recommend doing so in ~/.bashrc by adding the line: export CUDA_VISIBLE_DEVICES=0

Note this is zero-indexed and is the ID of the GPU.
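
If you are working in a Jupyter kernel that does not pick up ~/.bashrc, the same idea can be sketched directly in Python; the mask must be set before anything initializes CUDA (i.e. before importing cudf):

import os

# Must run before the first CUDA-touching import; "0" is the GPU that WSL2 exposes
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import cudf                             # should now import without cuInit errors
print(cudf.Series([1, 2, 3]).sum())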

However, after some experimenting, I found that the RAPIDS installation on WSL2 via Conda does support multiple GPUs; in my case GPU ID 2 is what causes the error, probably because it is fully used by the host OS or something like that. Given I have 4 GPUs, if I export CUDA_VISIBLE_DEVICES=0,1,2,3 and try to import cudf in Python, it errors out as above. But if I do export CUDA_VISIBLE_DEVICES=0,1,3, everything works normally.

In fact, running numba -s recognises all 3 GPUs as 0, 1, 2, so numba appears to re-index the devices based on the GPUs the environment variable exposes. Further, when using XGBoost I can target all 3 exposed GPUs using IDs 0, 1, 2 respectively.
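
The re-numbering can be confirmed from numba as well; a sketch of what I would expect to see with the mask above:

# With CUDA_VISIBLE_DEVICES=0,1,3 exported, the three visible GPUs are re-enumerated as 0, 1, 2
from numba import cuda

cuda.detect()            # expected: "Found 3 CUDA devices" with ids 0, 1, 2
print(len(cuda.gpus))    # expected: 3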