I'm trying to launch a Gradio backend that calls Meta's Llama 2 model, but it tells me that the GPU is not available:
(.venv) reply@reply-GP66-Leopard-11UH:~/dev/chatbot-rag$ gradio gradio-chatbot.py
Warning: Cannot statically find a gradio demo called demo. Reload work may fail.
Watching: '/home/reply/.local/lib/python3.10/site-packages/gradio', '/home/reply/dev/chatbot-rag', '/home/reply/dev/chatbot-rag'
/home/reply/.local/lib/python3.10/site-packages/langchain/__init__.py:34: UserWarning: Importing PromptTemplate from langchain root module is no longer supported. Please use langchain.prompts.PromptTemplate instead.
warnings.warn(
Initializing backend for chatbot
/home/reply/.local/lib/python3.10/site-packages/transformers/pipelines/__init__.py:694: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
Traceback (most recent call last):
File "/home/reply/dev/chatbot-rag/gradio-chatbot.py", line 10, in <module>
backend.load_embeddings_and_llm_models()
File "/home/reply/dev/chatbot-rag/backend.py", line 50, in load_embeddings_and_llm_models
self.llm = self.load_llm(self.params)
File "/home/reply/dev/chatbot-rag/backend.py", line 66, in load_llm
pipe = pipeline("text-generation", model=self.llm_name_or_path, model_kwargs=model_kwargs)
File "/home/reply/.local/lib/python3.10/site-packages/transformers/pipelines/__init__.py", line 834, in pipeline
framework, model = infer_framework_load_model(
File "/home/reply/.local/lib/python3.10/site-packages/transformers/pipelines/base.py", line 269, in infer_framework_load_model
model = model_class.from_pretrained(model, **kwargs)
File "/home/reply/.local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 565, in from_pretrained
return model_class.from_pretrained(
File "/home/reply/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3222, in from_pretrained
max_memory = get_balanced_memory(
File "/home/reply/.local/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 771, in get_balanced_memory
max_memory = get_max_memory(max_memory)
File "/home/reply/.local/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 643, in get_max_memory
_ = torch.tensor([0], device=i)
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
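For context, here is a stripped-down version of what backend.load_llm (backend.py, line 66) does. The contents of model_kwargs are reconstructed from the errors further down (8-bit loading plus automatic device placement), so the exact values in my code may differ slightly:

from transformers import pipeline

# Minimal reproduction of the failing call in backend.py line 66.
# model_kwargs is inferred from the later errors: load_in_8bit from the
# ImportError, device_map from get_balanced_memory in the traceback.
llm_name_or_path = "/home/reply/dev/Llama-2-7b-chat-hf"
model_kwargs = {
    "load_in_8bit": True,
    "device_map": "auto",
}
pipe = pipeline("text-generation", model=llm_name_or_path, model_kwargs=model_kwargs)

Meanwhile, nvidia-smi looks perfectly normal: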
(.venv) reply@reply-GP66-Leopard-11UH:~/dev/chatbot-rag$ nvidia-smi
Tue Nov 7 02:17:55 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3080 ... Off | 00000000:01:00.0 Off | N/A |
| N/A 46C P8 12W / 125W | 173MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2910 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 5008 C python3 158MiB |
+---------------------------------------------------------------------------------------+
As you can see, the GPU does not appear to be over-utilized. As a side note, I'm surprised that python3 takes up so much memory.
I have the appropriate driver installed, and I can load the model from a plain Python session:
(.venv) reply@reply-GP66-Leopard-11UH:~/dev/Llama-2-7b-chat-hf$ python3
Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> model_directory = "/home/reply/dev/Llama-2-7b-chat-hf"
>>> model = AutoModelForCausalLM.from_pretrained(model_directory)
/home/reply/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:09<00:00, 4.69s/it]
>>>
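Note the CUDA warning in that session, though: the model loads, but apparently only on the CPU. Here is a small diagnostic snippet (nothing project-specific) showing what torch itself reports; the last line reproduces the same device initialization that accelerate attempts in the traceback above:

import torch

# Diagnostic only: check whether PyTorch can initialize CUDA at all.
print(torch.__version__, torch.version.cuda)  # torch build and the CUDA version it was built against
print(torch.cuda.is_available())              # False whenever the "CUDA unknown error" warning appears
print(torch.cuda.device_count())              # 0 in that case
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
    print(torch.tensor([0], device="cuda:0"))  # the same call that fails inside accelerate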
I don't know if it's related, but I have to say that installing Llama 2 was no picnic:
(.venv) reply@reply-GP66-Leopard-11UH:~/dev$ git clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
Cloning into 'Llama-2-7b-chat-hf'...
Username for 'https://huggingface.co': Mine
Password for 'https://[email protected]':
remote: Enumerating objects: 85, done.
remote: Counting objects: 100% (70/70), done.
remote: Compressing objects: 100% (70/70), done.
remote: Total 85 (delta 36), reused 0 (delta 0), pack-reused 15
Unpacking objects: 100% (85/85), 978.94 KiB | 2.11 MiB/s, done.
Username for 'https://huggingface.co': Mine
Password for 'https://[email protected]':
^Cwarning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'
(.venv) reply@reply-GP66-Leopard-11UH:~/dev$
Exiting because of "interrupt" signal.
ls
chatbot-rag codellama faradai llama Llama-2-7b-chat-hf
(.venv) reply@reply-GP66-Leopard-11UH:~/dev$ cd Llama-2-7b-chat-hf/
reply@reply-GP66-Leopard-11UH:~/dev/Llama-2-7b-chat-hf$ git lfs pull
Username for 'https://huggingface.co': Mine
Password for 'https://[email protected]':
Error updating the Git index: (4/4), 31 GB | 3.1 MB/s
error: pytorch_model-00002-of-00002.bin: cannot add to the index - missing --add option?
fatal: Unable to process path pytorch_model-00002-of-00002.bin
Errors logged to '/home/reply/dev/Llama-2-7b-chat-hf/.git/lfs/logs/20231107T015518.128081616.log'.
Use `git lfs logs last` to view the log.
reply@reply-GP66-Leopard-11UH:~/dev/Llama-2-7b-chat-hf$ git lfs logs last
git-lfs/3.4.0 (GitHub; linux amd64; go 1.20.6)
git version 2.34.1
$ git-lfs pull
Error updating the Git index:
error: pytorch_model-00002-of-00002.bin: cannot add to the index - missing --add option?
fatal: Unable to process path pytorch_model-00002-of-00002.bin
exit status 128
Current time in UTC:
2023-11-07 00:55:18
Environment:
LocalWorkingDir=/home/reply/dev/Llama-2-7b-chat-hf
LocalGitDir=/home/reply/dev/Llama-2-7b-chat-hf/.git
LocalGitStorageDir=/home/reply/dev/Llama-2-7b-chat-hf/.git
LocalMediaDir=/home/reply/dev/Llama-2-7b-chat-hf/.git/lfs/objects
LocalReferenceDirs=
TempDir=/home/reply/dev/Llama-2-7b-chat-hf/.git/lfs/tmp
ConcurrentTransfers=8
TusTransfers=false
BasicTransfersOnly=false
SkipDownloadErrors=false
FetchRecentAlways=false
FetchRecentRefsDays=7
FetchRecentCommitsDays=0
FetchRecentRefsIncludeRemotes=true
PruneOffsetDays=3
PruneVerifyRemoteAlways=false
PruneRemoteName=origin
LfsStorageDir=/home/reply/dev/Llama-2-7b-chat-hf/.git/lfs
AccessDownload=basic
AccessUpload=basic
DownloadTransfers=basic,lfs-standalone-file,ssh
UploadTransfers=basic,lfs-standalone-file,ssh
GIT_EXEC_PATH=/usr/lib/git-core
Client IP addresses:
192.168.0.15 2a01:e0a:2c1:a2b0:659c:aaa3:90e1:2ca4 2a01:e0a:2c1:a2b0:656d:2d3b:f648:31bd fe80::5a0d:fb4e:2f89:11b9
172.17.0.1
172.18.0.1
172.19.0.1
reply@reply-GP66-Leopard-11UH:~/dev/Llama-2-7b-chat-hf$ git hash-object pytorch_model-00002-of-00002.bin
fbbb6037dd5ef242b0501ae05db2710c350b325c
reply@reply-GP66-Leopard-11UH:~/dev/Llama-2-7b-chat-hf$ git update-index --add --cacheinfo 100644,fbbb6037dd5ef242b0501ae05db2710c350b325c,pytorch_model-00002-of-00002.bin
reply@reply-GP66-Leopard-11UH:~/dev/Llama-2-7b-chat-hf$ git lfs pull
(.venv) reply@reply-GP66-Leopard-11UH:~/dev/Llama-2-7b-chat-hf$ git restore --source=HEAD :/
Username for 'https://huggingface.co': fatal: could not read Username for 'https://huggingface.co': Success
Downloading model-00001-of-00002.safetensors (10 GB)
Username for 'https://huggingface.co': fatal: could not read Username for 'https://huggingface.co': Success
Error downloading object: model-00001-of-00002.safetensors (66dec18): Smudge error: Error downloading model-00001-of-00002.safetensors (66dec18c9f1705b9387d62f8485f4e7d871ca388718786737ed3c72dbfaac9fb): batch response: Git credentials for https://huggingface.co/meta-llama/Llama-2-7b-chat-hf not found.
Errors logged to '/home/reply/dev/Llama-2-7b-chat-hf/.git/lfs/logs/20231106T232024.743763803.log'.
Use `git lfs logs last` to view the log.
error: external filter 'git-lfs filter-process' failed
fatal: model-00001-of-00002.safetensors: smudge filter lfs failed
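In case that half-broken git/LFS checkout is part of the problem, I may re-download the weights with huggingface_hub instead of git. A rough sketch (the target directory is just where I keep the model, and the token is a placeholder for my own access token):

from huggingface_hub import snapshot_download

# Download the whole repository, large LFS files included, without git.
# Requires having accepted the license on the model page and a valid access token.
snapshot_download(
    repo_id="meta-llama/Llama-2-7b-chat-hf",
    local_dir="/home/reply/dev/Llama-2-7b-chat-hf",  # example target path
    token="hf_...",                                  # placeholder for my token
)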
Update
For no reason I can see other than rebooting, I now get two different errors. After a reboot, sometimes I get:
(.venv) reply@reply-GP66-Leopard-11UH:~/dev/chatbot-rag$ gradio gradio-chatbot.py
Warning: Cannot statically find a gradio demo called demo. Reload work may fail.
Watching: '/home/reply/.local/lib/python3.10/site-packages/gradio', '/home/reply/dev/chatbot-rag', '/home/reply/dev/chatbot-rag'
/home/reply/.local/lib/python3.10/site-packages/langchain/__init__.py:34: UserWarning: Importing PromptTemplate from langchain root module is no longer supported. Please use langchain.prompts.PromptTemplate instead.
warnings.warn(
Initializing backend for chatbot
/home/reply/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
/home/reply/.local/lib/python3.10/site-packages/transformers/pipelines/__init__.py:694: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
Traceback (most recent call last):
File "/home/reply/dev/chatbot-rag/gradio-chatbot.py", line 10, in <module>
backend.load_embeddings_and_llm_models()
File "/home/reply/dev/chatbot-rag/backend.py", line 50, in load_embeddings_and_llm_models
self.llm = self.load_llm(self.params)
File "/home/reply/dev/chatbot-rag/backend.py", line 66, in load_llm
pipe = pipeline("text-generation", model=self.llm_name_or_path, model_kwargs=model_kwargs)
File "/home/reply/.local/lib/python3.10/site-packages/transformers/pipelines/__init__.py", line 834, in pipeline
framework, model = infer_framework_load_model(
File "/home/reply/.local/lib/python3.10/site-packages/transformers/pipelines/base.py", line 269, in infer_framework_load_model
model = model_class.from_pretrained(model, **kwargs)
File "/home/reply/.local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 565, in from_pretrained
return model_class.from_pretrained(
File "/home/reply/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2614, in from_pretrained
raise ImportError(
ImportError: Using `load_in_8bit=True` requires Accelerate: `pip install accelerate` and the latest version of bitsandbytes `pip install -i https://test.pypi.org/simple/ bitsandbytes` or pip install bitsandbytes`
As if the driver wasn't activated at all.
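As far as I can tell both accelerate and bitsandbytes are installed (the first traceback already goes through accelerate), so here is a quick diagnostic I can run to see what is actually importable and what torch reports at that moment (just a sanity check, not project code):

import importlib.metadata as md
import torch

# Diagnostic only: list the installed versions and the CUDA state, since
# transformers may report bitsandbytes as unavailable when CUDA
# initialization fails even though the package itself is installed.
for pkg in ("torch", "transformers", "accelerate", "bitsandbytes"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "NOT INSTALLED")
print("cuda available:", torch.cuda.is_available())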
And other times I get:
(.venv) reply@reply-GP66-Leopard-11UH:~/dev/chatbot-rag$ gradio gradio-chatbot.py
Warning: Cannot statically find a gradio demo called demo. Reload work may fail.
Watching: '/home/reply/.local/lib/python3.10/site-packages/gradio', '/home/reply/dev/chatbot-rag', '/home/reply/dev/chatbot-rag'
/home/reply/.local/lib/python3.10/site-packages/langchain/__init__.py:34: UserWarning: Importing PromptTemplate from langchain root module is no longer supported. Please use langchain.prompts.PromptTemplate instead.
warnings.warn(
Initializing backend for chatbot
/home/reply/.local/lib/python3.10/site-packages/transformers/pipelines/__init__.py:694: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
Traceback (most recent call last):
File "/home/reply/dev/chatbot-rag/gradio-chatbot.py", line 10, in <module>
backend.load_embeddings_and_llm_models()
File "/home/reply/dev/chatbot-rag/backend.py", line 50, in load_embeddings_and_llm_models
self.llm = self.load_llm(self.params)
File "/home/reply/dev/chatbot-rag/backend.py", line 66, in load_llm
pipe = pipeline("text-generation", model=self.llm_name_or_path, model_kwargs=model_kwargs)
File "/home/reply/.local/lib/python3.10/site-packages/transformers/pipelines/__init__.py", line 834, in pipeline
framework, model = infer_framework_load_model(
File "/home/reply/.local/lib/python3.10/site-packages/transformers/pipelines/base.py", line 282, in infer_framework_load_model
raise ValueError(
ValueError: Could not load model /home/reply/dev/Llama-2-7b-chat-hf with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>, <class 'transformers.models.llama.modeling_llama.LlamaForCausalLM'>). See the original errors:
while loading with AutoModelForCausalLM, an error is thrown:
Traceback (most recent call last):
File "/home/reply/.local/lib/python3.10/site-packages/transformers/pipelines/base.py", line 269, in infer_framework_load_model
model = model_class.from_pretrained(model, **kwargs)
File "/home/reply/.local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 565, in from_pretrained
return model_class.from_pretrained(
File "/home/reply/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3246, in from_pretrained
raise ValueError(
ValueError:
Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
the quantized model. If you want to dispatch the model on the CPU or the disk while keeping
these modules in 32-bit, you need to set `load_in_8bit_fp32_cpu_offload=True` and pass a custom
`device_map` to `from_pretrained`. Check
https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
for more details.
while loading with LlamaForCausalLM, an error is thrown:
Traceback (most recent call last):
File "/home/reply/.local/lib/python3.10/site-packages/transformers/pipelines/base.py", line 269, in infer_framework_load_model
model = model_class.from_pretrained(model, **kwargs)
File "/home/reply/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3246, in from_pretrained
raise ValueError(
ValueError:
Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
the quantized model. If you want to dispatch the model on the CPU or the disk while keeping
these modules in 32-bit, you need to set `load_in_8bit_fp32_cpu_offload=True` and pass a custom
`device_map` to `from_pretrained`. Check
https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
for more details.
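For completeness, if that last message is taken literally, the change it suggests would look roughly like the following. I have not verified that it helps, and I believe the option is spelled llm_int8_enable_fp32_cpu_offload on BitsAndBytesConfig even though the message says load_in_8bit_fp32_cpu_offload:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Sketch of the CPU-offload setup the error message suggests (untested).
quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,  # keep the offloaded modules in fp32 on the CPU
)
model = AutoModelForCausalLM.from_pretrained(
    "/home/reply/dev/Llama-2-7b-chat-hf",
    quantization_config=quant_config,
    device_map="auto",  # or a hand-written map pinning the fp32 modules to "cpu"
)

But the underlying question remains why CUDA initialization fails, or only works intermittently, in the first place.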