I'm trying to launch a Gradio backend that calls Meta's Llama 2 model, but it tells me that the GPU is not available:
(.venv) reply@reply-GP66-Leopard-11UH:~/dev/chatbot-rag$ gradio gradio-chatbot.py
Warning: Cannot statically find a gradio demo called demo. Reload work may fail.
Watching: '/home/reply/.local/lib/python3.10/site-packages/gradio', '/home/reply/dev/chatbot-rag', '/home/reply/dev/chatbot-rag'
/home/reply/.local/lib/python3.10/site-packages/langchain/__init__.py:34: UserWarning: Importing PromptTemplate from langchain root module is no longer supported. Please use langchain.prompts.PromptTemplate instead.
warnings.warn(
Initializing backend for chatbot
/home/reply/.local/lib/python3.10/site-packages/transformers/pipelines/__init__.py:694: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
Traceback (most recent call last):
File "/home/reply/dev/chatbot-rag/gradio-chatbot.py", line 10, in <module>
backend.load_embeddings_and_llm_models()
File "/home/reply/dev/chatbot-rag/backend.py", line 50, in load_embeddings_and_llm_models
self.llm = self.load_llm(self.params)
File "/home/reply/dev/chatbot-rag/backend.py", line 66, in load_llm
pipe = pipeline("text-generation", model=self.llm_name_or_path, model_kwargs=model_kwargs)
File "/home/reply/.local/lib/python3.10/site-packages/transformers/pipelines/__init__.py", line 834, in pipeline
framework, model = infer_framework_load_model(
File "/home/reply/.local/lib/python3.10/site-packages/transformers/pipelines/base.py", line 269, in infer_framework_load_model
model = model_class.from_pretrained(model, **kwargs)
File "/home/reply/.local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 565, in from_pretrained
return model_class.from_pretrained(
File "/home/reply/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3222, in from_pretrained
max_memory = get_balanced_memory(
File "/home/reply/.local/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 771, in get_balanced_memory
max_memory = get_max_memory(max_memory)
File "/home/reply/.local/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 643, in get_max_memory
_ = torch.tensor([0], device=i)
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
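For context, here is a stripped-down version of what backend.load_llm (backend.py, line 66) does. The contents of model_kwargs are reconstructed from the errors further down (8-bit loading plus automatic device placement), so the exact values in my code may differ slightly:

from transformers import pipeline

# Minimal reproduction of the failing call in backend.py line 66.
# model_kwargs is inferred from the later errors: load_in_8bit from the
# ImportError, device_map from get_balanced_memory in the traceback.
llm_name_or_path = "/home/reply/dev/Llama-2-7b-chat-hf"
model_kwargs = {
    "load_in_8bit": True,
    "device_map": "auto",
}
pipe = pipeline("text-generation", model=llm_name_or_path, model_kwargs=model_kwargs)

Meanwhile, nvidia-smi looks perfectly normal: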
(.venv) reply@reply-GP66-Leopard-11UH:~/dev/chatbot-rag$ nvidia-smi
Tue Nov 7 02:17:55 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3080 ... Off | 00000000:01:00.0 Off | N/A |
| N/A 46C P8 12W / 125W | 173MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2910 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 5008 C python3 158MiB |
+---------------------------------------------------------------------------------------+
As you can see, the GPU does not appear to be over-utilized. As a side note, I'm surprised that python3 takes up so much memory.
I have the appropriate driver installed, and I can load the model from a plain Python session:
(.venv) reply@reply-GP66-Leopard-11UH:~/dev/Llama-2-7b-chat-hf$ python3
Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> model_directory = "/home/reply/dev/Llama-2-7b-chat-hf"
>>> model = AutoModelForCausalLM.from_pretrained(model_directory)
/home/reply/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:09<00:00, 4.69s/it]
>>>
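Note the CUDA warning in that session, though: the model loads, but apparently only on the CPU. Here is a small diagnostic snippet (nothing project-specific) showing what torch itself reports; the last line reproduces the same device initialization that accelerate attempts in the traceback above:

import torch

# Diagnostic only: check whether PyTorch can initialize CUDA at all.
print(torch.__version__, torch.version.cuda)  # torch build and the CUDA version it was built against
print(torch.cuda.is_available())              # False whenever the "CUDA unknown error" warning appears
print(torch.cuda.device_count())              # 0 in that case
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
    print(torch.tensor([0], device="cuda:0"))  # the same call that fails inside accelerate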
I don't know if it's related, but I have to say that installing Llama 2 was no picnic:
(.venv) reply@reply-GP66-Leopard-11UH:~/dev$ git clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
Cloning into 'Llama-2-7b-chat-hf'...
Username for 'https://huggingface.co': Mine
Password for 'https://[email protected]':
remote: Enumerating objects: 85, done.
remote: Counting objects: 100% (70/70), done.
remote: Compressing objects: 100% (70/70), done.
remote: Total 85 (delta 36), reused 0 (delta 0), pack-reused 15
Unpacking objects: 100% (85/85), 978.94 KiB | 2.11 MiB/s, done.
Username for 'https://huggingface.co': Mine
Password for 'https://[email protected]':
^Cwarning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'
(.venv) reply@reply-GP66-Leopard-11UH:~/dev$
Exiting because of "interrupt" signal.
ls
chatbot-rag codellama faradai llama Llama-2-7b-chat-hf
(.venv) reply@reply-GP66-Leopard-11UH:~/dev$ cd Llama-2-7b-chat-hf/
reply@reply-GP66-Leopard-11UH:~/dev/Llama-2-7b-chat-hf$ git lfs pull
Username for 'https://huggingface.co': Mine
Password for 'https://[email protected]':
Error updating the Git index: (4/4), 31 GB | 3.1 MB/s
error: pytorch_model-00002-of-00002.bin: cannot add to the index - missing --add option?
fatal: Unable to process path pytorch_model-00002-of-00002.bin
Errors logged to '/home/reply/dev/Llama-2-7b-chat-hf/.git/lfs/logs/20231107T015518.128081616.log'.
Use `git lfs logs last` to view the log.
reply@reply-GP66-Leopard-11UH:~/dev/Llama-2-7b-chat-hf$ git lfs logs last
git-lfs/3.4.0 (GitHub; linux amd64; go 1.20.6)
git version 2.34.1
$ git-lfs pull
Error updating the Git index:
error: pytorch_model-00002-of-00002.bin: cannot add to the index - missing --add option?
fatal: Unable to process path pytorch_model-00002-of-00002.bin
exit status 128
Current time in UTC:
2023-11-07 00:55:18
Environment:
LocalWorkingDir=/home/reply/dev/Llama-2-7b-chat-hf
LocalGitDir=/home/reply/dev/Llama-2-7b-chat-hf/.git
LocalGitStorageDir=/home/reply/dev/Llama-2-7b-chat-hf/.git
LocalMediaDir=/home/reply/dev/Llama-2-7b-chat-hf/.git/lfs/objects
LocalReferenceDirs=
TempDir=/home/reply/dev/Llama-2-7b-chat-hf/.git/lfs/tmp
ConcurrentTransfers=8
TusTransfers=false
BasicTransfersOnly=false
SkipDownloadErrors=false
FetchRecentAlways=false
FetchRecentRefsDays=7
FetchRecentCommitsDays=0
FetchRecentRefsIncludeRemotes=true
PruneOffsetDays=3
PruneVerifyRemoteAlways=false
PruneRemoteName=origin
LfsStorageDir=/home/reply/dev/Llama-2-7b-chat-hf/.git/lfs
AccessDownload=basic
AccessUpload=basic
DownloadTransfers=basic,lfs-standalone-file,ssh
UploadTransfers=basic,lfs-standalone-file,ssh
GIT_EXEC_PATH=/usr/lib/git-core
Client IP addresses:
192.168.0.15 2a01:e0a:2c1:a2b0:659c:aaa3:90e1:2ca4 2a01:e0a:2c1:a2b0:656d:2d3b:f648:31bd fe80::5a0d:fb4e:2f89:11b9
172.17.0.1
172.18.0.1
172.19.0.1
reply@reply-GP66-Leopard-11UH:~/dev/Llama-2-7b-chat-hf$ git hash-object pytorch_model-00002-of-00002.bin
fbbb6037dd5ef242b0501ae05db2710c350b325c
reply@reply-GP66-Leopard-11UH:~/dev/Llama-2-7b-chat-hf$ git update-index --add --cacheinfo 100644,fbbb6037dd5ef242b0501ae05db2710c350b325c,pytorch_model-00002-of-00002.bin
reply@reply-GP66-Leopard-11UH:~/dev/Llama-2-7b-chat-hf$ git lfs pull
(.venv) reply@reply-GP66-Leopard-11UH:~/dev/Llama-2-7b-chat-hf$ git restore --source=HEAD :/
Username for 'https://huggingface.co': fatal: could not read Username for 'https://huggingface.co': Success
Downloading model-00001-of-00002.safetensors (10 GB)
Username for 'https://huggingface.co': fatal: could not read Username for 'https://huggingface.co': Success
Error downloading object: model-00001-of-00002.safetensors (66dec18): Smudge error: Error downloading model-00001-of-00002.safetensors (66dec18c9f1705b9387d62f8485f4e7d871ca388718786737ed3c72dbfaac9fb): batch response: Git credentials for https://huggingface.co/meta-llama/Llama-2-7b-chat-hf not found.
Errors logged to '/home/reply/dev/Llama-2-7b-chat-hf/.git/lfs/logs/20231106T232024.743763803.log'.
Use `git lfs logs last` to view the log.
error: external filter 'git-lfs filter-process' failed
fatal: model-00001-of-00002.safetensors: smudge filter lfs failed
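In case that half-broken git/LFS checkout is part of the problem, I may re-download the weights with huggingface_hub instead of git. A rough sketch (the target directory is just where I keep the model, and the token is a placeholder for my own access token):

from huggingface_hub import snapshot_download

# Download the whole repository, large LFS files included, without git.
# Requires having accepted the license on the model page and a valid access token.
snapshot_download(
    repo_id="meta-llama/Llama-2-7b-chat-hf",
    local_dir="/home/reply/dev/Llama-2-7b-chat-hf",  # example target path
    token="hf_...",                                  # placeholder for my token
)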
Update
For no reason I can see other than rebooting, I now get two different errors. After a reboot, sometimes I get:
(.venv) reply@reply-GP66-Leopard-11UH:~/dev/chatbot-rag$ gradio gradio-chatbot.py
Warning: Cannot statically find a gradio demo called demo. Reload work may fail.
Watching: '/home/reply/.local/lib/python3.10/site-packages/gradio', '/home/reply/dev/chatbot-rag', '/home/reply/dev/chatbot-rag'
/home/reply/.local/lib/python3.10/site-packages/langchain/__init__.py:34: UserWarning: Importing PromptTemplate from langchain root module is no longer supported. Please use langchain.prompts.PromptTemplate instead.
warnings.warn(
Initializing backend for chatbot
/home/reply/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
/home/reply/.local/lib/python3.10/site-packages/transformers/pipelines/__init__.py:694: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
Traceback (most recent call last):
File "/home/reply/dev/chatbot-rag/gradio-chatbot.py", line 10, in <module>
backend.load_embeddings_and_llm_models()
File "/home/reply/dev/chatbot-rag/backend.py", line 50, in load_embeddings_and_llm_models
self.llm = self.load_llm(self.params)
File "/home/reply/dev/chatbot-rag/backend.py", line 66, in load_llm
pipe = pipeline("text-generation", model=self.llm_name_or_path, model_kwargs=model_kwargs)
File "/home/reply/.local/lib/python3.10/site-packages/transformers/pipelines/__init__.py", line 834, in pipeline
framework, model = infer_framework_load_model(
File "/home/reply/.local/lib/python3.10/site-packages/transformers/pipelines/base.py", line 269, in infer_framework_load_model
model = model_class.from_pretrained(model, **kwargs)
File "/home/reply/.local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 565, in from_pretrained
return model_class.from_pretrained(
File "/home/reply/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2614, in from_pretrained
raise ImportError(
ImportError: Using `load_in_8bit=True` requires Accelerate: `pip install accelerate` and the latest version of bitsandbytes `pip install -i https://test.pypi.org/simple/ bitsandbytes` or pip install bitsandbytes`
As if the driver wasn't activated at all.
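As far as I can tell both accelerate and bitsandbytes are installed (the first traceback already goes through accelerate), so here is a quick diagnostic I can run to see what is actually importable and what torch reports at that moment (just a sanity check, not project code):

import importlib.metadata as md
import torch

# Diagnostic only: list the installed versions and the CUDA state, since
# transformers may report bitsandbytes as unavailable when CUDA
# initialization fails even though the package itself is installed.
for pkg in ("torch", "transformers", "accelerate", "bitsandbytes"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "NOT INSTALLED")
print("cuda available:", torch.cuda.is_available())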
And other times I get:
(.venv) reply@reply-GP66-Leopard-11UH:~/dev/chatbot-rag$ gradio gradio-chatbot.py
Warning: Cannot statically find a gradio demo called demo. Reload work may fail.
Watching: '/home/reply/.local/lib/python3.10/site-packages/gradio', '/home/reply/dev/chatbot-rag', '/home/reply/dev/chatbot-rag'
/home/reply/.local/lib/python3.10/site-packages/langchain/__init__.py:34: UserWarning: Importing PromptTemplate from langchain root module is no longer supported. Please use langchain.prompts.PromptTemplate instead.
warnings.warn(
Initializing backend for chatbot
/home/reply/.local/lib/python3.10/site-packages/transformers/pipelines/__init__.py:694: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
Traceback (most recent call last):
File "/home/reply/dev/chatbot-rag/gradio-chatbot.py", line 10, in <module>
backend.load_embeddings_and_llm_models()
File "/home/reply/dev/chatbot-rag/backend.py", line 50, in load_embeddings_and_llm_models
self.llm = self.load_llm(self.params)
File "/home/reply/dev/chatbot-rag/backend.py", line 66, in load_llm
pipe = pipeline("text-generation", model=self.llm_name_or_path, model_kwargs=model_kwargs)
File "/home/reply/.local/lib/python3.10/site-packages/transformers/pipelines/__init__.py", line 834, in pipeline
framework, model = infer_framework_load_model(
File "/home/reply/.local/lib/python3.10/site-packages/transformers/pipelines/base.py", line 282, in infer_framework_load_model
raise ValueError(
ValueError: Could not load model /home/reply/dev/Llama-2-7b-chat-hf with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>, <class 'transformers.models.llama.modeling_llama.LlamaForCausalLM'>). See the original errors:
while loading with AutoModelForCausalLM, an error is thrown:
Traceback (most recent call last):
File "/home/reply/.local/lib/python3.10/site-packages/transformers/pipelines/base.py", line 269, in infer_framework_load_model
model = model_class.from_pretrained(model, **kwargs)
File "/home/reply/.local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 565, in from_pretrained
return model_class.from_pretrained(
File "/home/reply/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3246, in from_pretrained
raise ValueError(
ValueError:
Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
the quantized model. If you want to dispatch the model on the CPU or the disk while keeping
these modules in 32-bit, you need to set `load_in_8bit_fp32_cpu_offload=True` and pass a custom
`device_map` to `from_pretrained`. Check
https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
for more details.
while loading with LlamaForCausalLM, an error is thrown:
Traceback (most recent call last):
File "/home/reply/.local/lib/python3.10/site-packages/transformers/pipelines/base.py", line 269, in infer_framework_load_model
model = model_class.from_pretrained(model, **kwargs)
File "/home/reply/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3246, in from_pretrained
raise ValueError(
ValueError:
Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
the quantized model. If you want to dispatch the model on the CPU or the disk while keeping
these modules in 32-bit, you need to set `load_in_8bit_fp32_cpu_offload=True` and pass a custom
`device_map` to `from_pretrained`. Check
https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
for more details.
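For completeness, if that last message is taken literally, the change it suggests would look roughly like the following. I have not verified that it helps, and I believe the option is spelled llm_int8_enable_fp32_cpu_offload on BitsAndBytesConfig even though the message says load_in_8bit_fp32_cpu_offload:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Sketch of the CPU-offload setup the error message suggests (untested).
quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,  # keep the offloaded modules in fp32 on the CPU
)
model = AutoModelForCausalLM.from_pretrained(
    "/home/reply/dev/Llama-2-7b-chat-hf",
    quantization_config=quant_config,
    device_map="auto",  # or a hand-written map pinning the fp32 modules to "cpu"
)

But the underlying question remains why CUDA initialization fails, or only works intermittently, in the first place.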