I enabled LLAMA_CUBLAS to build against the NVIDIA CUDA Toolkit:
make LLAMA_CUBLAS=1
It compiled fine.
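In case a previous non-CUDA build left stale objects around, a clean rebuild is the safer path; a minimal sketch, assuming a standard llama.cpp source checkout using the Makefile build:

# remove objects from any earlier build, then rebuild with cuBLAS enabled
make clean
make LLAMA_CUBLAS=1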
But when I run a model and monitor memory consumption with nvidia-smi, only 75 MB gets used. See below.
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 13189.99 MB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/43 layers to GPU
llm_load_tensors: VRAM used: 0.00 MB
....................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 400.00 MB
llama_new_context_with_model: compute buffer total size = 81.13 MB
llama_new_context_with_model: VRAM scratch buffer: 75.00 MB
llama_new_context_with_model: total VRAM used: 75.00 MB (model: 0.00 MB, context: 75.00 MB)
nvidia-smi output:
Tue Oct 24 10:53:17 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.113.01 Driver Version: 535.113.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4050 ... Off | 00000000:01:00.0 Off | N/A |
| N/A 42C P8 5W / 30W | 89MiB / 6141MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1991 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+
When you run the main executable, you need to set -ngl 35. Depending on your card, that 35 can be higher or lower; it indicates how many layers should be offloaded to the GPU. Example:
./main -ngl 35 -m dolphin-2.7-mixtral-8x7b.Q4_K_M.gguf --color -c 32768 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "<|im_start|>system\n{system_message}<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant"
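If 35 layers don't fit in your VRAM, a rough way to size -ngl from the logs in the question (an estimate from that log, not a fixed rule; substitute your own model's "mem required" and layer count): the model needs about 13190 MB across 43 layers, i.e. roughly 307 MB per layer. On a 6 GB card, after leaving room for the KV cache (400 MB in the log), the scratch buffer (75 MB), and some headroom, something around -ngl 16 is a plausible starting point:

# same invocation, with a smaller offload count sized for a 6 GB card
./main -ngl 16 -m dolphin-2.7-mixtral-8x7b.Q4_K_M.gguf --color -c 32768 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "<|im_start|>system\n{system_message}<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant"

With offloading active, the llm_load_tensors lines should report a nonzero count, e.g. "offloaded 16/43 layers to GPU" with a nonzero "VRAM used", and nvidia-smi should show the corresponding memory usage.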