llama.cpp LLAMA_CUBLAS enabled, but only 75 MB / 6 GB of VRAM used when running ./main

I enabled LLAMA_CUBLAS so llama.cpp builds against the NVIDIA CUDA Toolkit:

make LLAMA_CUBLAS=1

It compiled without errors.
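
For reference, a full clean rebuild with cuBLAS (a minimal sketch, assuming the standard llama.cpp Makefile targets) would be:

make clean
make LLAMA_CUBLAS=1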

But when I run a model and monitor memory consumption with nvidia-smi, only about 75 MB of VRAM gets used. See below.

llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  = 13189.99 MB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/43 layers to GPU
llm_load_tensors: VRAM used: 0.00 MB
....................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  =  400.00 MB
llama_new_context_with_model: compute buffer total size = 81.13 MB
llama_new_context_with_model: VRAM scratch buffer: 75.00 MB
llama_new_context_with_model: total VRAM used: 75.00 MB (model: 0.00 MB, context: 75.00 MB)
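
A convenient way to watch VRAM usage continuously while the model runs (assuming a standard NVIDIA driver install; these are standard nvidia-smi options):

nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1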

nvidia-smi output:

Tue Oct 24 10:53:17 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.113.01             Driver Version: 535.113.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4050 ...    Off | 00000000:01:00.0 Off |                  N/A |
| N/A   42C    P8               5W /  30W |      89MiB /  6141MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1991      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+
1 Answer

Answered by mtasic85:

When you run the main executable, you need to pass -ngl 35. Depending on your card, that number can be higher or lower; the flag tells llama.cpp how many layers to offload to the GPU. Without it, no layers are offloaded, which is why your log shows "offloaded 0/43 layers to GPU" and 0.00 MB of VRAM used for the model.

Example:

./main -ngl 35 -m dolphin-2.7-mixtral-8x7b.Q4_K_M.gguf --color -c 32768 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "<|im_start|>system\n{system_message}<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant"
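
For the card in this question, offloading all 43 layers will not fit: the load log reports about 13190 MB of weights across 43 layers, i.e. roughly 307 MB per layer, against roughly 6 GB of VRAM. As a back-of-the-envelope estimate (leaving ~500 MB for the KV cache and scratch buffer), something around -ngl 18 may fit; lower it if you hit out-of-memory errors. A hypothetical invocation (the model path and prompt are placeholders):

./main -ngl 18 -m your-model.gguf -p "your prompt here"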