llama-cpp-python model not using nvidia gpu

I'm trying to run the model below, but it does not use the GPU and falls back to CPU compute.

The code runs in a Docker image on a RHEL node that has an NVIDIA GPU (verified; it works with other models).

Docker command:

```
docker run -it --rm -p 8888:8888 --runtime=nvidia --gpus all -v /users/jupyter/data:/data -v /users/jupyter/notebooks:/project/notebooks llama-gpu
```

Model: `llama-2-7b-chat.Q3_K_L.gguf`

Example:

```
!export FORCE_CMAKE=1
!export CMAKE_ARGS="-DLLAMA_CUBLAS=on"
!export LLAMA_CPP_LIB=/azureml-envs/tensorflow-2.12-cuda11/lib/python3.8/site-packages/llama_cpp_cuda/libllama.so
!pip install llama-cpp-python
```
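One thing worth knowing about the install cell: in a Jupyter notebook, each `!` command runs in its own throwaway subshell, so `!export FORCE_CMAKE=1` is gone before the `pip install` line runs. A sketch of setting the same variables through `os.environ` in the kernel itself, which subsequent `!pip` calls do inherit:

```python
import os

# Set the build flags in the kernel's own environment; child processes
# launched with `!` (including `!pip install`) inherit os.environ.
os.environ["FORCE_CMAKE"] = "1"
os.environ["CMAKE_ARGS"] = "-DLLAMA_CUBLAS=on"

# Then, in a notebook cell:
# !pip install llama-cpp-python --force-reinstall --no-cache-dir
```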

```python
from llama_cpp import Llama

def question_generator(context):
    # Llama-2 chat prompts are closed with [/INST]
    prompt = """[INST] <<SYS>>
    You are a helpful, respectful and honest assistant.
    Always respond as helpfully as possible, while being safe.
    Please ensure you generate the question based on the given context only
    <</SYS>>
    generate 3 questions based on the given content:-{}. [/INST]
    """.format(context)

    llm = Llama(
        model_path="llama-2-7b-chat.Q3_K_L.gguf",
        n_ctx=8192,
        n_batch=512,
        use_mlock=True,
        n_gpu_layers=248,
        n_threads=8
    )

    output = llm(prompt,
                 max_tokens=-1,
                 echo=False,
                 temperature=0.2,
                 top_p=0.1)

    return output['choices'][0]['text']
```

```python
df["questions"] = ""

for i in range(len(df)):
    # .loc avoids the chained-assignment pitfall of df["questions"].iloc[i] = ...
    df.loc[df.index[i], "questions"] = question_generator(df["text"].iloc[i])
```
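As a side note, filling the column row by row can be replaced with `Series.apply`, which writes the whole column at once. A sketch with a hypothetical stand-in for `question_generator`, just to show the pattern:

```python
import pandas as pd

# Hypothetical stand-in for question_generator, used only to show the pattern.
def stub_generator(text):
    return "questions for: " + text

df = pd.DataFrame({"text": ["first passage", "second passage"]})

# apply() assigns the whole column in one step, with no chained assignment.
df["questions"] = df["text"].apply(stub_generator)
```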

I tried the change below from other suggestions, but it still doesn't use GPU compute:

```
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
```
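One way to tell whether the reinstalled wheel was actually built with CUDA: llama.cpp's startup log (printed to stderr when `verbose=True`, the default) reports `BLAS = 1` in its system-info line for a cuBLAS build, and model load reports how many layers were offloaded. A small heuristic helper that scans captured log text for those markers (the log fragments below are illustrative examples, not real output from this setup):

```python
import re

def gpu_offload_active(log_text: str) -> bool:
    """Heuristic check of llama.cpp's startup log: the system-info line
    reports 'BLAS = 1' for a cuBLAS build, and model load reports how
    many layers were offloaded to the GPU."""
    built_with_blas = re.search(r"BLAS\s*=\s*1", log_text) is not None
    layers_offloaded = re.search(r"offloaded \d+/\d+ layers to GPU", log_text) is not None
    return built_with_blas and layers_offloaded

# Illustrative log fragments (not captured from this machine):
cpu_log = "system_info: n_threads = 8 ... BLAS = 0 ..."
gpu_log = "llm_load_tensors: offloaded 33/33 layers to GPU\nsystem_info: ... BLAS = 1 ..."
```

If the log shows `BLAS = 0`, the wheel was compiled without cuBLAS and the reinstall command never picked up the `CMAKE_ARGS`.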
