LLM is not loaded onto the GPU even though BLAS = 1 (LlamaCpp, LangChain, Mistral 7B GGUF model)


Disclaimer: I am not an expert in this area; I am practicing and learning as I go. I am also unsure whether this kind of model can run on this type of GPU at all.

I am trying to run a model locally on my laptop (for now, this is the only machine I have). I downloaded the model from TheBloke on Hugging Face.

Intention: I am using LangChain. I will upload some data and then have a conversation with the model about it (this is the rough idea; unfortunately, I cannot share more for privacy reasons).

What Has Worked So Far: I first used the CPU build of the llama-cpp-python library to run the model, and it worked. As expected, though, inference was slow: it took nearly 2 minutes to answer one question.

Then I tried to build with cuBLAS using the command below:

!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade llama-cpp-python

The build succeeded, and after running the program, I noticed BLAS = 1 (previously, with the CPU version, it was BLAS = 0).
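As far as I understand, BLAS = 1 only reports that the library was compiled with a BLAS backend, not that any layers were offloaded. One way to check the latter is to look at the verbose load log. The sketch below assumes the log contains a line of the form "offloaded N/M layers to GPU"; the exact wording may differ between llama.cpp versions, and the sample line is made up for illustration, not captured from my run. The names parse_offloaded_layers and OFFLOAD_PATTERN are hypothetical helpers, not part of any library.

```python
import re
from typing import Optional, Tuple

# Assumed log format; the exact wording may differ between llama.cpp versions.
OFFLOAD_PATTERN = re.compile(r"offloaded\s+(\d+)\s*/\s*(\d+)\s+layers to GPU")


def parse_offloaded_layers(log_text: str) -> Optional[Tuple[int, int]]:
    """Return (offloaded, total) from the verbose load log, or None if no such line is found."""
    match = OFFLOAD_PATTERN.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Hypothetical log excerpt, for illustration only.
sample_log = "llm_load_tensors: offloaded 1/33 layers to GPU"
print(parse_offloaded_layers(sample_log))
```

If the function returns None for the real log, that would suggest either that the build has no GPU support or that the log format assumed here does not match the installed version.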

Problem: After running the entire program, I noticed that the model was not being loaded onto my GPU while I was uploading the data I wanted to have the conversation about. I checked NVIDIA X Server Settings, which showed that my GPU memory was not being used at all, even though the terminal showed BLAS = 1. From this I concluded that BLAS = 1 does not indicate that the model is loaded onto the GPU. I am not sure what to do at this point. I searched the internet but did not find a proper fix.
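GPU memory use can also be read from a script instead of a GUI. The following is a minimal sketch that assumes nvidia-smi is installed and on the PATH; the helper names parse_memory_csv and query_gpu_memory are hypothetical. It reads used and total memory in MiB for each NVIDIA GPU.

```python
import subprocess
from typing import List, Tuple


def parse_memory_csv(csv_text: str) -> List[Tuple[int, int]]:
    """Parse 'used, total' MiB pairs, one line per GPU."""
    pairs = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        pairs.append((used, total))
    return pairs


def query_gpu_memory() -> List[Tuple[int, int]]:
    """Ask nvidia-smi for used and total memory of every visible NVIDIA GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_csv(result.stdout)
```

Calling query_gpu_memory() once before and once after constructing the LlamaCpp object should show whether the used memory grows when the model is loaded.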

Some Additional Problems: I tried setting n_batch = 256 instead of the default value of 512 to reduce the strain on my GPU, but I got the error ValueError: Requested tokens exceeded context window... So I was wondering how to handle the tradeoff between the GPU layers, context window, and batch size. The LlamaCpp GPU documentation describes these parameters as follows: [image: excerpt from the LlamaCpp GPU documentation]
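The error message mentions the context window, so it is probably related to n_ctx rather than n_batch. Here is a rough sketch of the arithmetic, assuming the prompt plus the requested completion must fit within n_ctx and that n_batch only controls how many prompt tokens are processed per step. The token counts are made-up examples, and fits_context and batches_needed are hypothetical helpers.

```python
def fits_context(prompt_tokens: int, max_new_tokens: int, n_ctx: int) -> bool:
    """True if the prompt plus the requested completion fits in the context window."""
    return prompt_tokens + max_new_tokens <= n_ctx


def batches_needed(prompt_tokens: int, n_batch: int) -> int:
    """Number of evaluation steps needed to process the prompt (ceiling division)."""
    return -(-prompt_tokens // n_batch)


# Hypothetical numbers: five retrieved chunks plus the question come to about 900 tokens.
prompt_tokens = 900
max_new_tokens = 256

print(fits_context(prompt_tokens, max_new_tokens, n_ctx=256))
print(fits_context(prompt_tokens, max_new_tokens, n_ctx=2048))
print(batches_needed(prompt_tokens, n_batch=64))
```

Under these assumptions, a smaller n_batch only increases the number of steps, while a prompt built from several retrieved documents can exceed a small n_ctx regardless of the batch size.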

The Code Snippet From My Project Where I Configure the Model:

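from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chains import ConversationalRetrievalChain
from langchain.llms import LlamaCpp
from langchain.memory import ConversationBufferMemory

# The wrapper name is illustrative; the snippet is an excerpt from a larger function.
def build_conversation_chain(vectorstore):
    language_model = LlamaCpp(
        model_path="/my/model/path/directory/sub_directory/mistral_7b_v_1/mistral-7b-v0.1.Q2_K.gguf",
        n_gpu_layers=1,  # number of model layers to offload to the GPU
        n_batch=64,      # prompt tokens processed per step
        n_ctx=256,       # context window size in tokens
        f16_kv=True,     # keep the key/value cache in half precision
        callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
        verbose=True
    )
    # Keep the running conversation so follow-up questions have context.
    memory = ConversationBufferMemory(
        memory_key="chat_history",
        return_messages=True
    )
    return ConversationalRetrievalChain.from_llm(
        llm=language_model,
        retriever=vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 5}),
        memory=memory
    )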

Hardware Details:

  1. GPU: NVIDIA GeForce RTX 3050 Laptop GPU / AMD Renoir
  2. GPU VRAM: 4 GB (3.8 GB usable)
  3. CPU: AMD Ryzen 9 5900HX with Radeon Graphics × 16
  4. Machine RAM: 16 GB
  5. Model Max RAM Required: 5.58 GB (Is this the main reason it is not running?)
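To reason about whether the model could fit in 4 GB of VRAM, a crude estimate can help. The sketch below assumes a Q2_K file of about 3.08 GB (an assumed placeholder; check the actual file on disk), 32 transformer layers, and a half-precision KV cache with 8 KV heads of dimension 128, which is my understanding of Mistral 7B. It also assumes roughly equal-sized layers, the whole KV cache on the GPU, and a guessed 512 MiB reserve for the CUDA context and scratch buffers. All function names are hypothetical.

```python
def kv_cache_bytes(n_ctx: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Size of the key and value caches for n_ctx tokens (2 tensors per layer)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_ctx


def max_gpu_layers(vram_bytes: int, model_file_bytes: int, n_ctx: int,
                   n_layers: int = 32,
                   reserve_bytes: int = 512 * 1024**2) -> int:
    """Rough upper bound on how many layers fit in VRAM under the stated assumptions."""
    per_layer = model_file_bytes / n_layers  # assumes equally sized layers
    budget = vram_bytes - reserve_bytes - kv_cache_bytes(n_ctx, n_layers=n_layers)
    if budget <= 0:
        return 0
    return min(n_layers, int(budget // per_layer))


GIB = 1024**3
# 3.8 GiB usable VRAM (from the hardware list) and an assumed 3.08 GiB model file.
print(max_gpu_layers(int(3.8 * GIB), int(3.08 * GIB), n_ctx=2048))
```

This is only an estimate and real usage will differ. As I understand it, the "max RAM required" figure on the model card assumes no GPU offloading, so it is not the amount of VRAM needed.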

Lastly: Thank you for reading this long post. I look forward to your answers. :)
