Confession: I am not an expert at all in this area; I am just practicing and trying to learn while working. I am also confused about whether this kind of model can run on this type of GPU at all.
I am trying to run a model locally on my laptop (for now, this is the only machine I have). I downloaded the model from TheBloke on Hugging Face.
Intention: I am using LangChain, where I will upload some data and have a conversation with the model (roughly, that is the idea; unfortunately, I cannot say more because of privacy).
Worked So Far: At first I used the llama-cpp-python (CPU) library and the model ran, but as predicted, inference was so slow that it took nearly 2 minutes to answer one question.
Then I tried to build it with cuBLAS using the command below:
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade llama-cpp-python
It worked, and after running the program I noticed BLAS = 1 (in the CPU version it was BLAS = 0).
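One caveat I found later in the llama-cpp-python README: pip may reuse a previously built wheel from the earlier CPU install, so the README recommends forcing a rebuild and skipping the cache when switching backends. I plan to retry with:

!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir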
Problem: After running the entire program, I noticed that while I was uploading the data I wanted to converse with, the model was not being loaded onto my GPU. I caught this by looking at NVIDIA X Server, which showed that no GPU memory was being consumed at all, even though the terminal was showing BLAS = 1. So apparently BLAS = 1 does not by itself indicate that the model is loaded onto the GPU. I am not sure what to do at this point; I searched the internet but did not find a proper fix.
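To take LangChain out of the picture, here is a minimal sketch of how I intend to test GPU offload with llama-cpp-python directly (the n_gpu_layers value and the exact wording of the log line are assumptions on my part; I am going by what others have reported for llama.cpp builds):

from llama_cpp import Llama

# Load the model directly, bypassing LangChain, and ask for GPU offload.
# With verbose=True the loader should print an offload summary; from what I
# have seen reported it looks like "llm_load_tensors: offloaded 20/33 layers to GPU".
llm = Llama(
    model_path="/my/model/path/directory/sub_directory/mistral_7b_v_1/mistral-7b-v0.1.Q2_K.gguf",
    n_gpu_layers=20,  # try offloading 20 layers; watch VRAM in nvidia-smi while it loads
    n_ctx=2048,
    verbose=True,     # print the load log, including the offload summary
)
out = llm("Q: Name the planets in the solar system. A:", max_tokens=32)
print(out["choices"][0]["text"])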
Some Additional Problems: I tried setting n_batch = 256 instead of the default value of 512 to reduce the strain on my GPU, but I got the error ValueError: Requested tokens exceeded context window... So I am wondering how to handle the tradeoff between GPU layers, context window, and batch size. The LlamaCpp GPU documentation describes these parameters, but I am not sure I am reading it correctly; my current understanding is sketched below.
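My current understanding of the tradeoff, written out as a sketch (the numbers are made up for illustration; please correct me if the model is wrong):

# How I believe the budget works (assumption, not from the docs verbatim):
#   n_ctx        = total window: prompt tokens + generated tokens must fit in it
#   n_batch      = how many prompt tokens are fed through per step; keep n_batch <= n_ctx
#   n_gpu_layers = how many transformer layers go to VRAM; it does not change the window
n_ctx = 2048          # a larger window so a retrieval prompt with 5 chunks fits
n_batch = 256         # a smaller batch only reduces per-step memory, not the window
max_tokens = 256      # answer length I want to allow

prompt_tokens = 1500  # hypothetical size of the question plus retrieved chunks
assert n_batch <= n_ctx
assert prompt_tokens + max_tokens <= n_ctx, "this is what raises the ValueError"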
The Code Snippet From My Project Where I Actually Configure the Model:
# Imports assumed from the classic LangChain layout I am using.
from langchain.llms import LlamaCpp
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain

def build_conversation_chain(vectorstore):  # wrapper name is mine; the snippet sits inside a function
    language_model = LlamaCpp(
        model_path="/my/model/path/directory/sub_directory/mistral_7b_v_1/mistral-7b-v0.1.Q2_K.gguf",
        n_gpu_layers=1,   # only 1 layer offloaded to the GPU
        n_batch=64,       # prompt tokens processed per step
        n_ctx=256,        # total context window (prompt + generated tokens)
        f16_kv=True,      # keep the KV cache in float16
        callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
        verbose=True,
    )
    memory = ConversationBufferMemory(
        memory_key="chat_history",
        return_messages=True,
    )
    return ConversationalRetrievalChain.from_llm(
        llm=language_model,
        retriever=vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 5}),
        memory=memory,
    )
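For completeness, the returned chain is invoked elsewhere roughly like this (variable names and the question text are placeholders of mine):

chain = build_conversation_chain(vectorstore)
result = chain({"question": "What does the uploaded document say about the topic?"})
print(result["answer"])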
Hardware Details:
- GPU: NVIDIA GeForce RTX 3050 Laptop GPU / AMD Renoir
- GPU VRAM: 4 GB (3.8 GB usable)
- CPU: AMD Ryzen 9 5900HX with Radeon Graphics × 16
- Machine RAM: 16 GB
- Model Max RAM Required: 5.58 GB (is this the main reason it is not running? See my rough arithmetic after this list.)
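Here is the rough arithmetic I mentioned above for how many layers might fit in VRAM (the 3.08 GB file size is from the model card, 32 layers is standard for a 7B Mistral, and the 1 GB headroom for the KV cache and scratch buffers is purely my guess):

# Rough per-layer VRAM estimate for choosing n_gpu_layers (all figures approximate).
file_size_gb = 3.08      # mistral-7b-v0.1.Q2_K.gguf on disk
n_layers = 32            # transformer blocks in Mistral 7B
per_layer_gb = file_size_gb / n_layers     # ~0.10 GB per offloaded layer
vram_budget_gb = 3.8 - 1.0                 # usable VRAM minus guessed headroom
print(f"~{per_layer_gb:.2f} GB/layer -> could fit ~{int(vram_budget_gb / per_layer_gb)} of {n_layers} layers")
# Any layers not offloaded should stay in system RAM, so a partial offload may still work.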
Lastly: Thank you for reading this long post. I look forward to any answers. :)