I am using LangChain with CodeLlama via llama.cpp (Hugging Face: TheBloke/CodeLlama-34B-Instruct-GPTQ). I have 4 Tesla T4 GPUs in my machine, and I installed llama.cpp with OpenBLAS. When I load the model from the GGUF file, I can see the parameter BLAS=1, and nvidia-smi shows GPU memory utilization increasing while the model loads. When I generate with CodeLlama using Llama() directly, it generates well.
But when I use PromptTemplate and LLMChain, it fails: the model does not generate meaningful results, it just produces many \n characters as output. I don't understand why. While it is running, I can see the GPU utilization, so it is using my Tesla GPUs.
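For reference, the direct call that works looks roughly like this (a minimal sketch with llama-cpp-python; the prompt here is just a placeholder):

from llama_cpp import Llama

llama = Llama(
    model_path="../../llm-models/codellama-34b-instruct.Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=100,
)
output = llama("[INST] Write a Python function that reverses a string. [/INST]", max_tokens=256)
print(output["choices"][0]["text"])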
I am using the <<SYS>> token to give some additional information to the LLM.
My code is like below:
%set_env TEMPERATURE=0.5
%set_env GPU_LAYERS=100
%set_env MODEL_PATH=../../llm-models/codellama-34b-instruct.Q4_K_M.gguf
%set_env MODEL_N_CTX=4096
%set_env TOP_P=0.95
%set_env TOP_K=40
%set_env THREADS=8
%set_env EMBEDDINGS_MODEL_NAME=all-mpnet-base-v2
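The environment variables above are read back into Python variables before building the LLM; that part is not in the snippet, but it is roughly:

import os
model_path = os.environ["MODEL_PATH"]
model_n_ctx = os.environ["MODEL_N_CTX"]   # kept as a string, converted where needed
temperature = float(os.environ["TEMPERATURE"])
top_p = float(os.environ["TOP_P"])
top_k = int(os.environ["TOP_K"])
threads = int(os.environ["THREADS"])
gpu_layers = int(os.environ["GPU_LAYERS"])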
from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
stop = ['Human:', 'Assistant:', 'User:']
llm = LlamaCpp(
    model_path=model_path,
    n_ctx=model_n_ctx,
    verbose=True,
    n_threads=threads,
    n_gpu_layers=gpu_layers,
    n_batch=int(model_n_ctx) // 8,  # integer division so n_batch is an int, not a float
    stop=stop,
    temperature=temperature,
    top_p=top_p,
    top_k=top_k,
    use_mlock=False,
    max_tokens=2000,
)
template = f"""<s>[INST] <<SYS>>
{custom_initial_prompt}""" + """
<</SYS>>
problem description:
{text}
code:
{code}[/INST]"""
prompt_template = PromptTemplate(template=template, input_variables=["text", "code"])
chain = LLMChain(llm=llm, prompt=prompt_template, verbose=True)
%%time
result = chain.run(code=script[0].page_content, text=pages[0].page_content)
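To compare with the direct call, I can also render the same template myself and pass it straight to the LLM (this should be equivalent to what chain.run sends; shown only to illustrate the comparison):

rendered = prompt_template.format(text=pages[0].page_content, code=script[0].page_content)
print(rendered)                 # the final <s>[INST] <<SYS>> ... [/INST] prompt
direct_result = llm(rendered)   # calling the LlamaCpp wrapper directly, bypassing LLMChain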
The result of this prompt is like below.
Sometimes it generates a lot of newline characters. How can I solve this problem?
I increased the number of GPUs. I tried the smaller version of the CodeLlama model (the 7B model). I tried different CUDA versions. I also tried loading the small model on my local computer, and there it works well.