I am using LangChain with llama-2-13B. I have set up llama2 on an AWS machine with 240GB RAM and 4x 16GB Tesla V100 GPUs. It takes around 20s to generate a response, and I want to bring that down to around 8-10s so it feels real-time. The output quality is also very poor: if I ask a simple query like "Hi, how are you?", it generates a 500-word paragraph. How can I improve the output results? I am currently using this configuration:
from langchain.llms import LlamaCpp
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

llm = LlamaCpp(
    model_path=path,
    temperature=0.7,
    max_tokens=800,
    top_p=0.1,
    top_k=40,
    n_threads=4,
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
    verbose=True,
    n_ctx=2000,          # context window size
    n_gpu_layers=80,     # layers offloaded to the GPUs
    n_batch=2048,
)
I would start by using llama-2-13B-**chat** instead of llama-2-13B.
Chat models are fine-tuned for dialogue use cases, while the base models (the ones without the chat suffix) are only trained to predict the next token. So when your model answers a greeting with a 500-word paragraph, it is doing exactly what it was trained to do.
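The change is just a matter of pointing model_path at the chat checkpoint; as a rough sketch (the filename below is a placeholder for whatever quantized chat file you actually downloaded, everything else is your config unchanged):

from langchain.llms import LlamaCpp

# Placeholder path: substitute the quantized llama-2-13b-chat file you have.
chat_model_path = "/models/llama-2-13b-chat.Q4_K_M.gguf"

llm = LlamaCpp(
    model_path=chat_model_path,
    temperature=0.7,
    max_tokens=800,
    top_p=0.1,
    top_k=40,
    n_threads=4,
    verbose=True,
    n_ctx=2000,
    n_gpu_layers=80,
    n_batch=2048,
)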
Also, prompting is essential for LLaMa models. You can use Beginning of Sequence (BOS) and End of Sequence (EOS) tokens, which would look something like this:
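Here is a minimal sketch, assuming the standard Llama-2 chat template with `<s>` as the BOS token and `</s>` as the EOS token; the system prompt text is just an example, and `llm` is the LlamaCpp instance from above:

# Llama-2 chat prompt format: BOS token, [INST]/[/INST] around the turn,
# and an optional <<SYS>> block for the system prompt.
# Note: llama.cpp often prepends the BOS token itself, so you may be able to omit <s>.
template = """<s>[INST] <<SYS>>
You are a helpful assistant. Answer briefly and directly.
<</SYS>>

{question} [/INST]"""

prompt = template.format(question="Hi, how are you?")
print(llm(prompt))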