LangChain with Llama 2: slow local inference


I am using LangChain with llama-2-13B. I have set up Llama 2 on an AWS machine with 240 GB of RAM and 4x 16 GB Tesla V100 GPUs. Inference takes around 20 s, and I want to bring that down to around 8-10 s so it feels real-time. The output is also very poor: if I ask a query like "Hi, how are you?", it generates a 500-word paragraph. How can I improve the results? I am currently using this configuration:

from langchain.llms import LlamaCpp
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

llm = LlamaCpp(model_path=path,
               temperature=0.7,
               max_tokens=800,
               top_p=0.1,
               top_k=40,
               n_threads=4,
               callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
               verbose=True,
               n_ctx=2000,
               n_gpu_layers=80,
               n_batch=2048)
1 Answer

Answered by Kami:

I would start by using llama-2-13B-**chat** instead of llama-2-13B.

Chat models are fine-tuned for dialogue use cases, while the ones without the chat suffix are plain base models trained to predict the next token. So by generating a 500-word paragraph, your model is doing exactly what it was trained to do.
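In practice that is just a matter of pointing model_path at the chat-tuned weights. A rough sketch (the file name below is only a placeholder for whatever quantized chat model you have downloaded, and the lower max_tokens is my suggestion to keep answers short and fast):

# Placeholder path -- use your local llama-2-13b-chat weights here
chat_model_path = "/models/llama-2-13b-chat.Q4_K_M.gguf"

llm = LlamaCpp(model_path=chat_model_path,
               temperature=0.7,
               max_tokens=256,   # a lower cap helps keep replies closer to real time
               top_p=0.1,
               top_k=40,
               n_ctx=2000,
               n_gpu_layers=80,
               n_batch=2048)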

Also, prompting is essential for Llama models. Llama 2 chat models expect a specific prompt format built from [INST] and <<SYS>> tags (the whole turn is wrapped in the BOS/EOS tokens <s> and </s>), which would look something like this:

template = """
    [INST] <<SYS>>
    You are a helpful, respectful and honest assistant. 
    Always answer as helpfully as possible, while being safe.  
    Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. 
    Please ensure that your responses are socially unbiased and positive in nature.
    If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. 
    If you don't know the answer to a question, please don't share false information.
    <</SYS>>
    {INSERT_PROMPT_HERE} [/INST]
    """

prompt = 'Your actual question to the model'
prompt = template.replace('{INSERT_PROMPT_HERE}', prompt)
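The formatted prompt can then be sent straight through the LlamaCpp instance (llm here is the object from the configuration above; the stop token is my assumption and simply cuts generation at Llama 2's end-of-sequence marker):

# Run the chat-formatted prompt through the model and stop at the EOS token
response = llm(prompt, stop=["</s>"])
print(response)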