llama-index: multiple calls to query_engine.query always gives "Empty Response"


I have the following code, which works as expected:

from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.llms import LlamaCPP
from llama_index.llms.llama_utils import messages_to_prompt, completion_to_prompt

model_url = "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q4_0.gguf"
llm = LlamaCPP(
    model_url=model_url, temperature=0.1, max_new_tokens=256, context_window=3900,
    generate_kwargs={}, model_kwargs={"n_gpu_layers": 1},
    messages_to_prompt=messages_to_prompt, completion_to_prompt=completion_to_prompt,
    verbose=True,
)
reader = SimpleDirectoryReader(input_files=["test_invoice.pdf"])
documents = reader.load_data()
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine()
response = query_engine.query("How much is the Recomended Amount?")
print(response)

However, when I make a second call to query_engine.query, it always fails with "Empty Response", even if it is the same question:

response = query_engine.query("How much is the Recomended Amount?")
print(response)
response = query_engine.query("How much is the Recomended Amount?")
print(response)
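
One thing that might help narrow it down is checking whether the second call still retrieves source nodes, or whether it is the LLM call itself that comes back empty. Something like the sketch below (it assumes the query_engine built above; the logging setup is just standard Python, which llama-index writes its retrieval/LLM debug messages to):

import logging
import sys

# Standard library logging; llama-index logs retrieval and LLM details at DEBUG.
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)

for i in range(2):
    response = query_engine.query("How much is the Recomended Amount?")
    # source_nodes is what the retriever handed to the LLM; if it is non-empty
    # on the second call, retrieval is fine and the empty answer comes from the LLM.
    print(f"call {i}: retrieved {len(response.source_nodes)} source nodes")
    print(response)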

Is there something I am missing? Is it a bug in llama-cpp? Thanks in advance.
