Context
I pre-trained the Mistral-7B-v0.1 base model using the pretrain_chinese_llama_lora.ipynb script provided in the Chinese-LLaMA-Alpaca repository on GitHub.
I trained the base model on a text-completion task using 50 lines of text containing facts about places, persons, and historical & geographical topics.
The lines representing the facts about a single entity (a place, a person, etc.) are not contiguous. For example:
<fact #1 about New York>
<fact #1 about John Doe>
<fact #2 about John Doe>
<fact #1 about a river and geography>
<fact #2 about New York>
...
...
...
<fact #3 about New York>
My goal is to retrieve all the facts about New York with a text-completion prompt after pre-training the model on this text-completion task.
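For reference, a minimal sketch of this kind of LoRA continued pre-training setup (not the exact pretrain_chinese_llama_lora.ipynb script; the data file name facts.txt, the LoRA hyperparameters, and the output directory are illustrative assumptions):

import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # Mistral's tokenizer has no pad token by default

model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]))  # assumed LoRA settings

# facts.txt (hypothetical file) holds one fact per line, as in the example above
dataset = load_dataset("text", data_files={"train": "facts.txt"})["train"]
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
                      remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="pt_model", num_train_epochs=3,
                           per_device_train_batch_size=2, learning_rate=2e-4),
    train_dataset=dataset,
    # mlm=False -> standard causal-LM (text-completion) objective
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()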
My Observation
I see that even with diverse beam search decoding, the model is not able to retrieve all of the facts/context related to New York.
What did I try?
The snippet for the inference is as follows:
import torch

with torch.no_grad():
    # Diverse beam search: 15 beams split into 15 groups, each group penalized for repeating the others
    outputs = pt_model.generate(**model_input, max_new_tokens=100, repetition_penalty=1.15,
                                num_beams=15, num_beam_groups=15, diversity_penalty=2.0,
                                num_return_sequences=15)
model_output = tokenizer.batch_decode(outputs, skip_special_tokens=True)
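Here model_input is the tokenized completion prompt. A minimal sketch of how it can be built and how the 15 returned candidates can be inspected (the prompt text "New York" is only an illustrative placeholder, not the actual prompt):

prompt = "New York"  # placeholder prompt for illustration
model_input = tokenizer(prompt, return_tensors="pt").to(pt_model.device)

# After generate() and batch_decode() above, print each of the 15 diverse candidates
for i, text in enumerate(model_output):
    print(f"--- candidate {i} ---")
    print(text)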
Why I didn't use a database / RAG:
- I have thousands of PDFs of data, which I obviously cannot go through by hand to write DB scripts that store the facts for the entities I am interested in.
- RAG would likely surface similar (but not exactly matching) facts via similarity search, which I want to avoid.
- I want to further fine-tune this model for Q&A on my specific use case, so I need to retrieve as many facts as possible.