Context:
I have my data in multiple .txt files, and my LLM (Mistral-7B-v0.1) needs to be trained on these files for text completion purposes.
Use case:
The issue on my side is that I want to train the LLM incrementally, because I don't have all my files at the moment; I will get access to some of them only after some time.
What I will try:
Consider that I start from a base model and its base tokenizer.
Now, before I train my model on the first .txt file, I will train my tokenizer to cover the vocabulary from that file. Let's name it tokenizer-1.
Then, using this new tokenizer, I will train my base model and save it as checkpoint-1.
Next, before training the LLM on the second .txt file, I will again train tokenizer-1 to cover the vocabulary from the second .txt file. Let's name it tokenizer-2.
And now I will train checkpoint-1 further on the second .txt file to get a new model. Let's name it checkpoint-2.
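The steps above can be sketched in code. This is a minimal illustration of what growing the vocabulary between checkpoints implies for the model's embedding matrix, assuming a Hugging Face/PyTorch-style pipeline; `grow_embedding` is a hypothetical helper written here for clarity, not a library function (in `transformers` the equivalent calls would be `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`):

```python
import torch
import torch.nn as nn

def grow_embedding(old_emb: nn.Embedding, new_vocab_size: int) -> nn.Embedding:
    """Return a larger embedding table that keeps the old rows
    and randomly initializes only the newly added rows."""
    new_emb = nn.Embedding(new_vocab_size, old_emb.embedding_dim)
    with torch.no_grad():
        # Copy the trained rows for the original vocabulary.
        new_emb.weight[: old_emb.num_embeddings] = old_emb.weight
    return new_emb

# Pretend tokenizer-1 had 5 tokens and tokenizer-2 grew it to 8.
old = nn.Embedding(5, 4)
new = grow_embedding(old, 8)

# The original 5 rows are unchanged; only rows 5-7 are freshly initialized.
assert torch.equal(new.weight[:5], old.weight)
```

The key point this sketch highlights: as long as the new tokenizer only *appends* tokens (existing token IDs keep their meaning), the old checkpoint's weights stay valid and only the new rows start untrained. If the tokenizer is retrained from scratch and token IDs get reshuffled, this correspondence breaks.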
My Question:
Since checkpoint-1 was trained using tokenizer-1, and since I am now training checkpoint-1 further with tokenizer-2, don't the model weights of checkpoint-1 become irrelevant w.r.t. tokenizer-2?
PS:
I am deliberately adding llama as a keyword below to reach the wider community; this question is applicable to any LLM in general.