Incremental training of a large language model


Context:

My data is spread across multiple .txt files, and I need to train my LLM (Mistral-7B-v0.1) on these files for text completion.

Use case:

The problem is that I want to train the LLM incrementally, because I don't have all of my files yet. I will only get access to some of them later.

What I plan to try:

Consider that I start from a base model and its base tokenizer.

Before I train the model on the first .txt file, I will train the tokenizer to cover the vocabulary of that file. Let's call the result tokenizer-1. Using this new tokenizer, I will then train the base model and save it as checkpoint-1.

Before training the LLM on the second .txt file, I will train tokenizer-1 again so that it also covers the vocabulary of the second .txt file. Let's call the result tokenizer-2. I will then train checkpoint-1 further on the second .txt file to get a new model, which I will call checkpoint-2.
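To make the two steps concrete, here is a minimal sketch in plain Python. It is a toy stand-in, not the real Mistral tokenizer or model: the vocabularies are plain dicts, the embedding table is a list of lists, and the helper names `extend_vocab` and `resize_embeddings` are hypothetical. It assumes that "training the tokenizer again" means appending new tokens to the existing vocabulary, which is only one possible reading of the plan (in Hugging Face Transformers this would roughly correspond to `tokenizer.add_tokens` followed by `model.resize_token_embeddings`).

```python
import random


def extend_vocab(vocab, new_tokens):
    """Return a copy of vocab with unseen tokens appended at the end.

    Existing token -> id pairs are left unchanged. Assumes ids are
    contiguous, i.e. they run from 0 to len(vocab) - 1.
    """
    extended = dict(vocab)
    for tok in new_tokens:
        if tok not in extended:
            extended[tok] = len(extended)
    return extended


def resize_embeddings(embeddings, new_size, dim, rng):
    """Keep existing rows and append randomly initialised rows for new ids."""
    resized = [row[:] for row in embeddings]
    while len(resized) < new_size:
        resized.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return resized


rng = random.Random(0)
dim = 4  # toy embedding width, chosen only for illustration

# Hypothetical toy "base tokenizer" and its embedding table.
base_vocab = {"<unk>": 0, "the": 1, "model": 2}
base_emb = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in base_vocab]

# Step 1: tokenizer-1 covers the words found in the first file.
tokenizer_1 = extend_vocab(base_vocab, ["turbine", "model"])
emb_1 = resize_embeddings(base_emb, len(tokenizer_1), dim, rng)
# ... training on the first file would happen here (checkpoint-1) ...

# Step 2: tokenizer-2 also covers the words found in the second file.
tokenizer_2 = extend_vocab(tokenizer_1, ["rotor", "turbine", "stator"])
emb_2 = resize_embeddings(emb_1, len(tokenizer_2), dim, rng)
# ... training on the second file would happen here (checkpoint-2) ...
```

In this append-only variant, every token known to tokenizer-1 keeps its id in tokenizer-2, and the rows of the embedding table that belong to those ids are carried over unchanged; only the rows for the newly added tokens start from random values.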

My Question:

Since checkpoint-1 was trained with tokenizer-1, and I am now training it further with tokenizer-2, don't the model weights of checkpoint-1 become irrelevant with respect to tokenizer-2?
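One way to make this concern testable: each embedding row of a checkpoint is indexed by a token id, so the learned rows stay meaningful only for tokens whose id is the same in both tokenizers. The sketch below is an illustration under assumptions, not a definitive check. It uses hypothetical toy vocabularies as plain dicts (with a Hugging Face tokenizer, `tokenizer.get_vocab()` returns a dict of this shape), and `remapped_tokens` is a made-up helper name.

```python
def remapped_tokens(old_vocab, new_vocab):
    """Return tokens whose id changed, or that disappeared, in new_vocab.

    The result maps token -> (old_id, new_id); new_id is None when the
    token is missing from new_vocab.
    """
    changed = {}
    for tok, old_id in old_vocab.items():
        new_id = new_vocab.get(tok)
        if new_id != old_id:
            changed[tok] = (old_id, new_id)
    return changed


# Hypothetical vocabulary of tokenizer-1.
tokenizer_1 = {"<unk>": 0, "the": 1, "model": 2, "turbine": 3}

# Variant A: tokenizer-2 built by appending to tokenizer-1.
appended = {**tokenizer_1, "rotor": 4}

# Variant B: tokenizer-2 rebuilt from scratch, so ids are reassigned.
retrained = {"<unk>": 0, "model": 1, "rotor": 2, "the": 3, "turbine": 4}

print(remapped_tokens(tokenizer_1, appended))
print(remapped_tokens(tokenizer_1, retrained))
```

An empty result means every id that checkpoint-1 was trained on still refers to the same token. A non-empty result lists the tokens for which the embedding rows of checkpoint-1 would now be looked up under a different token.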

PS:

I am deliberately adding llama as a keyword below to reach the wider community; this question applies to any LLM in general.
