I'm trying to fine-tune mistralai/Mistral-7B-v0.1 using the following sample notebook. I followed the steps in the notebook, but training fails with:
```
***** Running training *****
Num examples = 344
Num Epochs = 3
Instantaneous batch size per device = 2
Total train batch size (w. parallel, distributed & accumulation) = 2
Gradient Accumulation steps = 1
Total optimization steps = 500
Number of trainable parameters = 21,260,288
0%| | 0/500 [00:00<?, ?it/s]You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
File "/usr/local/lib/python3.10/dist-packages/transformers/models/mistral/modeling_mistral.py", line 293, in forward
raise ValueError(
ValueError: Attention mask should be of size (2, 1, 512, 1024), but is torch.Size([2, 1, 512, 512])
```
Any ideas where this attention mask issue could come from? My tokenized data is exactly of size 512. Why is it expecting size 1024, and why these particular four dimensions?
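For reference, the tokenization step looks roughly like this (a simplified sketch, not the exact notebook cell; the `"text"` field name is a placeholder), so every example really is padded/truncated to 512 tokens:

```python
# Simplified sketch of the preprocessing; the dataset field name "text" is a placeholder.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer.pad_token = tokenizer.eos_token  # Mistral ships without a pad token by default

def tokenize(example):
    return tokenizer(
        example["text"],
        truncation=True,
        padding="max_length",
        max_length=512,
    )

sample = tokenize({"text": ["some training text", "another training text"]})
print(len(sample["input_ids"][0]))  # 512, which matches the (2, 1, 512, 512) mask in the error
```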
I'm experiencing the same issue; downgrading transformers to 4.35.2 instead of the latest version 4.36.0 seems to work fine.
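In a notebook, that means pinning the version with `pip install transformers==4.35.2` and restarting the runtime before re-running training. A quick sanity check that the downgrade actually took effect (just a sketch; 4.35.2 is simply the version that works for me, not an official fix):

```python
# Confirm the runtime picked up the downgraded transformers before re-running training.
import transformers

assert transformers.__version__ == "4.35.2", transformers.__version__
```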