I am attempting to train a Longformer model from scratch using this script provided by HuggingFace. When using a pretrained tokenizer, such as the one from longformer-base-4096, I do not run into any issues and can train the model. When using a tokenizer I trained myself, however, the provided script fails. On CPU, I get the following error:
...
File "python3.10/site-packages/transformers/models/longformer/modeling_longformer.py", line 471, in forward
position_embeddings = self.position_embeddings(position_ids)
File "python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "python3.10/site-packages/torch/nn/modules/sparse.py", line 163, in forward
return F.embedding(
File "python3.10/site-packages/torch/nn/functional.py", line 2237, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
When running on GPU, I instead encounter CUDA assertion errors:
../aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [165,0,0], thread: [28,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [165,0,0], thread: [29,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [165,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [165,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
...
File "python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "python3.10/site-packages/torch/nn/modules/normalization.py", line 201, in forward
return F.layer_norm(
File "python3.10/site-packages/torch/nn/functional.py", line 2546, in layer_norm
return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: CUDA error: device-side assert triggered
I launch the script with the following arguments (irrelevant ones removed):
python run_mlm.py \
--model_type longformer \
--tokenizer_name tokenizer \
--pad_to_max_length \
--line_by_line \
--max_seq_length 1024
...
The script I use to train the tokenizer is relatively simple. I was previously using RoBERTa instead of Longformer and had no issues with it. The input file is in JSONL format, where each line has a 'text' field containing the text to tokenize.
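For illustration, a line of the input file looks roughly like this (the content here is made up):

{"text": "Some document text that should be tokenized ..."}

The training script itself: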
from tokenizers import ByteLevelBPETokenizer
from transformers import LongformerTokenizerFast
import json
import os


def load_data(filepath, count):
    # Yield the 'text' field from the first `count` lines of a JSONL file
    with open(filepath, 'r') as f:
        for i in range(count):
            data = json.loads(f.readline())
            yield data['text']


input_file = 'train.txt'
count = 100_000
vocab_size = 50260
min_freq = 2
output = 'tokenizer'

data = load_data(input_file, count)
special_tokens = ['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]']

tokenizer = ByteLevelBPETokenizer(
    lowercase=False
)
tokenizer.train_from_iterator(
    data,
    vocab_size=vocab_size,
    min_frequency=min_freq,
    show_progress=True,
    special_tokens=special_tokens
)

# Save the raw tokenizers-library tokenizer, then wrap it in a
# LongformerTokenizerFast and save it in the transformers format
os.makedirs(output, exist_ok=True)
tokenizer.save(os.path.join(output, 'tokenizer.json'))

tokenizer = LongformerTokenizerFast(
    tokenizer_file=os.path.join(output, 'tokenizer.json'),
)
tokenizer.save_pretrained(output)
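A minimal way to inspect the saved tokenizer afterwards (just a sketch; the sample sentence is made up):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('tokenizer')
print(len(tok))                                   # 50265, consistent with what I see inside run_mlm.py
print(tok.special_tokens_map)
print(tok('A short example sentence.').input_ids)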
I have tried modifying the vocab size of the tokenizer. Initially it was 30,000, but I changed it to match the size of the longformer-base-4096 tokenizer, without success. I have also experimented with the max_seq_length argument of the training script, trying 512, 514, 1022, and 1024, but that did not work either. If I print(model.config.vocab_size, len(tokenizer)), I see that the model's vocabulary size matches the tokenizer's length, both being 50265 in my case.
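For completeness, this is roughly what that check looks like (a sketch of what I believe run_mlm.py ends up doing when only --model_type is given, not the exact script code):

from transformers import AutoTokenizer, LongformerConfig, LongformerForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('tokenizer')
config = LongformerConfig()                      # default config for --model_type longformer
model = LongformerForMaskedLM(config)
model.resize_token_embeddings(len(tokenizer))    # grow the embeddings to the tokenizer size
print(model.config.vocab_size, len(tokenizer))   # prints 50265 50265 for me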
I am not sure how this error could be arising from my custom tokenizer. Any help would be appreciated. For reference, here are the last few lines of tokenizer_config.json:
{
...
"50264": {
"content": "<mask>",
"lstrip": true,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
}
},
"bos_token": "<s>",
"clean_up_tokenization_spaces": true,
"cls_token": "<s>",
"eos_token": "</s>",
"errors": "replace",
"mask_token": "<mask>",
"model_max_length": 1000000000000000019884624838656,
"pad_token": "<pad>",
"sep_token": "</s>",
"tokenizer_class": "LongformerTokenizer",
"trim_offsets": true,
"unk_token": "<unk>"
}