IndexError when training Longformer model from scratch with custom tokenizer


I am attempting to train a Longformer model from scratch using this script provided by HuggingFace. When I use a pretrained tokenizer, such as the one from longformer-base-4096, everything works and I can train the model. When I use a tokenizer I trained myself, the script fails. On CPU, I get the error

  ...
  File "python3.10/site-packages/transformers/models/longformer/modeling_longformer.py", line 471, in forward
    position_embeddings = self.position_embeddings(position_ids)
  File "python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "python3.10/site-packages/torch/nn/modules/sparse.py", line 163, in forward
    return F.embedding(
  File "python3.10/site-packages/torch/nn/functional.py", line 2237, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

When running on GPU, I instead encounter CUDA assertion errors

../aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [165,0,0], thread: [28,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [165,0,0], thread: [29,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [165,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [165,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
...
  File "python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "python3.10/site-packages/torch/nn/modules/normalization.py", line 201, in forward
    return F.layer_norm(
  File "python3.10/site-packages/torch/nn/functional.py", line 2546, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: CUDA error: device-side assert triggered

I launch the script with the following arguments (irrelevant ones removed)

python run_mlm.py \
    --model_type longformer \
    --tokenizer_name tokenizer \
    --pad_to_max_length \
    --line_by_line \
    --max_seq_length 1024 \
    ...

The script I use to train the tokenizer is relatively simple. I was previously using RoBERTa instead of Longformer and had no issues with it. The input file is in JSONL format, where each line has a 'text' field containing the text to tokenize.
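For illustration, each input line looks something like this (contents made up):

{"text": "first training document ..."}
{"text": "second training document ..."}

The tokenizer training script itself is below.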

from tokenizers import ByteLevelBPETokenizer
from transformers import LongformerTokenizerFast
import json
import os

def load_data(filepath, count):
    with open(filepath, 'r') as f:
        for i in range(count):
            data = json.loads(f.readline())
            yield data['text']

input_file = 'train.txt'
count = 100_000
vocab_size = 50260
min_freq = 2
output = 'tokenizer'

data = load_data(input_file, count)

special_tokens = ['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]']

tokenizer = ByteLevelBPETokenizer(
    lowercase=False
)

tokenizer.train_from_iterator(
    data,
    vocab_size = vocab_size,
    min_frequency = min_freq,
    show_progress = True,
    special_tokens = special_tokens
)

os.makedirs(output, exist_ok=True)
tokenizer.save(os.path.join(output, 'tokenizer.json'))

tokenizer = LongformerTokenizerFast(
    tokenizer_file=os.path.join(output, 'tokenizer.json'),
)

tokenizer.save_pretrained(output)
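
After saving, I reload the tokenizer the same way I believe run_mlm.py does, via AutoTokenizer, and check which special tokens were actually registered. This is just a sanity check I run separately, not part of the training script:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('tokenizer')
print(len(tok))                             # 50265 in my case
print(tok.special_tokens_map)               # <s>, </s>, <pad>, <unk>, <mask>
print(tok.pad_token_id, tok.mask_token_id)  # ids the model will actually see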

I have tried different vocabulary sizes for the tokenizer. Initially it was 30,000, but I changed it to match the size of the longformer-base-4096 tokenizer, with no success. I have also tried different values for max_seq_length when running the training script (512, 514, 1022, and 1024), but that did not work either. If I do print(model.config.vocab_size, len(tokenizer)), I see that the model's vocabulary size matches the tokenizer's; both are 50265 in my case.
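
For completeness, this is roughly the check I run after the script has built the model and tokenizer (the sample text is just a placeholder):

print(model.config.vocab_size, len(tokenizer))            # both 50265
print(model.config.max_position_embeddings)               # needs to cover max_seq_length + 2, since Longformer offsets position ids like RoBERTa
print(model.config.pad_token_id, tokenizer.pad_token_id)  # should agree

enc = tokenizer("placeholder text", padding="max_length",
                max_length=1024, truncation=True, return_tensors="pt")
print(enc["input_ids"].max().item())                      # must be < model.config.vocab_size
print(enc["input_ids"].shape)                             # length of what actually goes into the model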

I am not sure how my custom tokenizer is causing this error. Any help would be appreciated. For reference, here are the last few lines of tokenizer_config.json

{
  ...
    "50264": {
      "content": "<mask>",
      "lstrip": true,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": true,
  "cls_token": "<s>",
  "eos_token": "</s>",
  "errors": "replace",
  "mask_token": "<mask>",
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "tokenizer_class": "LongformerTokenizer",
  "trim_offsets": true,
  "unk_token": "<unk>"
}
