Fine-tuning pretrained LLM using HuggingFace transformers throws "index out of range in self"

I am totally new to ML and learning as I go for a work project, where we are attempting to fine-tune a pretrained LLM using the company's data, which consists of magazine articles, podcast transcripts, and discussion threads. Our goal is to create a useful, custom chatbot for our online community.

It is my understanding that the Hugging Face load_dataset function (from the datasets library) can work with fairly unstructured plaintext, rather than requiring the text to be structured as a JSON object or JSONL file; however, when I pass in data of this type, I get the generic error "index out of range in self".

Below is a reduced version of the code. It runs successfully up until the trainer.train() line is executed, and then throws the error fairly quickly, after about 10 seconds.

import numpy as np
import evaluate
from datasets import load_dataset
from transformers import (AutoModel, AutoTokenizer, IntervalStrategy,
                          Trainer, TrainingArguments)

base_model = "tiiuae/falcon-7b"  # I have tried numerous models, like mpt_7b, distilbert_base_uncased, and moe, but always get the same error.
number_of_threads = 4

tokenizer = AutoTokenizer.from_pretrained(base_model, cache_dir=hugging_face_cache_dir)
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': padding_token})

train_dataset = load_dataset('text', data_files={'train': '/path/to/my/train/files',
    'test': '/path/to/my/test/files'},
    cache_dir=hugging_face_cache_dir, sample_by="paragraph")
tokenized_train_dataset = train_dataset.map(
    lambda examples: tokenizer(examples["text"], padding="max_length",
    truncation=True, return_tensors="np"),
    batched=True, num_proc=number_of_threads)

val_dataset = load_dataset('text', data_files={'validation': val_split_filename},
    cache_dir=hugging_face_cache_dir, sample_by="paragraph")
tokenized_val_dataset = val_dataset.map(
    lambda examples: tokenizer(examples["text"], padding="max_length",
    truncation=True, return_tensors="np"),
    batched=True, num_proc=number_of_threads)

train_dataset = tokenized_train_dataset['train'].shuffle(seed=42)
eval_dataset = tokenized_val_dataset['validation']
model = AutoModel.from_pretrained(base_model,
    trust_remote_code=True,
    cache_dir=hugging_face_cache_dir)
training_args = TrainingArguments(
    output_dir=FileMgr.checkpoint_batch_dir,
    evaluation_strategy=IntervalStrategy.EPOCH,
    save_strategy=IntervalStrategy.EPOCH,
    num_train_epochs=3,
    save_total_limit=2,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    logging_dir=FileMgr.checkpoint_batch_dir,
    eval_steps=500,
    load_best_model_at_end=True,
    save_steps=500,
    remove_unused_columns=True
)
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics
)
trainer.train()

Here is an example of what our txt file content looks like:

Some data on the first line.
Some data on the second line.
And this continues on and on.
We have tried putting entire magazine articles on this line, replacing newlines with [SEP].
We've also tried ensuring lines don't exceed the max seq length of a model, as explained below.

It should maybe be noted that I have my cache system pointing to a directory off of the C drive (Windows!), but I am running PyCharm as an administrator and do not appear to be having any issues reading or writing files.

Side questions: is it fine to have an entire article on one line even if it exceeds the model's sequence length? And if so, should I set sample_by to "document" instead of "paragraph"? Or is "document" more for reading a bunch of individually relevant files, rather than a conglomerate of articles like I am creating?
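For reference, here is my rough understanding of how the three sample_by modes of the datasets "text" builder split a file (the path is just the placeholder from above):

from datasets import load_dataset

# "line"      -> one example per line (the default)
# "paragraph" -> one example per blank-line-separated paragraph
# "document"  -> one example per file
for mode in ("line", "paragraph", "document"):
    ds = load_dataset("text", data_files={"train": "/path/to/my/train/files"},
                      sample_by=mode)
    print(mode, ds["train"].num_rows)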

Initially, I read that each line could be very long, such as an entire magazine article on each line of the .txt file, an entire transcript on each line, etc., and so I replaced each newline character with "[SEP]", and then accounted for this special token as below.

if tokenizer.sep_token is None:     
    tokenizer.add_special_tokens({'sep_token': '[SEP]'})
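
For reference, here is roughly what that preprocessing looks like (the raw_articles list and output path are made up for illustration). My understanding is also that, after add_special_tokens, the model's embedding matrix generally needs to be resized to match the new vocabulary size, or lookups of the new token id can fail:

# Hypothetical preprocessing: one article per line, newlines replaced by [SEP].
raw_articles = ["First paragraph.\nSecond paragraph.", "Another article.\nMore text."]
with open("/path/to/my/train/files/articles.txt", "w", encoding="utf-8") as f:
    for article in raw_articles:
        f.write(article.replace("\n", " [SEP] ") + "\n")

# Once the model has been loaded, give the new special token(s) an embedding row:
model.resize_token_embeddings(len(tokenizer))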

But then I read that the "index out of range in self" error can have to do with the training inputs being too long. So I came up with a process of first harvesting the data "as is", and then, for every unique maximum sequence length among the models we want to experiment with, creating/caching a new batch as needed so that each line stays under the maximum token length.

To rule out exceeding the maximum token length, I ran a test where each line was only 1024 CHARACTERS, which should be far less than the actual sequence limits of 512/2048/etc. TOKENS; however, after doing this I am still getting the same error.
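
For what it's worth, this is roughly how I sanity-check line lengths in tokens rather than characters (the file path is a placeholder):

# Compare each line's token count against the model's limit.
max_len = tokenizer.model_max_length  # some tokenizers report a huge placeholder value here
with open("/path/to/my/train/files/articles.txt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        n_tokens = len(tokenizer(line)["input_ids"])
        if n_tokens > max_len:
            print(f"line {i} has {n_tokens} tokens (limit {max_len})")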

I have also tried with and without the last line being blank to ensure the out of bounds error was not related, but it is not working.

I have done large tests using our entire dataset, which is about 2.15 GB spread over 53 files of 7 MB to 50 MB each; once each line is capped at the sequence length, that works out to hundreds of thousands of training inputs. Same error.

I have done small tests using just 12 files, each with only 4 lines, each line only about 1,000 characters long and containing nothing but alphanumeric characters, commas, and periods, with no [SEP] token. Same error.

I have tried using a per_device_train_batch_size and per_device_eval_batch_size of 1, 8, and 500 to ensure this was not the issue, but no luck.

In the full version of the code, I cache the tokenized datasets (as below), but when the program tries to load them on subsequent runs it fails with "An error occurred while generating the dataset". That suggests to me that, even though the dataset tokenizes without error, it is not actually in the correct format, and this is likely where the issue lies.

Saving tokenized dataset: tokenized_train_dataset.save_to_disk(tokenized_train_dataset_cache_path)

Loading tokenized dataset: tokenized_train_dataset = load_dataset(tokenized_train_dataset_cache_path)
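
(For comparison, my understanding is that a dataset written with save_to_disk is normally read back with datasets.load_from_disk rather than load_dataset; a minimal sketch with a placeholder path:)

from datasets import load_from_disk

tokenized_train_dataset.save_to_disk("/path/to/tokenized_train_cache")
reloaded = load_from_disk("/path/to/tokenized_train_cache")
print(reloaded)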

I realize that this training input won't necessarily create the desired output for a true chatbot, but we want to get this running to establish a baseline before we look into formatting our data further to include input and output labels.

It is also probably important to point out that, for testing purposes, the test and validation files are basically just placeholders for now: each file contains three sample inputs from our training data, as I am not yet sure how to format them for the kind of text training input we're working with.

I would be very grateful to anybody who can shed some light or point me in the right direction. Thank you in advance.

1 Answer

Answer from capnchat:

This YouTube video is what got me unstuck: https://www.youtube.com/watch?v=Q9zv369Ggfk

But the corresponding Google Colab project is what was most helpful: https://colab.research.google.com/drive/1IqL0ay04RwNNcn5R7HzhgBqZ2lPhHloh?usp=sharing#scrollTo=1GUD7mBRp2qH

Do not be a knucklehead like me and use code for "sequence classification" when you need the code for "text generation". Effectively, this means I should have removed all the stuff for compute_metrics, accuracy, and the test/evaluation datasets, and I ended up simply passing the training data to an AutoModelForCausalLM. A rough sketch of that setup is below. I hope this helps someone.
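This is a minimal illustration of the shape of that setup, assuming the same placeholder paths and variables as in the question, not my exact code:

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained(base_model)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as the padding token

dataset = load_dataset("text", data_files={"train": "/path/to/my/train/files"},
                       sample_by="paragraph")
tokenized = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True),
                        batched=True, remove_columns=["text"])

model = AutoModelForCausalLM.from_pretrained(base_model, trust_remote_code=True)

# The collator builds labels for causal language modeling; no compute_metrics needed.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()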

I did have to get Google Colab Pro (not Pro+) in order to use the T4 GPU; however, in the end I was not able to run my bot in GPU mode and had to run it in CPU mode.

Also, I did find out that you can in fact have .txt files where individual lines exceed the max sequence length of a model. If your model has a max sequence length of 2048, lines can far exceed that without issue, presumably because truncation=True caps them at tokenization time.
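
A quick standalone way to see that behaviour (using a small model purely for illustration):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # max sequence length 512
very_long_line = "word " * 5000  # far more tokens than the model allows

encoded = tok(very_long_line, truncation=True)
print(len(encoded["input_ids"]))  # capped at 512 thanks to truncation=True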