What is the right input / shape for training a pretrained RoBERTa?


Right now I am trying to train/fine-tune a pretrained RoBERTa model with a multiple-choice head, but I am having difficulty finding the right input format so that the model can actually train/fine-tune.

The dataframe I have right now has one column per answer option: OptionA, OptionB, and OptionC (the original post showed a screenshot of it).
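For reference, a minimal stand-in for that dataframe; only the three Option column names are confirmed by the tokenization code below, and the example sentences are invented:

import pandas as pd

# Hypothetical reconstruction of the screenshot: the OptionA/OptionB/OptionC
# column names come from the tokenization loop below; the sentences are made up.
train_data = pd.DataFrame({
    "OptionA": ["He went to the store.", "She opened the window."],
    "OptionB": ["He stayed home.", "She closed the door."],
    "OptionC": ["He called a friend.", "She turned on the light."],
})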

The three option columns contain tokenized sentences, produced using:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
for col in ["OptionA", "OptionB", "OptionC"]:
    # tokenizer.encode turns each sentence into a list of token ids
    train_data[col] = train_data[col].apply(tokenizer.encode)

My evaluation set has the same structure, with the training set having 6500 rows and the evaluation set having 1500 rows. I am trying to implement the training with:

from transformers import RobertaForMultipleChoice, Trainer, TrainingArguments
model = RobertaForMultipleChoice.from_pretrained('roberta-base')

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total # of training epochs
    per_device_train_batch_size=32,  # batch size per device during training
    per_device_eval_batch_size=32,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
)

trainer = Trainer(
    model=model,                     # the instantiated Transformers model to be trained
    args=training_args,              # training arguments, defined above
    train_dataset=train_split,       # training dataset
    eval_dataset=eval_split,         # evaluation dataset
)

trainer.train()

But I keep getting various KeyErrors, for example:

KeyError: 2526

If anyone knows what I am doing wrong, I would be very grateful, as I have been stuck trying to train this model for the past three days.

1 Answer

sastaengineer:
The RoBERTa model's position-embedding table has 514 entries, but its maximum input sequence length is 512 tokens, and that limit includes the two special tokens <s> and </s> (RoBERTa's equivalents of BERT's [CLS] and [SEP]). You need to set the truncation parameter to truncate any text beyond that limit, and enable padding for any input shorter than 512 tokens. I describe how to do this in my other answer: https://stackoverflow.com/a/76318049/21949232
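A minimal sketch of that tokenizer call; the sentence is made up, but truncation=True, padding='max_length', and max_length are standard tokenizer arguments:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

# Truncate anything past max_length and pad anything shorter up to it,
# so every encoded sequence ends up exactly 512 tokens long.
encoding = tokenizer(
    "Some example sentence to encode.",
    truncation=True,
    padding='max_length',
    max_length=512,
)
print(len(encoding['input_ids']))  # 512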

Below is the general input structure for the RoBERTa model:

RoBERTa: <s> + tokens + </s> + padding
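For the multiple-choice head specifically, RobertaForMultipleChoice expects input_ids of shape (batch_size, num_choices, sequence_length), i.e. each option encoded together with the prompt to a common length. A minimal sketch; the prompt and option strings below are invented for illustration:

from transformers import RobertaForMultipleChoice, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaForMultipleChoice.from_pretrained('roberta-base')

prompt = "The man broke his toe."          # made-up example
options = ["He got a hole in his sock.",
           "He dropped a hammer on his foot.",
           "He cleaned the floor."]

# Pair the prompt with every option, padding/truncating to one shared length.
encoding = tokenizer([prompt] * len(options), options,
                     truncation=True, padding=True, return_tensors='pt')

# Add a batch dimension: (1, num_choices, sequence_length).
inputs = {k: v.unsqueeze(0) for k, v in encoding.items()}
outputs = model(**inputs)   # outputs.logits has shape (1, num_choices)

This is also a likely source of the KeyError above: Trainer indexes its train_dataset by integer position, and a raw pandas DataFrame interprets that as a column lookup, so it should be wrapped in a torch or datasets Dataset that returns dicts of tensors in this layout.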