I am currently trying to fine-tune a pretrained RoBERTa model with a multiple-choice head, but I am having difficulty finding the right input format to get the model to train.
My dataframe has three option columns (OptionA, OptionB, OptionC), each holding a tokenized sentence, produced using:
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

# encode each option column to token IDs
for col in ["OptionA", "OptionB", "OptionC"]:
    train_data[col] = train_data[col].apply(tokenizer.encode)
My evaluation set has the same structure; the training set has 6,500 rows and the evaluation set has 1,500 rows. I am trying to implement this with:
from transformers import RobertaForMultipleChoice, Trainer, TrainingArguments

model = RobertaForMultipleChoice.from_pretrained('roberta-base')

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=32,  # batch size per device during training
    per_device_eval_batch_size=32,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for the learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
)

trainer = Trainer(
    model=model,                 # the instantiated Transformers model to be trained
    args=training_args,          # training arguments, defined above
    train_dataset=train_split,   # training dataset
    eval_dataset=eval_split,     # evaluation dataset
)

trainer.train()
But I keep getting different KeyErrors, for example:
KeyError: 2526
If anyone knows what I am doing wrong, I would be very grateful, as I have been stuck trying to train this model for the past three days.
The RoBERTa model has 514 position embeddings, but the usable input length is 512 tokens, and that limit includes the two special tokens <s> and </s> (RoBERTa's equivalents of BERT's [CLS] and [SEP]). You need to set the truncation parameter so that any longer text is cut off, and enable padding for any input shorter than the maximum length. My other answer shows how to do this: https://stackoverflow.com/a/76318049/21949232
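For reference, here is a minimal sketch of tokenizing with truncation and padding (the max_length of 512 and the sample text are just placeholders):

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

# truncation cuts anything beyond max_length; padding fills shorter
# inputs up to it, so every example ends up the same length
encoding = tokenizer(
    "an example option sentence",
    truncation=True,
    padding='max_length',
    max_length=512,
    return_tensors='pt',
)
print(encoding['input_ids'].shape)  # torch.Size([1, 512])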
Below is the general input structure for the RoBERTa model:

RoBERTa: <s> + tokens + </s> + padding
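On the KeyError itself: Trainer looks items up in train_dataset by integer position, and a raw pandas DataFrame treats an integer key like 2526 as a column label, which would explain the KeyError: 2526 in the question. Below is a minimal sketch of a wrapper that yields what RobertaForMultipleChoice expects, assuming the option columns still hold raw strings and a hypothetical Label column containing the index (0-2) of the correct option:

import torch
from torch.utils.data import Dataset

class MultipleChoiceDataset(Dataset):
    def __init__(self, df, tokenizer, max_length=128):
        self.df = df.reset_index(drop=True)  # guarantee positions 0..len-1
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        # one sequence per choice; column names taken from the question
        options = [row["OptionA"], row["OptionB"], row["OptionC"]]
        enc = self.tokenizer(
            options,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt',
        )
        return {
            "input_ids": enc["input_ids"],            # shape (3, max_length)
            "attention_mask": enc["attention_mask"],  # shape (3, max_length)
            "labels": torch.tensor(row["Label"]),     # hypothetical label column
        }

train_split = MultipleChoiceDataset(train_data, tokenizer)

The default data collator then stacks these per-example tensors into the (batch_size, num_choices, sequence_length) inputs that RobertaForMultipleChoice expects.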