We are trying to finetune / train our RoBERTa model on our own train data. The project is exactly the same as the SemEval-2020 task B on choosing the right reason out of 3 on why a sentence is against common sense. For the past two days we have been struggling with errors, mainly when trying to train/finetune our model. The code we have used comes from https://huggingface.co/transformers/model_doc/roberta.html#robertamodel . Although we have tried to alter this code in multiple ways we can't seem to really start training our model. Our main problem is the data we try to train the model on. We have tried to immediately insert a numpy array or pandas dataframe, but to no avail. Finally we tried to use a tfds. We used the following code, which results in the error code which can be found below.
I install the following packages:
# Modules
pip install transformers
from transformers import RobertaConfig, RobertaModel
from transformers import RobertaTokenizer, RobertaForMultipleChoice
from transformers import AutoModel, AutoTokenizer
import tensorflow as tf
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
I import my train and test sets as csv files, where after the data is cleaned and concatenated.
# Read csv files
sample_data = pd.read_csv("sample.csv")
train_data = pd.read_csv("train_data.csv")
train_answers = pd.read_csv("train_answers.csv")
test_data = pd.read_csv("test_data.csv")
train_data['A_concat'] = train_data['FalseSent'] + '. ' + train_data['OptionA']
train_data['B_concat'] = train_data['FalseSent'] + '. ' + train_data['OptionB']
train_data['C_concat'] = train_data['FalseSent'] + '. ' + train_data['OptionC']
train_data.drop(['OptionA','OptionB','OptionC'],axis=1,inplace=True)
train_data.columns = ['id', 'FalseSent', 'OptionA', 'OptionB', 'OptionC']
I make a tfds of the train and test data:
tf_train_data = tf.data.experimental.CsvDataset("train_data.csv", record_defaults='Tensor', header=True)
tf_test_data = tf.data.experimental.CsvDataset("test_data.csv", record_defaults='Tensor', header=True)
These data sets are then used to train the model through the following code:
from transformers import TFRobertaForMultipleChoice, TFTrainer, TFTrainingArguments
model = TFRobertaForMultipleChoice.from_pretrained("roberta-base")
training_args = TFTrainingArguments(
output_dir='./results',
num_train_epochs=3,
per_device_train_batch_size=16,
per_device_eval_batch_size=64,
warmup_steps=500,
weight_decay=0.01,
logging_dir='./logs',
)
trainer = TFTrainer(
model=model,
args=training_args,
train_dataset=tf_train_data,
eval_dataset=tf_test_data
)
trainer.train()
This results in the following error:
ValueError: The training dataset must have an asserted cardinality
If anybody has any advice or can point us in the right direction on how to train/finetune our model we would be very grateful!
You should feed sequences of integers (tf.int) to Roberta not tf.string. Tokenize your data