I am trying to fine-tune a DistilBERT model on a custom dataset, but I am encountering the following error while training the model:
TypeError: '>' not supported between instances of 'NoneType' and 'int'
The dataset's feature is Starbucks reviews, and its target variable is a rating from 1 to 5.
My code:
Importing the necessary libraries
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import missingno as msno
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
from transformers import pipeline
from transformers import DistilBertTokenizerFast
from transformers import TFDistilBertForSequenceClassification, TFTrainer, TFTrainingArguments
Reading the dataset
data_path = '/kaggle/input/starbucks-reviews-dataset/reviews_data.csv'
ds_name = 'Starbucks Reviews'
df = pd.read_csv(data_path)
Removing null values
First, I removed all rows where the user did not provide a review. Then I filled the NaN values in the Rating column by using the pre-trained nlptown model from Hugging Face to classify every review that had no rating on a scale of 1-5.
dropped_rev = list(df['Review'][df['Review'] == "No Review Text"].index)
df.drop(dropped_rev, axis=0, inplace=True)
classified_rev = list(df['Review'][df['Rating'].isnull()].values)
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, from_pt=True)
toknizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline("sentiment-analysis", model=model_name, tokenizer=toknizer)
results = classifier(classified_rev)
labels = []
for result in results:
    print(f"Label: {result['label']}, Score: {result['score']}")
    labels.append(int(result['label'][0]))
if len(labels) == df['Rating'].isnull().sum():
    df.loc[df['Rating'].isnull(), 'Rating'] = labels
    df['Rating'] = df['Rating'].astype('int64')
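The label parsing above relies on the pipeline returning labels such as "1 star" or "5 stars". A minimal sketch of that parsing step in isolation, using a hypothetical helper `stars_to_int` and made-up sample results standing in for real pipeline output:

```python
def stars_to_int(label: str) -> int:
    """Extract the leading star count from a label such as '4 stars'."""
    # Split on whitespace instead of taking label[0], so the parsing does
    # not depend on the count being a single character.
    return int(label.split()[0])


# Made-up results in the shape the pipeline returns (a list of dicts).
sample_results = [
    {"label": "5 stars", "score": 0.71},
    {"label": "1 star", "score": 0.64},
]
labels = [stars_to_int(r["label"]) for r in sample_results]
print(labels)
```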
Converting the range from 1-5 to 0-4
df['Rating'].replace({1:0, 2:1, 3:2, 4:3, 5:4}, inplace=True)
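Because the ratings are consecutive integers, the same shift can also be written as a subtraction. A small sketch on a plain list (the sample values are made up) that compares both approaches and checks the result against the 0-4 range expected for `num_labels=5`:

```python
ratings = [1, 2, 3, 4, 5, 3]  # made-up sample ratings in the 1-5 range

# Dictionary mapping, as in the replace() call.
mapping = {1: 0, 2: 1, 3: 2, 4: 3, 5: 4}
shifted_by_map = [mapping[r] for r in ratings]

# Equivalent subtraction.
shifted_by_sub = [r - 1 for r in ratings]

# Both approaches agree, and every label lies in 0..4.
same_result = shifted_by_map == shifted_by_sub
in_range = all(0 <= y <= 4 for y in shifted_by_sub)
print(shifted_by_sub, same_result, in_range)
```

With pandas, `df['Rating'] = df['Rating'] - 1` should give the same result as the dictionary-based replace.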
Splitting the data into Train and Test datasets
X = list(df['Review'])
y = list(df['Rating'])
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=123)
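For reference, the split sizes can be worked out by hand. The sketch below assumes scikit-learn rounds the test fraction up and gives the remainder to the training set, and it uses 813 rows, the total implied by the 650 and 163 figures quoted later in this post. `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple:
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)  # test share is rounded up
    n_train = n_samples - n_test               # training set gets the rest
    return n_train, n_test


print(split_sizes(813, 0.2))  # (650, 163)
```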
Define the tokenizer
toknizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
train_encoding = toknizer(X_train, truncation=True, padding=True)
test_encoding = toknizer(X_test, truncation=True, padding=True)
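To make the two tokenizer options concrete: `truncation=True` cuts sequences that exceed the maximum length, and `padding=True` pads every sequence to the longest one in the call. The toy function below is a hypothetical stand-in that mimics this on lists of token ids. It is not the tokenizer's implementation, and `max_length=6` and `pad_id=0` are arbitrary illustrative values.

```python
def pad_and_truncate(batch, max_length=6, pad_id=0):
    """Toy version of truncation followed by pad-to-longest."""
    # Truncate first. A real tokenizer also keeps the closing special
    # token when truncating, which this sketch ignores.
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    # Pad each sequence to the longest one and record which positions are real.
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids of different lengths.
batch = [[101, 7, 8, 102], [101, 5, 102], [101, 1, 2, 3, 4, 5, 6, 102]]
encoded = pad_and_truncate(batch)
print(encoded["input_ids"])
```

Since the training and test texts are encoded in separate calls, the padded length of the two sets may differ.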
Converting Encodings into tensors
train_dataset = tf.data.Dataset.from_tensor_slices(
    (
        dict(train_encoding),
        y_train
    )
)
test_dataset = tf.data.Dataset.from_tensor_slices(
    (
        dict(test_encoding),
        y_test
    )
)
Define the training arguments
training_args = TFTrainingArguments(
    output_dir='./results',
    num_train_epochs=2,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)
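As a rough check on these settings, the number of optimizer steps can be estimated from the dataset size. The sketch assumes the 650 training reviews mentioned later in this post, a single device, and no gradient accumulation. The exact count inside the trainer may differ slightly.

```python
import math

train_size = 650     # training reviews, as quoted later in the post
batch_size = 8       # per_device_train_batch_size
epochs = 2           # num_train_epochs
warmup_steps = 500   # warmup_steps

steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps, warmup_steps > total_steps)  # 82 164 True
```

Under these assumptions the warmup period is longer than the whole run, so the learning rate would still be warming up when training ends.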
Train the model
with training_args.strategy.scope():
    model = TFDistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=5)
trainer = TFTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)
trainer.train()
I am encountering the error on the trainer.train() line.
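The message itself is what Python raises when `None` is compared with an integer using `>`, so some value involved in a comparison appears to be unset. The sketch below reproduces the message and uses a hypothetical helper, `find_unset`, to list `None`-valued fields. A `SimpleNamespace` stands in for the real arguments object, and `eval_steps` is left unset purely as an illustration. Whether it is the value involved here is an assumption, not something confirmed.

```python
from types import SimpleNamespace


def find_unset(args) -> list:
    """Return the sorted names of attributes whose value is None."""
    return sorted(name for name, value in vars(args).items() if value is None)


# Stand-in for an arguments object; eval_steps=None is an illustrative guess.
fake_args = SimpleNamespace(warmup_steps=500, logging_steps=10, eval_steps=None)

try:
    fake_args.eval_steps > 0  # comparing None with an int
except TypeError as exc:
    message = str(exc)

print(message)
print(find_unset(fake_args))
```

On the real object, a call like `find_unset(training_args)` may work if its fields are exposed through `vars()`. That is untested here.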
Here is what I have tried so far:
- I checked the data types of y_train_encoded and y_test_encoded; both are int64.
- I checked all the values in y_train and y_test; they all range from 0 to 4.
- I checked that the datasets are compatible: X_train is a list of 650 review strings and y_train_encoded is an integer NumPy array of shape 650 with values from 0 to 4, while X_test is a list of 163 review strings and y_test_encoded is an integer NumPy array of shape 163 with values from 0 to 4. So I think the datasets are compatible.
- I checked X_train and X_test; every entry is a string and there are no mixed data types.
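The checks listed above can be bundled into one function. This is a minimal sketch with a hypothetical helper, `check_dataset`; the sample texts and labels are made up.

```python
import numbers


def check_dataset(texts, labels, num_labels=5):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if len(texts) != len(labels):
        problems.append(
            f"length mismatch: {len(texts)} texts vs {len(labels)} labels"
        )
    if not all(isinstance(t, str) for t in texts):
        problems.append("non-string entry in texts")
    # numbers.Integral covers Python ints; NumPy integer types are expected
    # to pass this check as well.
    if not all(isinstance(y, numbers.Integral) for y in labels):
        problems.append("non-integer label")
    elif not all(0 <= y < num_labels for y in labels):
        problems.append(f"label outside 0..{num_labels - 1}")
    return problems


print(check_dataset(["great coffee", "slow service"], [4, 0]))
print(check_dataset(["great coffee", None], [5]))
```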
I still don't know what is causing this error, and I would like to resolve it as soon as possible.