TypeError while Fine-Tuning DistilBert Model


I am trying to fine-tune a DistilBERT model on a custom dataset, but I am encountering the following error while training the model:

TypeError: '>' not supported between instances of 'NoneType' and 'int'

The dataset's feature is Starbucks reviews and its target variable is ratings from 1 to 5.

My code:

Importing the necessary libraries

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import missingno as msno

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

import tensorflow as tf

from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
from transformers import pipeline
from transformers import DistilBertTokenizerFast
from transformers import TFDistilBertForSequenceClassification, TFTrainer, TFTrainingArguments

Reading the dataset

data_path = '/kaggle/input/starbucks-reviews-dataset/reviews_data.csv'
ds_name = 'Starbucks Reviews'
df = pd.read_csv(data_path)

Removing Null Values

First I removed all the rows where the user did not provide a review. Then I filled in the NaN values in the Rating column by using the pre-trained nlptown model from Hugging Face to classify every review that had no rating, on a scale from 1 to 5.

dropped_rev = list(df['Review'][df['Review'] == "No Review Text"].index)
df.drop(dropped_rev, axis=0, inplace=True)

classified_rev = list(df['Review'][df['Rating'].isnull()].values)
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, from_pt=True)
toknizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline("sentiment-analysis", model=model_name, tokenizer=toknizer)

results = classifier(classified_rev)
labels = []
for result in results:
    print(f"Label: {result['label']}, Score: {result['score']}")
    labels.append(int(result['label'][0]))

if len(labels) == df['Rating'].isnull().sum():
    df.loc[df['Rating'].isnull(), 'Rating'] = labels

df['Rating'] = df['Rating'].astype('int64')
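
The label-parsing step above reads only the first character of each label. A slightly more defensive sketch is shown here, assuming the classifier returns labels of the form "1 star" to "5 stars". The helper name label_to_rating and the sample results are hypothetical and only illustrate the idea:

```python
def label_to_rating(label: str) -> int:
    """Parse a label such as '4 stars' into the integer 4."""
    rating = int(label.split()[0])
    if not 1 <= rating <= 5:
        raise ValueError(f"unexpected rating in label: {label!r}")
    return rating


# Hypothetical results in the same shape the pipeline returns
sample_results = [
    {"label": "5 stars", "score": 0.91},
    {"label": "1 star", "score": 0.78},
]

labels = [label_to_rating(r["label"]) for r in sample_results]
print(labels)
```

Failing loudly on an unexpected label makes it harder for a malformed value to slip into the Rating column unnoticed.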

Converting the range from 1-5 to 0-4

df['Rating'].replace({1:0, 2:1, 3:2, 4:3, 5:4}, inplace=True)
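
The mapping above amounts to subtracting 1 from every rating, so that the labels start at 0 and fall in the range 0-4 that a classification head with num_labels=5 expects. A minimal plain-Python sketch of the same remapping, using hypothetical sample ratings:

```python
# Same mapping as the replace() call above
mapping = {1: 0, 2: 1, 3: 2, 4: 3, 5: 4}

ratings = [5, 1, 3]  # hypothetical sample ratings on the 1-5 scale

# Subtracting 1 gives the same result as looking each rating up in the mapping
zero_based = [r - 1 for r in ratings]
print(zero_based)
```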

Splitting the data into Train and Test datasets

X = list(df['Review'])
y = list(df['Rating'])

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=123)

Define the tokenizer

toknizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
train_encoding = toknizer(X_train, truncation=True, padding=True)
test_encoding = toknizer(X_test, truncation=True, padding=True)

Converting Encodings into tensors

train_dataset = tf.data.Dataset.from_tensor_slices(
    (
        dict(train_encoding),
        y_train
    )
)

test_dataset = tf.data.Dataset.from_tensor_slices(
    (
        dict(test_encoding),
        y_test
    )
)

Define Training Arguments

training_args = TFTrainingArguments(
    output_dir='./results',
    num_train_epochs=2,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)

Train the model

with training_args.strategy.scope():
    model = TFDistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=5)

trainer = TFTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)

trainer.train()

I am encountering the error on the trainer.train() line.

Here are the things I have tried:

  • I have checked the datatype of y_train_encoded and y_test_encoded; both are int64.
  • I have already checked all the values in y_train and y_test, and all of them are from 0 to 4.
  • I have checked dataset compatibility: X_train is a list of 650 review strings and y_train_encoded is an integer NumPy array of shape 650 with values between 0 and 4, while X_test is a list of 163 review strings and y_test_encoded is an integer NumPy array of shape 163 with values between 0 and 4, so I think the datasets are compatible.
  • I checked X_train and X_test; all entries are strings and there are no mixed datatypes.
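
The checks listed above can be collected into one function. This is only a sketch of that kind of validation: the helper name check_split and the toy data are hypothetical, and it looks only at the texts and labels, not at the training arguments or the trainer.

```python
import operator
from collections import Counter


def check_split(texts, labels, num_labels=5):
    """Validate one split and return a count of examples per label."""
    if len(texts) != len(labels):
        raise ValueError(f"{len(texts)} texts but {len(labels)} labels")

    for i, text in enumerate(texts):
        if not isinstance(text, str):
            raise TypeError(f"texts[{i}] is {type(text).__name__}, expected str")

    counts = Counter()
    for i, label in enumerate(labels):
        # operator.index accepts integer-like values (including NumPy ints)
        # and raises TypeError for None, floats and strings
        value = operator.index(label)
        if not 0 <= value < num_labels:
            raise ValueError(f"labels[{i}] = {value} is outside 0..{num_labels - 1}")
        counts[value] += 1
    return counts


# Hypothetical toy data standing in for X_train / y_train
toy_texts = ["great coffee", "slow service", "average latte"]
toy_labels = [4, 0, 2]
print(check_split(toy_texts, toy_labels))
```

Running it on both the train and test splits would confirm in one place that no label is None or out of range, which helps rule the data out as the source of the TypeError.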

I still don't know what is causing this error, and I would like to resolve it as soon as possible.
