SetFit training with a pandas DataFrame


I would like to train a zero-shot classifier on an annotated sample dataset.

I am following some tutorials, but since they all use their own data and the same pretrained model, I am trying to confirm: is this the best approach?

Data example: 

import pandas as pd
from datasets import Dataset
# Sample feedback data (the real dataset will have 8 samples per label)
feedback_dict = [
    {'text': 'The product is great and works well.', 'label': 'Product Performance'},
    {'text': 'I love the design of the product.', 'label': 'Product Design'},
    {'text': 'The product is difficult to use.', 'label': 'Usability'},
    {'text': 'The customer service was very helpful.', 'label': 'Customer Service'},
    {'text': 'The product was delivered on time.', 'label': 'Delivery Time'}
]

# Create a DataFrame with the feedback data
df = pd.DataFrame(feedback_dict)

# Convert to Hugging Face Dataset format
df = Dataset.from_pandas(df)

With the data in that format, this is the approach for fine-tuning the model:

from setfit import SetFitModel, SetFitTrainer

# Select a model
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

# training with Setfit
trainer = SetFitTrainer(
    model=model,
    train_dataset=df, # to keep the code simple I do not create the df_train
    eval_dataset=df, # to keep the code simple I do not create the df_eval
    column_mapping={"text": "text", "label": "label"} 
)

trainer.train()
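
For reference, a separate train/eval split could be built with Dataset.train_test_split. A minimal sketch (on the 5-row sample above it is of little use, but it applies to the real 88-record dataset):

# Sketch: split the Dataset into train/eval portions
splits = df.train_test_split(test_size=0.2, seed=42)
df_train = splits["train"]
df_eval = splits["test"]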

The issue is that on my laptop the process never finishes, even after more than 500 hours, and the dataset is only about 88 records with 11 labels.


There are 2 answers

SilentCloud (best answer)

I tried to run the example you posted on Google Colab, and it took 37 seconds to run the training.

Here's your code with some tweaks to make it work on Colab:

### Install libraries
%%capture
!pip install datasets setfit

After installing the libraries, run the following code:

### Import dataset
import pandas as pd
from datasets import Dataset
# Sample feedback data (the real dataset will have 8 samples per label)
feedback_dict = [
    {'text': 'The product is great and works well.', 'label': 'Product Performance'},
    {'text': 'I love the design of the product.', 'label': 'Product Design'},
    {'text': 'The product is difficult to use.', 'label': 'Usability'},
    {'text': 'The customer service was very helpful.', 'label': 'Customer Service'},
    {'text': 'The product was delivered on time.', 'label': 'Delivery Time'}
]
# Create a DataFrame with the feedback data
df = pd.DataFrame(feedback_dict)
# Convert to Hugging Face Dataset format
df = Dataset.from_pandas(df)

### Run training
from setfit import SetFitModel, SetFitTrainer
# Select a model
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
# training with Setfit
trainer = SetFitTrainer(
    model=model,
    train_dataset=df, # to keep the code simple I do not create the df_train
    eval_dataset=df, # to keep the code simple I do not create the df_eval
    column_mapping={"text": "text", "label": "label"} 
)
trainer.train()
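
If you also want a quick metric after training, SetFitTrainer has an evaluate() method that computes accuracy by default. A minimal sketch (here it scores the same data used for training, since the example reuses df for both splits):

### Evaluate the model
metrics = trainer.evaluate()
print(metrics)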

And finally, you can save the trained model to Drive and then download it to your PC manually.

### Save model to Drive
from google.colab import drive
drive.mount('/content/drive')
# Use the public save_pretrained API rather than the private _save_pretrained hook
trainer.model.save_pretrained('/content/drive/path/to/target/folder')
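
To use the model later, the saved folder can be loaded back with SetFitModel.from_pretrained. A minimal sketch (predict should return the string labels used during training):

### Reload the saved model and run inference
from setfit import SetFitModel
model = SetFitModel.from_pretrained('/content/drive/path/to/target/folder')
preds = model.predict(['The delivery was fast.'])
print(preds)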

If your main issue is the training time, this should fix it.
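
If you do have to stay on CPU, one SetFitTrainer knob worth knowing is num_iterations, which controls how many contrastive pairs are generated per sample, so fewer pairs means a shorter run. A sketch, assuming a setfit version that exposes these arguments:

### Faster CPU training (sketch)
trainer = SetFitTrainer(
    model=model,
    train_dataset=df,
    eval_dataset=df,
    column_mapping={"text": "text", "label": "label"},
    num_iterations=5, # default is 20; fewer pairs = faster, possibly less accurate
    batch_size=8,
)
trainer.train()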

Maciej Skorski

Nothing is wrong with your code, but you need a more powerful machine, ideally with a GPU, to train Transformers. They are not for the poor :-) Try Colab or Kaggle for free, or a private VM if you have the chance. It takes a few seconds to run a few epochs.
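
To confirm the runtime actually sees a GPU, a quick check with PyTorch (which setfit pulls in as a dependency) is enough; a minimal sketch:

import torch
# SetFit trains on the GPU automatically when CUDA is visible, otherwise it falls back to CPU
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))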

I am sharing a Colab notebook here, and here is how the performance and resources look:

[Screenshot: training performance and resource usage in Colab]

My advice would be to use the free Kaggle notebooks with GPU: slower than Colab (by a factor of about 4x in my experience), but more generous in availability and time limits. Here is the Kaggle notebook too, for comparison and play.

Happy GPU training!