Transfer learning (or fine-tuning) a pre-trained model on non-text data


I am currently fine-tuning a BERT-based sentiment analysis model using the PyTorch Trainer from Hugging Face. So far, so good.

I have easily managed to fine-tune the model on my text data. However, I'd like to conduct an ablation study to see how the inclusion of linguistic features impacts the model's performance.

In other words, I want to see how including features such as comment length, type-token ratio, and others (each stored in a separate column of my dataset) affects the model's performance.
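For context, by type-token ratio I mean the number of unique tokens divided by the total number of tokens. A minimal sketch of how such feature columns could be computed, assuming naive whitespace tokenization (in my actual data these columns are already precomputed):

import pandas as pd

def type_token_ratio(text: str) -> float:
    # Naive whitespace tokenization: unique tokens divided by total tokens
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Toy example; in my case these columns already exist in the CSV
df = pd.DataFrame({"text_lower": ["hi, my name is...", "i cannot believe i did not..."]})
df["ttr"] = df["text_lower"].apply(type_token_ratio)
df["comment_length"] = df["text_lower"].str.split().str.len()
print(df)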

This is roughly what my data looks like:

Text                             Type-token ratio
Hi, my name is...                1.0
I cannot believe I did not...    0.95

In the specific case above, for instance, I would like to fine-tune the model on the Text column but also on the Type-token ratio column.

I know that some people concatenate the two columns into a single string (sketched below), but I am not sure that is the correct method. Is there a more methodologically sound way of doing this?

I was not able to find much information about it.
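Concretely, the string-concatenation approach I mentioned would look roughly like this. This is only a sketch: the ttr column name is hypothetical, and the model name is the same placeholder used in my code below.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("a_model/a_bert_like_model")

def tokenize_with_ttr_in_text(example):
    # Turn the numeric feature into text and prepend it to the comment
    text_with_feature = f"ttr: {example['ttr']} {tokenizer.sep_token} {example['text_lower']}"
    return tokenizer(text_with_feature, padding="max_length", truncation=True)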

EDIT:

(The below code works and, ideally, it should also include the TTR column.)

This is my code:

import os
import numpy as np
import pandas as pd
from datasets import Dataset, load_metric
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

os.environ["WANDB_DISABLED"] = "true"  # disable Weights & Biases logging

# Load the data and make a 75/25 train/validation split
dataset = pd.read_csv('/content/gdrive/MyDrive/.../data_train.csv')
train_roberta = dataset[['text_lower', 'label']].sample(frac=0.75)
validation_roberta = dataset[['text_lower', 'label']].drop(train_roberta.index)
train_roberta = train_roberta.dropna()
validation_roberta = validation_roberta.dropna()
train = Dataset.from_pandas(train_roberta, preserve_index=False)
validation = Dataset.from_pandas(validation_roberta, preserve_index=False)

tokenizer = AutoTokenizer.from_pretrained("a_model/a_bert_like_model")

def tokenize_function(example):
    return tokenizer(example["text_lower"], padding="max_length", truncation=True)

tokenized_train_dataset = train.map(tokenize_function, batched=True)
tokenized_test_dataset = validation.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")  # default arguments for fine-tuning
model = AutoModelForSequenceClassification.from_pretrained("a_model/a_bert_like_model", num_labels=3)

def compute_metrics(eval_preds):  # compute accuracy and F1-score
    metric = load_metric("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

trainer = Trainer(  # set up the Trainer
    model,
    training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_test_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()  # start fine-tuning
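For completeness, this is the kind of alternative to string concatenation I have in mind, though I am not sure it is sound either: keep the TTR value as its own column in the tokenized datasets (e.g. renamed to ttr) and wrap the encoder so the classification head sees both the pooled text representation and the extra feature. A rough, untested sketch; BertWithExtraFeature and the ttr argument are names I made up.

import torch
import torch.nn as nn
from transformers import AutoModel

class BertWithExtraFeature(nn.Module):
    # Hypothetical wrapper: [CLS] representation concatenated with one numeric feature
    def __init__(self, model_name, num_labels=3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden_size = self.encoder.config.hidden_size
        self.classifier = nn.Linear(hidden_size + 1, num_labels)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, input_ids=None, attention_mask=None, ttr=None, labels=None, **kwargs):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = outputs.last_hidden_state[:, 0]  # [CLS] token representation
        combined = torch.cat([pooled, ttr.unsqueeze(-1).float()], dim=-1)
        logits = self.classifier(combined)
        loss = self.loss_fn(logits, labels) if labels is not None else None
        return {"loss": loss, "logits": logits}

As far as I understand, the Trainer keeps dataset columns based on the model's forward signature, so the ttr column would have to survive tokenization and collation, which is part of what I am unsure about.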