I would like to load a pre-trained BERT model and fine-tune it, in particular its word embeddings, on a custom dataset. The goal is to use the embeddings of chosen words for further analysis. It is important to mention that the dataset consists of tweets and there are no labels, so I used the BertForMaskedLM model.
Is it OK for this task to use the input ids (the tokenized tweets) as the labels? I have no labels; there are just tweets in randomized order.
Here is the code I wrote:
First, I cleaned the dataset of emojis, non-ASCII characters, etc., as described in Section 2.3 of the following notebook: https://www.kaggle.com/jaskaransingh/bert-fine-tuning-with-pytorch
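Roughly, the cleaning step looks like the following sketch (a simplified approximation, not the exact code from the notebook; 'content' is just an assumed name for the raw tweet column):

import re
import pandas as pd

def clean_tweet(text):
    # Remove URLs and user mentions
    text = re.sub(r"http\S+|@\w+", "", text)
    # Drop emojis and any other non-ASCII characters
    text = text.encode("ascii", "ignore").decode()
    # Collapse repeated whitespace
    return re.sub(r"\s+", " ", text).strip()

# 'content' is a placeholder for the raw tweet column in my CSV
df = pd.read_csv(path + file_name)
df['content_cleaned'] = df['content'].apply(clean_tweet)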
Second, the code of the fine-tuning process:
import torch
import pandas as pd
from transformers import BertTokenizer, BertForMaskedLM, AdamW

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.to(device)
model.train()

lr = 1e-2
max_grad_norm = 1.0
optimizer = AdamW(model.parameters(), lr=lr, correct_bias=False)

max_len = 82
chunk_size = 20
epochs = 20

for epoch in range(epochs):
    epoch_losses = []
    for j, batch in enumerate(pd.read_csv(path + file_name, chunksize=chunk_size)):
        tweets = batch['content_cleaned'].tolist()
        encoded_dict = tokenizer.batch_encode_plus(
            tweets,                       # Sentences to encode.
            add_special_tokens=True,      # Add '[CLS]' and '[SEP]'.
            max_length=max_len,           # Pad & truncate all sentences.
            pad_to_max_length=True,
            truncation=True,
            return_attention_mask=True,   # Construct attention masks.
            return_tensors='pt',          # Return PyTorch tensors.
        )
        input_ids = encoded_dict['input_ids'].to(device)
        attention_mask = encoded_dict['attention_mask'].to(device)
        # Is it correct? Or should I train it in another way?
        loss, _ = model(input_ids, attention_mask=attention_mask, labels=input_ids)
        epoch_losses.append(loss.item())
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        optimizer.zero_grad()

model.save_pretrained(path + "Fine_Tuned_BertForMaskedLM")
The loss starts at around 50 and decreases to about 2.3.
Since the objective of masked language modeling is to predict the masked tokens, the labels and the inputs are the same. So what you have written is correct.
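For completeness: the usual way to apply the random masking with the Hugging Face library is DataCollatorForLanguageModeling, which replaces a fraction of the input tokens with [MASK] and keeps the original ids as labels (with -100 on unmasked positions, so the loss is only computed on masked tokens). Below is a minimal sketch, assuming a recent transformers version; the tweets list is a placeholder for your data and 0.15 is just the library's default masking probability:

import torch
from transformers import BertTokenizer, BertForMaskedLM, DataCollatorForLanguageModeling

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased').to(device)

# Dynamically mask 15% of the tokens in every batch, BERT-style
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

tweets = ["i love this weather", "just finished my coffee"]   # placeholder tweets
examples = [tokenizer(t, truncation=True, max_length=82) for t in tweets]
batch = collator(examples)   # pads, inserts [MASK] tokens, builds labels

outputs = model(input_ids=batch['input_ids'].to(device),
                attention_mask=batch['attention_mask'].to(device),
                labels=batch['labels'].to(device))
loss = outputs[0]   # cross-entropy over the masked positions only

This keeps the training loop in your question unchanged except that the batch now comes from the collator instead of batch_encode_plus.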
However, I would like to add something on the concept of comparing word embeddings. BERT is not a static word-embedding model; it is contextual, in the sense that the same word can have different embeddings in different contexts. Example: the word 'talk' has a different embedding in "I want to talk" than in "I will attend a talk". So there is no single embedding vector per word, which is what makes BERT different from word2vec or fastText.

Masked language modeling (MLM) on a pre-trained BERT is usually performed when you have a small new corpus and want your BERT model to adapt to it. However, I am not sure how much performance you would gain by running MLM first and then fine-tuning on a specific task, compared to directly fine-tuning the pre-trained model on the downstream task with the task-specific corpus.
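If you still want a vector for a chosen word after the MLM fine-tuning, one common workaround is to take the encoder's hidden state at that word's position in each sentence (and, if needed, average over occurrences). A rough sketch, assuming the checkpoint directory saved by your save_pretrained call:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Loads only the encoder weights from the fine-tuned MLM checkpoint
model = BertModel.from_pretrained(path + "Fine_Tuned_BertForMaskedLM")
model.eval()

sentence = "I will attend a talk"
target = "talk"   # assumes the word is a single WordPiece; split words need their pieces pooled

enc = tokenizer(sentence, return_tensors='pt')
with torch.no_grad():
    hidden = model(**enc)[0]   # last hidden layer, shape (1, seq_len, 768)

# Locate the target token(s) and average their hidden states
target_id = tokenizer.convert_tokens_to_ids(target)
positions = (enc['input_ids'][0] == target_id).nonzero(as_tuple=True)[0]
word_vec = hidden[0, positions].mean(dim=0)   # contextual embedding of 'talk' in this sentence

Running this for the same word in different sentences gives you the different contextual vectors mentioned above, which you can then compare, e.g. with cosine similarity.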