Using WordPiece tokenization with RoBERTa

As far as I understand, the RoBERTa model implemented in the Hugging Face library uses a BPE tokenizer. The documentation states:

RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a different pretraining scheme.

However, I have a custom tokenizer based on WordPiece tokenization, which I load with BertTokenizer.

Because my customized tokenizer is much more relevant for my task, I prefer not to use BPE.

When I pre-trained RoBERTa from scratch (RobertaForMaskedLM) with my custom tokenizer, the MLM loss was much lower than with BPE. However, when it comes to fine-tuning, the model (RobertaForSequenceClassification) performs poorly. I am almost sure the problem is not the tokenizer itself. I wonder whether the Hugging Face implementation of RobertaForSequenceClassification is incompatible with my tokenizer.
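One way to probe the compatibility question is to compare the tokenizer's settings with the model config. The sketch below is plain Python and only illustrates the checks; `TokenizerSpec`, `ModelSpec`, and `find_mismatches` are hypothetical names, not library APIs, and the numbers are illustrative. It assumes that the RoBERTa implementation derives position ids from the padding id in the config, so a BertTokenizer whose [PAD] id differs from `config.pad_token_id` would shift them. That assumption is worth verifying against the installed library version.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TokenizerSpec:
    """Values read from the tokenizer (hypothetical container)."""
    vocab_size: int      # e.g. len(tokenizer)
    pad_token_id: int    # e.g. tokenizer.pad_token_id
    max_length: int      # longest sequence fed to the model


@dataclass
class ModelSpec:
    """Values read from the model config (hypothetical container)."""
    vocab_size: int                # e.g. config.vocab_size
    pad_token_id: int              # e.g. config.pad_token_id
    max_position_embeddings: int   # e.g. config.max_position_embeddings


def find_mismatches(tok: TokenizerSpec, model: ModelSpec) -> List[str]:
    """Return the names of settings that disagree between tokenizer and model."""
    problems = []
    # Token ids must fit inside the embedding matrix.
    if tok.vocab_size > model.vocab_size:
        problems.append("vocab_size")
    # Assumption: the model treats config.pad_token_id as the padding index.
    if tok.pad_token_id != model.pad_token_id:
        problems.append("pad_token_id")
    # Assumption: positions are numbered pad_id + 1 .. pad_id + seq_len.
    needed = model.pad_token_id + tok.max_length + 1
    if needed > model.max_position_embeddings:
        problems.append("max_position_embeddings")
    return problems


# Illustrative values: a BERT-style tokenizer with [PAD] = 0 against a
# RoBERTa-style config with padding id 1.
bert_style = TokenizerSpec(vocab_size=30000, pad_token_id=0, max_length=512)
roberta_style = ModelSpec(vocab_size=30000, pad_token_id=1, max_position_embeddings=514)
print(find_mismatches(bert_style, roberta_style))
```

With a real setup, the fields would be filled from the loaded tokenizer and the model's config. An empty list does not prove compatibility; it only rules out these three mismatches.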

Details about the fine-tuning:

task: multilabel classification with imbalanced labels.

epochs: 20

loss: BCEWithLogitsLoss()

optimizer: Adam, weight_decay_rate: 0.01, lr: 2e-5, correct_bias: True
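Since the labels are imbalanced, one option to try with this setup is weighting the positive class of each label in the loss. The sketch below is not part of the configuration above; it computes a per-label ratio of negative to positive examples from a toy label matrix. `positive_class_weights` and `toy_labels` are hypothetical names used for illustration only.

```python
from typing import List


def positive_class_weights(labels: List[List[int]]) -> List[float]:
    """Per-label ratio of negative to positive examples."""
    num_examples = len(labels)
    num_labels = len(labels[0])
    weights = []
    for j in range(num_labels):
        positives = sum(row[j] for row in labels)
        negatives = num_examples - positives
        # Fall back to 1.0 when a label never occurs, to avoid dividing by zero.
        weights.append(negatives / positives if positives else 1.0)
    return weights


# Toy multilabel targets: 4 examples, 3 labels.
toy_labels = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
print(positive_class_weights(toy_labels))
```

The resulting list could be converted to a tensor and passed as the `pos_weight` argument of `BCEWithLogitsLoss`. Whether this helps in the case described here is untested.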

The F1 and AUC were very low because the output probabilities for the labels did not match the actual labels (even with a very low threshold), which means the model couldn't learn anything.
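To make the threshold remark concrete, here is a minimal sketch of how logits become multilabel predictions and how micro-averaged F1 changes with the threshold. It uses toy numbers, not the results from this experiment, and `micro_f1` is a hypothetical helper rather than a library function.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Map a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def micro_f1(logits: List[List[float]], labels: List[List[int]],
             threshold: float = 0.5) -> float:
    """Micro-averaged F1 over all (example, label) pairs."""
    tp = fp = fn = 0
    for logit_row, label_row in zip(logits, labels):
        for logit, label in zip(logit_row, label_row):
            predicted = sigmoid(logit) >= threshold
            if predicted and label == 1:
                tp += 1
            elif predicted and label == 0:
                fp += 1
            elif not predicted and label == 1:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Toy logits and targets: 2 examples, 2 labels.
toy_logits = [[2.0, -1.0], [-0.5, -3.0]]
toy_targets = [[1, 0], [1, 0]]
for t in (0.5, 0.3, 0.1):
    print(t, micro_f1(toy_logits, toy_targets, threshold=t))
```

If F1 stays low across the whole range of thresholds, the ranking of the probabilities themselves is uninformative, which is consistent with the low AUC reported above.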

Note: RoBERTa pre-trained and fine-tuned with the BPE tokenizer performs better than RoBERTa pre-trained and fine-tuned with the custom tokenizer, even though the MLM loss with the custom tokenizer was lower than with BPE.
