I am using a model for a token classification task in the medical domain with Hugging Face Transformers. Unfortunately, I don't have enough data to build a new tokenizer and train a model from scratch, so I am fine-tuning an existing BERT-based model. I would, however, like to add some domain-specific words/tokens to boost performance.
My initial thought was to train a new WordPiece tokenizer with a limited vocabulary size on the medical-domain corpus and add to the pre-trained tokenizer whichever tokens are missing from its vocabulary (a rough sketch of what I mean is below). However, I came across this article, which suggests using the spaCy tokenizer together with sklearn's TfidfVectorizer and adding only whole words rather than subword tokens, since new subword tokens might mess up the existing logic of the pre-trained tokenizer (see the second sketch at the end of the post).
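This is roughly what I have in mind for the first approach. It is only a sketch, not something I have validated: the corpus file, the base checkpoint, the vocabulary size, and the number of labels are placeholders for my actual setup.

```python
# Rough sketch of approach 1 (placeholder paths / model name / sizes):
# train a small WordPiece tokenizer on the medical corpus, then add whatever
# tokens the pre-trained tokenizer is missing and resize the embeddings.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import AutoTokenizer, AutoModelForTokenClassification

# 1. Train a domain WordPiece tokenizer with a deliberately small vocabulary
domain_tok = Tokenizer(models.WordPiece(unk_token="[UNK]"))
domain_tok.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
trainer = trainers.WordPieceTrainer(
    vocab_size=5000,  # placeholder size
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
domain_tok.train(files=["medical_corpus.txt"], trainer=trainer)  # placeholder corpus

# 2. Compare against the pre-trained vocabulary
base = "bert-base-cased"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
existing = set(tokenizer.get_vocab())
# Skip "##" continuation pieces: tokens added via add_tokens() are matched
# as whole strings, so WordPiece continuations would not behave as intended
new_tokens = [t for t in domain_tok.get_vocab()
              if t not in existing and not t.startswith("##")]

# 3. Add the tokens and resize the embedding matrix; the new rows are
#    randomly initialised and get trained during fine-tuning
tokenizer.add_tokens(new_tokens)
model = AutoModelForTokenClassification.from_pretrained(base, num_labels=5)  # placeholder label count
model.resize_token_embeddings(len(tokenizer))
```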
Any suggestions on which approach might be better?
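For reference, this is roughly how I would implement the second (spaCy + tf-idf, whole-words-only) approach. Again just a sketch under assumptions: en_core_web_sm, the placeholder document list, the top-2000 cutoff, and bert-base-cased are stand-ins, not a tested setup.

```python
# Rough sketch of approach 2: spaCy tokenization + sklearn's TfidfVectorizer
# to rank candidate words, keeping only whole words the tokenizer doesn't know.
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import AutoTokenizer

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner", "lemmatizer"])

def spacy_tokenize(text):
    # Whole words only; punctuation and numbers dropped
    return [t.text for t in nlp(text) if t.is_alpha]

docs = ["..."]  # placeholder: list of medical documents
vectorizer = TfidfVectorizer(tokenizer=spacy_tokenize, lowercase=False)
tfidf = vectorizer.fit_transform(docs)

# Rank each word by its highest tf-idf score across the corpus
scores = tfidf.max(axis=0).toarray().ravel()
ranked = sorted(zip(vectorizer.get_feature_names_out(), scores),
                key=lambda kv: kv[1], reverse=True)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # placeholder checkpoint
vocab = set(tokenizer.get_vocab())
# Take the top-scoring words that are not already in the vocabulary
new_words = [w for w, _ in ranked[:2000] if w not in vocab]  # placeholder cutoff
tokenizer.add_tokens(new_words)
# ...then resize the model's embeddings exactly as in the first sketch
```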