I'm enriching the DistilBERT tokenizer with new tokens from a new corpus. DistilBERT uses a WordPiece tokenizer, and according to the Hugging Face NLP course, inference works by finding the "longest possible token" at the beginning of the word, splitting it off, and repeating the same procedure on the rest of the word.
My tokenizer's vocabulary, however, contains the tokens 'inspect', 'insp', 'ec', '##ec', and '##t', yet when tokenizing 'inspect' it comes up with ['insp', 'ec', '##t']. I would expect the tokenizer to return a single token, 'inspect'. Even if it splits the word, I would expect at least ['insp', '##ec', '##t'].
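For reference, here is a minimal sketch of that greedy longest-match inference as I understand it from the course (a toy reimplementation; the function name and the reduced vocabulary are mine, not anything from transformers). On a vocabulary that lacks 'inspect', it produces exactly the split I would expect:

def wordpiece_tokenize(word, vocab):
    # Greedy longest-match: repeatedly take the longest vocab entry that
    # matches at the current position; pieces after the first carry '##'.
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no matching piece at this position
        tokens.append(piece)
        start = end
    return tokens

print(wordpiece_tokenize("inspect", {"insp", "ec", "##ec", "##t"}))
# ['insp', '##ec', '##t']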
Is this a bug, or is some part of my code incorrect?
Minimal working example:
>> from transformers import AutoTokenizer
>> model_checkpoint = 'elastic/distilbert-base-uncased-finetuned-conll03-english'
>> tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
>> ('inspect' in tokenizer.vocab, 'insp' in tokenizer.vocab, 'ec' in tokenizer.vocab, '##ec' in tokenizer.vocab, '##t' in tokenizer.vocab)
# (True, False, True, True, True)
>> tokenizer.convert_ids_to_tokens(tokenizer.encode('inspect'))
# ['[CLS]', 'inspect', '[SEP]']
>> tokenizer.add_tokens(['insp'])
# 1  (number of tokens added)
>> ('inspect' in tokenizer.vocab, 'insp' in tokenizer.vocab, 'ec' in tokenizer.vocab, '##ec' in tokenizer.vocab, '##t' in tokenizer.vocab)
# (True, True, True, True, True)
>> tokenizer.convert_ids_to_tokens(tokenizer.encode('inspect'))
# ['[CLS]', 'insp', 'ec', '##t', '[SEP]']