I tried to add new words to the BERT tokenizer's vocabulary. I can see that the length of the vocab increases; however, I can't find the newly added words in the vocab.
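from transformers import BertTokenizer
# Assumption: bert-base-uncased, whose 30522 base tokens are consistent with the count printed below
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer.add_tokens(['covid', 'wuhan'])
v = tokenizer.get_vocab()
print(len(v))
print('covid' in tokenizer.vocab)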
Output:
30524
False
You are looking at two different things: tokenizer.vocab and tokenizer.get_vocab(). The first contains only the base vocabulary, without the added tokens, while the second contains the base vocabulary together with the added tokens. That is why the length reported by get_vocab() grows by two even though 'covid' is not found in tokenizer.vocab.
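
To confirm that the tokens really were added, it is safer to look them up through get_vocab(), get_added_vocab(), or convert_tokens_to_ids() than through the vocab attribute. Here is a minimal sketch of that check. The toy vocabulary file is hypothetical and only there so the snippet runs without downloading a checkpoint. With a real model you would load the tokenizer with from_pretrained instead. Exact behaviour of the vocab attribute can differ between tokenizer classes and library versions, so the sketch avoids relying on it.

```python
import os
import tempfile

from transformers import BertTokenizer

# Hypothetical toy vocabulary so the sketch runs offline; a real checkpoint
# would be loaded with BertTokenizer.from_pretrained(...) instead.
toy_vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "hello", "world"]

tmp_dir = tempfile.mkdtemp()
vocab_path = os.path.join(tmp_dir, "vocab.txt")
with open(vocab_path, "w", encoding="utf-8") as f:
    f.write("\n".join(toy_vocab) + "\n")

tokenizer = BertTokenizer(vocab_path)
base_size = len(tokenizer)  # vocabulary size before adding anything

# add_tokens returns how many tokens were actually new
num_added = tokenizer.add_tokens(["covid", "wuhan"])

full_vocab = tokenizer.get_vocab()         # base vocabulary plus added tokens
added_vocab = tokenizer.get_added_vocab()  # added tokens (may also list special tokens)

print(num_added)
print(len(full_vocab) - base_size)
print("covid" in full_vocab)
print("covid" in added_vocab)
print(tokenizer.convert_tokens_to_ids("covid") == full_vocab["covid"])
```

If the added tokens are going to be used with a model, the model's embedding matrix also has to be resized to len(tokenizer), otherwise the new ids will fall outside the embedding table.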