I'm trying to use my own vocab_file with GPT2Tokenizer, but I run into issues when certain tokens appear in the input.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2', vocab_file="./vocab.json")
encoding = tokenizer("Pa Pa Cl Cl Cl", return_tensors="pt", padding=True, truncation=True)
The above works as expected, but if I change the string to "Pa Pa Cl Cl Cl Nb", I get the following error:
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
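To narrow this down, here is the debugging step I'm trying: encoding without return_tensors so I can inspect the raw output (just a sketch against the same vocab.json; "Nb" is the token I suspect is missing from it):

# Encode without return_tensors so the raw Python lists are printed
# instead of failing during tensor conversion.
raw = tokenizer("Pa Pa Cl Cl Cl Nb")
print(raw["input_ids"])  # I suspect a None shows up here for "Nb"

# Look up the suspect token's id directly in the vocab.
print(tokenizer.convert_tokens_to_ids("Nb"))

If that lookup comes back as None, that would explain the "excessive nesting" complaint when the tokenizer tries to build a tensor.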
My vocab_file is here