"Features have excessive nesting" error when trying to use my own vocab_file


I'm trying to use my own vocab_file with GPT2Tokenizer, but I run into issues with certain tokens.

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2', vocab_file="./vocab.json")
encoding = tokenizer("Pa Pa Cl Cl Cl", return_tensors="pt", padding=True, truncation=True)

The above works as expected, but if I change the string to "Pa Pa Cl Cl Cl Nb" I get the following error:

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).

My vocab_file is here
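
In case it helps, here is a quick check to see how each token resolves against the custom vocab. This is only a sketch: it assumes ./vocab.json is the same file passed above and that tokens like "Nb" are meant to be whole entries in it. A missing or None id for "Nb" would be consistent with the ragged input_ids behind the "excessive nesting" error.

import json
from transformers import GPT2Tokenizer

# Same setup as in the question; assumes ./vocab.json exists locally.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2", vocab_file="./vocab.json")

with open("./vocab.json") as f:
    vocab = json.load(f)

# Print whether each token is present in the custom vocab and which id the
# tokenizer actually resolves it to.
for token in ["Pa", "Cl", "Nb"]:
    print(token, token in vocab, tokenizer.convert_tokens_to_ids(token))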


There are 0 answers