I'm loading a HF tokenizer and want to stop generation on the sequence `"</|im_end|>"`, but it looks like the tokenizer has two different ids for the same token. Is this a bug, or is it supposed to be this way?
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("LoneStriker/AlphaMonarch-7B-AWQ")
model = AutoModelForCausalLM.from_pretrained("LoneStriker/AlphaMonarch-7B-AWQ", device_map='cuda')

tokenizer.decode(700)   # '</'
tokenizer.decode(1867)  # '</'
tokenizer.decode(700) == tokenizer.decode(1867)  # True
```
The tokenization depends on whether the given piece of text stands at the beginning of a word or somewhere inside it. Note the difference:
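For example (a sketch against the tokenizer loaded above; the exact splits are vocabulary-dependent, so verify the output on your side):

```python
# The same surface string "</" maps to different tokens depending on position:
tokenizer.tokenize("</")   # e.g. ['▁</']       — "</" starts a word
tokenizer.tokenize("a</")  # e.g. ['▁a', '</']  — "</" is glued to preceding characters
```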
This is actually not that unusual; even common tokens have two ids depending on where in the word they appear, e.g. the token `power`:
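For instance (again a sketch, the concrete splits depend on the vocabulary):

```python
tokenizer.tokenize("power")       # e.g. ['▁power']           — word-initial variant
tokenizer.tokenize("horsepower")  # e.g. ['▁horse', 'power']  — word-internal variant
```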
Some tokenizers include a prefix that signifies that the token can only appear inside a word, never at its start, e.g.:
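BERT's WordPiece vocabulary is the classic example, with `##` marking continuation pieces (the split below is the one documented for `bert-base-uncased`):

```python
from transformers import AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_tokenizer.tokenize("embeddings")
# ['em', '##bed', '##ding', '##s'] — '##' marks pieces that continue a word
```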
This is true even for the tokenizer in question if you investigate the `tokenizer.vocab` object (there the prefix `▁` signifies a token that stands at the beginning of a word); however, I am not sure why the distinction does not carry over to the `tokenizer.decode` function:
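A quick check (which of the two ids carries the `▁` prefix is an assumption here, so inspect the output yourself):

```python
# Look up the raw vocabulary entries behind the two ids
tokenizer.convert_ids_to_tokens([700, 1867])
# e.g. ['</', '▁</'] — decode() strips the '▁' word-boundary marker,
# which is why both ids decode to the identical string '</'
```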
As for stopping the generation, I would investigate which token or sequence of tokens is actually produced at the end of a generated answer, and use that one as the stopping criterion (or possibly both ids, since I am not familiar with the concrete implementation).
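A minimal sketch with a custom `StoppingCriteria` that matches on the decoded text, which sidesteps the two-ids issue entirely (`tokenizer` and `model` are the ones loaded in the question; the prompt is a placeholder):

```python
from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnString(StoppingCriteria):
    """Stop once the decoded tail of the generation contains a given string."""

    def __init__(self, tokenizer, stop_string, tail_tokens=10):
        self.tokenizer = tokenizer
        self.stop_string = stop_string
        self.tail_tokens = tail_tokens  # how many trailing tokens to decode per check

    def __call__(self, input_ids, scores, **kwargs):
        # Comparing decoded text instead of token ids means it does not
        # matter whether the model emitted id 700 or id 1867 for '</'
        tail = self.tokenizer.decode(input_ids[0, -self.tail_tokens:])
        return self.stop_string in tail

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    # the stop sequence is the one from the question
    stopping_criteria=StoppingCriteriaList([StopOnString(tokenizer, "</|im_end|>")]),
)
print(tokenizer.decode(outputs[0]))
```

If your transformers version is new enough, `generate` also accepts `stop_strings=` together with a `tokenizer=` argument, which does essentially the same thing without the custom class.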