We are using a pre-trained SentencePiece tokenizer (Google's SentencePiece implementation, not the Hugging Face one), and we would like to preserve the ChatML special tokens:
<|im_start|>
and <|im_end|>
We don't want these tokens to be split into sub-word pieces; the tokenizer should map each of them to its own single token ID.
from sentencepiece import SentencePieceProcessor
sp_model = SentencePieceProcessor(model_file=...)
Using the Python implementation, how should we modify the model to do this? Thanks!
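In case it helps frame the question, here is a rough sketch of the direction we have been considering: parsing the serialized model proto with sentencepiece_model_pb2 and appending the two markers as USER_DEFINED pieces. The file names below are placeholders, and we are not sure this is the recommended route, which is why we are asking.

```python
# Sketch: append the ChatML markers to an existing SentencePiece model as
# USER_DEFINED pieces so they are never split and each get their own ID.
# "tokenizer.model" / "tokenizer_chatml.model" are placeholder paths.
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

# Load the serialized model proto from the existing .model file.
proto = sp_pb2.ModelProto()
with open("tokenizer.model", "rb") as f:
    proto.ParseFromString(f.read())

# Append each ChatML marker as a USER_DEFINED piece. Our understanding is that
# USER_DEFINED pieces are matched verbatim in raw input text, whereas CONTROL
# pieces are only reachable by ID.
for tok in ("<|im_start|>", "<|im_end|>"):
    piece = sp_pb2.ModelProto.SentencePiece()
    piece.piece = tok
    piece.score = 0.0
    piece.type = sp_pb2.ModelProto.SentencePiece.USER_DEFINED
    proto.pieces.append(piece)

# Write out the modified model and check that the markers survive as single pieces.
with open("tokenizer_chatml.model", "wb") as f:
    f.write(proto.SerializeToString())

sp = spm.SentencePieceProcessor(model_file="tokenizer_chatml.model")
print(sp.encode("<|im_start|>user\nhi<|im_end|>", out_type=str))
# Expected: ['<|im_start|>', ... pieces for 'user\nhi' ..., '<|im_end|>']
```

Is this the right way to do it, or is there a cleaner supported API for adding such symbols to an already-trained model? We would also like to confirm that USER_DEFINED (rather than CONTROL) is the correct piece type for tokens that appear literally in the text being encoded.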