How to modify a trained SentencePiece tokenizer to stop splitting the chatml tokens?


We are using a pre-trained SentencePiece tokenizer (Google's SentencePiece library, not the Hugging Face wrapper), and we would like to preserve the ChatML tokens:

<|im_start|> and <|im_end|>

We don't want these tokens to be split, and we want the tokenizer to map each of them to its own single token ID.

from sentencepiece import SentencePieceProcessor

sp_model = SentencePieceProcessor(model_file=...)

Using the Python implementation, how should we modify the model to do this? Thanks!
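One direction we're considering (not sure whether it's the right approach) is to edit the serialized model proto directly and append the ChatML tokens as user-defined pieces, since user-defined symbols are extracted as whole pieces before the regular segmentation. Below is a minimal sketch assuming the sentencepiece_model_pb2 module bundled with the sentencepiece pip package; the file names are placeholders for our actual model paths.

import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

# Load the serialized model proto (path is a placeholder).
proto = sp_pb2.ModelProto()
with open("tokenizer.model", "rb") as f:
    proto.ParseFromString(f.read())

# Append each ChatML token as a USER_DEFINED piece so it is
# matched verbatim and receives its own ID at the end of the
# existing vocabulary.
for token in ["<|im_start|>", "<|im_end|>"]:
    piece = sp_pb2.ModelProto.SentencePiece()
    piece.piece = token
    piece.score = 0.0
    piece.type = sp_pb2.ModelProto.SentencePiece.USER_DEFINED
    proto.pieces.append(piece)

# Write out the modified model and load it as usual.
with open("tokenizer_chatml.model", "wb") as f:
    f.write(proto.SerializeToString())

sp_model = spm.SentencePieceProcessor(model_file="tokenizer_chatml.model")
print(sp_model.encode("<|im_start|>user", out_type=str))

Is appending to proto.pieces like this safe, or do the scores need adjusting for a unigram model?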
