How to modify a trained SentencePiece tokenizer to stop splitting the chatml tokens?


We are using a pre-trained SentencePiece tokenizer (Google's SentencePiece library, not the Hugging Face wrapper), and we would like to preserve the ChatML tokens:

<|im_start|> and <|im_end|>

We don't want these tokens to be split, and we want the tokenizer to map each of them to its own single token ID.

from sentencepiece import SentencePieceProcessor

sp_model = SentencePieceProcessor(model_file=...)

Using the Python implementation, how should we modify the model to do this? Thanks!
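One direction we're considering (not sure whether it's the right approach) is to edit the serialized model proto directly and append the ChatML tokens as user-defined pieces, since user-defined symbols are extracted as whole pieces before the regular segmentation. Below is a minimal sketch assuming the sentencepiece_model_pb2 module bundled with the sentencepiece pip package; the file names are placeholders for our actual model paths.

import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

# Load the serialized model proto (path is a placeholder).
proto = sp_pb2.ModelProto()
with open("tokenizer.model", "rb") as f:
    proto.ParseFromString(f.read())

# Append each ChatML token as a USER_DEFINED piece so it is
# matched verbatim and receives its own ID at the end of the
# existing vocabulary.
for token in ["<|im_start|>", "<|im_end|>"]:
    piece = sp_pb2.ModelProto.SentencePiece()
    piece.piece = token
    piece.score = 0.0
    piece.type = sp_pb2.ModelProto.SentencePiece.USER_DEFINED
    proto.pieces.append(piece)

# Write out the modified model and load it as usual.
with open("tokenizer_chatml.model", "wb") as f:
    f.write(proto.SerializeToString())

sp_model = spm.SentencePieceProcessor(model_file="tokenizer_chatml.model")
print(sp_model.encode("<|im_start|>user", out_type=str))

Is appending to proto.pieces like this safe, or do the scores need adjusting for a unigram model?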
