When decoding a stream of tokens during inference, how do I avoid emitting partial tokens?


I want to implement an LLM inference server that hosts a collection of Hugging Face models and supports streaming inference, returning output one token at a time. A single token may not decode to readable text on its own (for example, it may cover only part of a multi-byte character). How can I make the server return output only once the accumulated tokens decode to readable text?
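For reference, this is the kind of buffering I have in mind: a minimal sketch, assuming a Hugging Face AutoTokenizer ("gpt2" is only a placeholder model, and stream_decode is a hypothetical helper). It buffers incoming token ids and only emits text once the accumulated ids decode without a trailing Unicode replacement character ("\ufffd"), which the tokenizer inserts when the byte sequence is still incomplete.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model


def stream_decode(token_ids):
    """Yield readable text chunks from a stream of token ids."""
    buffer = []      # token ids received so far
    emitted = 0      # length of text already yielded

    for token_id in token_ids:
        buffer.append(token_id)
        text = tokenizer.decode(buffer, skip_special_tokens=True)
        # A trailing replacement character means the last token covers only
        # part of a multi-byte character, so wait for more tokens.
        if text.endswith("\ufffd"):
            continue
        new_text = text[emitted:]
        if new_text:
            emitted = len(text)
            yield new_text


# Example: pretend these ids arrive one at a time from the model.
ids = tokenizer.encode("Hello, streaming world!")
for chunk in stream_decode(ids):
    print(repr(chunk))
```

I am aware that transformers also ships TextStreamer and TextIteratorStreamer, which apply a similar hold-back idea during generation, but I am not sure whether re-decoding the whole buffer like this is the right approach for a server.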

