When decoding a stream of tokens during inference, how do I avoid emitting partial tokens?


I want to implement an LLM inference server that hosts a collection of Hugging Face models and supports streaming inference, returning output one token at a time. A single token may not decode to readable text on its own (for example, it may cover only part of a multi-byte character). How can I make the server return output only once the accumulated tokens decode to readable text?
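For reference, this is the kind of buffering I have in mind: a minimal sketch, assuming a Hugging Face AutoTokenizer ("gpt2" is only a placeholder model, and stream_decode is a hypothetical helper). It buffers incoming token ids and only emits text once the accumulated ids decode without a trailing Unicode replacement character ("\ufffd"), which the tokenizer inserts when the byte sequence is still incomplete.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model


def stream_decode(token_ids):
    """Yield readable text chunks from a stream of token ids."""
    buffer = []      # token ids received so far
    emitted = 0      # length of text already yielded

    for token_id in token_ids:
        buffer.append(token_id)
        text = tokenizer.decode(buffer, skip_special_tokens=True)
        # A trailing replacement character means the last token covers only
        # part of a multi-byte character, so wait for more tokens.
        if text.endswith("\ufffd"):
            continue
        new_text = text[emitted:]
        if new_text:
            emitted = len(text)
            yield new_text


# Example: pretend these ids arrive one at a time from the model.
ids = tokenizer.encode("Hello, streaming world!")
for chunk in stream_decode(ids):
    print(repr(chunk))
```

I am aware that transformers also ships TextStreamer and TextIteratorStreamer, which apply a similar hold-back idea during generation, but I am not sure whether re-decoding the whole buffer like this is the right approach for a server.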

