How to deploy a GPT-like model to Triton Inference Server?

The tutorials on deploying GPT-like models for inference with Triton look like this:

  1. Preprocess the data as input_ids = tokenizer(text)["input_ids"]
  2. Feed the input to the Triton inference server and get outputs_ids = model(input_ids)
  3. Postprocess the outputs:

         outputs = outputs_ids.logits.argmax(axis=2)
         outputs = tokenizer.decode(outputs)
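
Putting those steps together, a minimal local sketch of that tutorial pipeline (the checkpoint name and input text are placeholders, not my actual finetuned setup):

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # placeholder checkpoint
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    text = "some prompt"                                # placeholder input

    # 1. Preprocess
    input_ids = tokenizer(text, return_tensors="pt")["input_ids"]

    # 2. Single forward pass (what the server runs in the tutorials)
    with torch.no_grad():
        outputs_ids = model(input_ids)

    # 3. Postprocess: greedy argmax over the vocabulary at each position
    outputs = outputs_ids.logits.argmax(dim=-1)
    outputs = tokenizer.decode(outputs[0])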

I use a finetuned GPT-2 model, and this method gives an incorrect result. The correct result is obtained with the model.generate(input_ids) method, i.e. autoregressive decoding rather than a single forward pass.
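
Locally, the behaviour I want corresponds roughly to the following (max_new_tokens is an arbitrary choice of mine):

    # generate() runs the forward pass in a loop, feeding each new token
    # back into the model, instead of one pass followed by argmax.
    generated_ids = model.generate(input_ids, max_new_tokens=50)
    outputs = tokenizer.decode(generated_ids[0], skip_special_tokens=True)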

Is there a way to deploy a finetuned GPT-like Hugging Face model to Triton so that inference runs model.generate(input_ids) instead of model(input_ids)?
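
For instance, is Triton's Python backend the intended route? A rough sketch of what I have in mind (the tensor names, checkpoint path, and generation settings below are my own assumptions, not from any tutorial):

    # model.py for a Triton Python-backend model -- untested sketch
    import numpy as np
    import torch
    import triton_python_backend_utils as pb_utils
    from transformers import GPT2LMHeadModel

    class TritonPythonModel:
        def initialize(self, args):
            # Load the finetuned checkpoint once per model instance
            # (the path is a placeholder).
            self.model = GPT2LMHeadModel.from_pretrained("/models/finetuned-gpt2")
            self.model.eval()

        def execute(self, requests):
            responses = []
            for request in requests:
                # "INPUT_IDS" / "OUTPUT_IDS" must match config.pbtxt (assumed names)
                input_ids = pb_utils.get_input_tensor_by_name(
                    request, "INPUT_IDS"
                ).as_numpy()
                with torch.no_grad():
                    generated = self.model.generate(
                        torch.from_numpy(input_ids), max_new_tokens=50
                    )
                out = pb_utils.Tensor("OUTPUT_IDS", generated.numpy().astype(np.int64))
                responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
            return responses

The tokenizer would then stay on the client (or in an ensemble step), with only the generation loop running inside Triton.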
