The tutorials on deploying GPT-like models for inference on Triton look like this:

- Preprocess the data:
  ```python
  input_ids = tokenizer(text)["input_ids"]
  ```
- Feed the input to the Triton inference server and get:
  ```python
  outputs_ids = model(input_ids)
  ```
- Postprocess the outputs:
  ```python
  outputs = outputs_ids.logits.argmax(axis=2)
  outputs = tokenizer.decode(outputs)
  ```
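To make the problem concrete, here is a minimal local sketch of that pipeline without Triton, assuming the stock `gpt2` checkpoint (the checkpoint name and prompt are illustrative, not from the tutorials). A single forward pass returns per-position logits, and the argmax picks the most likely next token at each input position independently:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; substitute the finetuned model directory.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("Hello, my name is", return_tensors="pt")["input_ids"]

with torch.no_grad():
    # One forward pass -> logits of shape (batch, sequence_length, vocab_size).
    outputs = model(input_ids)

# Picks the single most likely next token at each *input* position; the
# predictions are never fed back in, so this is not text generation.
token_ids = outputs.logits.argmax(dim=-1)
print(tokenizer.decode(token_ids[0]))
```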
I use a finetuned GPT2 model, and this method gives incorrect results. The correct result is obtained with the `model.generate(input_ids)` method.
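For comparison, a sketch of the call that does give the expected continuation, reusing `model`, `tokenizer`, and `input_ids` from the snippet above (`max_new_tokens=50` is an arbitrary illustrative value):

```python
# generate() runs the forward pass in a loop, appending one token per step
# conditioned on everything produced so far -- autoregressive decoding.
with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```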
Is there a way to deploy a finetuned GPT-like HuggingFace model to Triton so that inference runs `model.generate(input_ids)` rather than `model(input_ids)`?
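To show what I have in mind, here is a rough sketch of how this might look with Triton's Python backend, where `execute()` could call `generate()` directly. The tensor names (`INPUT_IDS`, `OUTPUT_IDS`), the checkpoint path, and the generation settings are all assumptions on my part, and the matching `config.pbtxt` would have to declare the same input and output:

```python
# model.py for a hypothetical Python-backend entry in a Triton model repository.
import numpy as np
import torch
import triton_python_backend_utils as pb_utils
from transformers import AutoModelForCausalLM


class TritonPythonModel:
    def initialize(self, args):
        # Load the finetuned checkpoint once, when Triton loads the model.
        self.model = AutoModelForCausalLM.from_pretrained(
            "/models/my-finetuned-gpt2"  # hypothetical path
        ).eval()

    def execute(self, requests):
        responses = []
        for request in requests:
            input_ids = pb_utils.get_input_tensor_by_name(
                request, "INPUT_IDS"
            ).as_numpy()

            # Run the full autoregressive decoding loop inside the server
            # instead of a single forward pass.
            with torch.no_grad():
                output_ids = self.model.generate(
                    torch.from_numpy(input_ids), max_new_tokens=50
                )

            out_tensor = pb_utils.Tensor(
                "OUTPUT_IDS", output_ids.numpy().astype(np.int64)
            )
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[out_tensor])
            )
        return responses
```

Tokenization and decoding would then stay on the client side as in the tutorial pipeline; only the decoding loop moves into the server.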