I have deployed a T5 TensorRT model on NVIDIA Triton Inference Server with the config.pbtxt below, but I am facing problems while running inference on the model using the Triton client.
According to the config.pbtxt, the TensorRT model expects four inputs, including the decoder IDs. But how can we send the decoder IDs as an input to the model? I think the decoder input is supposed to be generated from the model's output.
name: "tensorrt_model"
platform: "tensorrt_plan"
max_batch_size: 0
input [
{
name: "input_ids"
data_type: TYPE_INT32
dims: [ -1, -1 ]
},
{
name: "attention_mask"
data_type: TYPE_INT32
dims: [ -1, -1 ]
},
{
name: "decoder_input_ids"
data_type: TYPE_INT32
dims: [ -1, -1 ]
},
{
name: "decoder_attention_mask"
data_type: TYPE_INT32
dims: [ -1, -1 ]
}
]
output [
{
name: "last_hidden_state"
data_type: TYPE_FP32
dims: [ -1, -1, 768 ]
},
{
name: "input.151"
data_type: TYPE_FP32
dims: [ -1, -1, -1 ]
}
]
instance_group [
{
count: 1
kind: KIND_GPU
}
]
You have several examples in the NVIDIA Triton Client repository. However, if your use case is too complex, you might need the Python backend instead of the Torch one.
You initialize the client as follows:
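For example, a minimal sketch using the HTTP client, assuming the server is reachable on localhost at the default HTTP port 8000:

import tritonclient.http as httpclient

# Connect to the Triton server over HTTP (default port 8000).
triton_client = httpclient.InferenceServerClient(url="localhost:8000", verbose=False)

# Optional sanity checks before sending any requests.
assert triton_client.is_server_ready()
assert triton_client.is_model_ready("tensorrt_model")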
Assuming the client is already initialized, in Python you will need to create a function that generates the requests, such as the following:
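A sketch of such a request generator, using the input/output names and INT32 types from your config.pbtxt; how you tokenize and pad the arrays beforehand is up to you:

import tritonclient.http as httpclient

def request_generator(input_ids, attention_mask, decoder_input_ids, decoder_attention_mask):
    # Wrap each INT32 numpy array in an InferInput matching the names in config.pbtxt.
    inputs = []
    for name, array in [
        ("input_ids", input_ids),
        ("attention_mask", attention_mask),
        ("decoder_input_ids", decoder_input_ids),
        ("decoder_attention_mask", decoder_attention_mask),
    ]:
        infer_input = httpclient.InferInput(name, list(array.shape), "INT32")
        infer_input.set_data_from_numpy(array)
        inputs.append(infer_input)

    # Request the outputs declared in config.pbtxt.
    outputs = [
        httpclient.InferRequestedOutput("last_hidden_state"),
        httpclient.InferRequestedOutput("input.151"),
    ]
    return inputs, outputs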
Then, you can use this request_generator in your loop to run inferences:
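A minimal sketch of that loop, reusing the triton_client and the hypothetical request_generator from above; the dummy arrays of ones are only placeholders for whatever your tokenizer actually produces:

import numpy as np

# Placeholder batch: one (1 x 8) INT32 array per model input.
batches = [
    (
        np.ones((1, 8), dtype=np.int32),  # input_ids
        np.ones((1, 8), dtype=np.int32),  # attention_mask
        np.ones((1, 8), dtype=np.int32),  # decoder_input_ids
        np.ones((1, 8), dtype=np.int32),  # decoder_attention_mask
    )
]

for input_ids, attention_mask, decoder_input_ids, decoder_attention_mask in batches:
    inputs, outputs = request_generator(
        input_ids, attention_mask, decoder_input_ids, decoder_attention_mask
    )
    # Synchronous inference against the model named in config.pbtxt.
    result = triton_client.infer(
        model_name="tensorrt_model",
        inputs=inputs,
        outputs=outputs,
    )
    last_hidden_state = result.as_numpy("last_hidden_state")
    second_output = result.as_numpy("input.151")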
As I said, this is just a simple illustration of how you should do it; it misses a lot of the implementation details. You have lots of examples in the repo.