How to run inference for a T5 TensorRT model deployed on NVIDIA Triton?


I have deployed a T5 TensorRT model on NVIDIA Triton Inference Server with the config.pbtxt below, but I am facing problems when running inference through the Triton client.

According to the config.pbtxt, the TensorRT model takes four inputs, including the decoder IDs. But how can we send the decoder IDs as an input to the model? I think they have to be generated from the model's output.

name: "tensorrt_model"
platform: "tensorrt_plan"
max_batch_size: 0
input [
 {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [ -1, -1  ]
  },

{
    name: "attention_mask"
    data_type: TYPE_INT32
    dims: [-1, -1 ]
},

{
    name: "decoder_input_ids"
    data_type: TYPE_INT32
    dims: [ -1, -1]
},

{
   name: "decoder_attention_mask"
   data_type: TYPE_INT32
   dims: [ -1, -1 ]
}

]
output [
{
    name: "last_hidden_state"
    data_type: TYPE_FP32
    dims: [ -1, -1, 768 ]
  },

{
    name: "input.151"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1 ]
  }

]

instance_group [
    {
        count: 1
        kind: KIND_GPU
    }
]
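For reference, this config means the client has to send four INT32 tensors of shape [batch, sequence_length]. Below is a rough sketch of what those arrays could look like; the Hugging Face tokenizer, the "t5-base" checkpoint and the pad-token seeding of decoder_input_ids are assumptions for illustration, not something taken from the deployment.

import numpy as np
from transformers import T5Tokenizer  # assumption: Hugging Face tokenizer is available

tokenizer = T5Tokenizer.from_pretrained("t5-base")  # assumption: t5-base vocabulary

text = "translate English to German: How are you?"
enc = tokenizer(text, return_tensors="np")

# Encoder-side inputs, cast to int32 to match TYPE_INT32 in the config
input_ids = enc["input_ids"].astype(np.int32)            # shape [1, seq_len]
attention_mask = enc["attention_mask"].astype(np.int32)  # shape [1, seq_len]

# Decoder-side inputs: T5 starts decoding from the pad token, so the first step
# can be a [1, 1] tensor holding pad_token_id; later steps append generated tokens
decoder_input_ids = np.array([[tokenizer.pad_token_id]], dtype=np.int32)
decoder_attention_mask = np.ones_like(decoder_input_ids, dtype=np.int32)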

There is 1 answer

Answered by bert:

You have several examples in the NVIDIA Triton client repository. However, if your use case is too complex, you might need the Python backend instead.

You initialize the client as follows:

import tritonclient.http as httpclient

triton_url = None  # your Triton server URL, e.g. "localhost:8000"
triton_client = httpclient.InferenceServerClient(url=triton_url)
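Before sending anything, it can be worth checking that the server and the model are actually reachable; a small sketch using the client's readiness calls (the model name comes from the config.pbtxt above):

# Optional sanity checks; "tensorrt_model" is the name from config.pbtxt
assert triton_client.is_server_live(), "Triton server is not live"
assert triton_client.is_server_ready(), "Triton server is not ready"
assert triton_client.is_model_ready("tensorrt_model"), "model is not ready"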

With the client initialized, you need a function in Python that generates the requests, for example:

inputs_dtype = []  # list with inputs dtypes
inputs_name = []   # list with inputs name
outputs_name = []  # list with outputs name

def request_generator(data):
    # Wrap each numpy array in an InferInput carrying its name, shape and dtype
    inputs = [
        httpclient.InferInput(input_name, data[i].shape, inputs_dtype[i])
        for i, input_name in enumerate(inputs_name)
    ]

    # Attach the actual tensor data to each input
    for i, _input in enumerate(inputs):
        _input.set_data_from_numpy(data[i])

    # Declare which output tensors the server should return
    outputs = [
        httpclient.InferRequestedOutput(output_name)
        for output_name in outputs_name
    ]

    yield inputs, outputs
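For the config in the question, those placeholders could be filled in like this (the names and dtypes are taken from config.pbtxt; how you build the numpy arrays is up to your preprocessing):

# Names and dtypes taken from the config.pbtxt in the question;
# TYPE_INT32 in the config maps to the "INT32" dtype string on the client side
inputs_name = ["input_ids", "attention_mask",
               "decoder_input_ids", "decoder_attention_mask"]
inputs_dtype = ["INT32"] * 4
outputs_name = ["last_hidden_state", "input.151"]

# data passed to request_generator must be a list of numpy int32 arrays,
# one per input, in the same order as inputs_name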

Then, you can use this request_generator in your loop to run inferences:

# assuming your data comes in a variable named data
# assuming your triton client is triton_client

from tritonclient.utils import InferenceServerException  # needed for the except clause below

data = preprocess(data)  # your preprocess function

model_name = None  # your model name
model_version = None  # your model version
       
responses = []
sent_count = 0
         
try:
    for inputs, outputs in request_generator(data):
        sent_count += 1
         
        responses.append(
            triton_client.infer(model_name,
                                inputs,
                                request_id=str(sent_count),
                                model_version=model_version,
                                outputs=outputs))
         
except InferenceServerException as exception:
    print("Caught an exception:", exception)

As said, this is just a simple illustration of how to do it and it leaves out many implementation details; you will find plenty of complete examples in the client repository.