Why does TensorRT enqueueV2 take longer when using more isolated threads in C++?

  • OS : Windows 10
  • CUDA : version 11.5
  • TensorRT : 8.6.1.6
  • OpenCV : 4.8.0 built with CUDA
  • Driver : version 545.84 (most recent)

My app streams from multiple cameras. Each camera is managed by a single CPU thread, and nothing is shared between these threads. Each thread also loads and uses an object detection model deployed with TensorRT. Each thread has its own model instance; models are not shared either.

With 2 threads, the TensorRT enqueueV2 function, which runs inference on the model, takes about 1 millisecond on average, which seems promising. I use NVIDIA Nsight Systems to profile the program.

The following picture shows the Nsight Systems profiling report for this run.

[Image: Nsight Systems profiling report, 2 threads]

As shown in the image above, enqueueV2 took 238 microseconds in one of the threads, which seems good for two threads.

For the second test I launched 20 threads. This time the TensorRT enqueue call took nearly 20 milliseconds to return, and this value keeps growing as I increase the number of threads. The Nsight Systems report is shown in the following image:

[Image: Nsight Systems profiling report, 20 threads]

As the highlighted region in the report shows, enqueueV2 took 19 milliseconds to return, which lowers the final FPS of each thread.

I looked closely at the reports and saw some Blocked States that I am not familiar with. I suspect they are the bottleneck of the program, but I have no idea how to reduce or eliminate them.

Here is the code snippet I use for inference:

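        // Start timing on the CPU side.
        auto t1 = std::chrono::high_resolution_clock::now();
        // Enqueue the inference on this thread's own CUDA stream.
        status = mContext->enqueueV2(&mBindingDataHolder[0], *inferenceCudaStream, nullptr);
        // Record an event behind the inference and block until the GPU reaches it.
        CUDA_CHECK(cudaEventRecord(syncEvent, *inferenceCudaStream));
        CUDA_CHECK(cudaEventSynchronize(syncEvent));

        // The interval covers the enqueue call plus the wait for the GPU, in whole milliseconds.
        auto t2 = std::chrono::high_resolution_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count();
        spdlog::info("enqueueV2 time: {} ms", duration);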
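The logged value above is rounded down to whole milliseconds and lumps the enqueue call together with the wait on the event. A minimal sketch of how the two phases might be timed separately with sub-millisecond resolution follows. `timeMs` is a hypothetical helper name, not part of TensorRT or CUDA, and the usage shown in the comments assumes the same variables as the snippet above.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper (not a TensorRT or CUDA API): runs `fn` once and returns
// the elapsed wall-clock time in fractional milliseconds.
template <typename Fn>
double timeMs(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Fn>(fn)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Possible use with the snippet above (sketch only, same variables assumed):
//
//   double enqueueMs = timeMs([&] {
//       status = mContext->enqueueV2(&mBindingDataHolder[0], *inferenceCudaStream, nullptr);
//   });
//   double waitMs = timeMs([&] {
//       CUDA_CHECK(cudaEventRecord(syncEvent, *inferenceCudaStream));
//       CUDA_CHECK(cudaEventSynchronize(syncEvent));
//   });
//   spdlog::info("enqueue: {:.3f} ms, wait for GPU: {:.3f} ms", enqueueMs, waitMs);
```

Logging the two numbers separately would show whether the extra time with 20 threads is spent inside the enqueueV2 call itself or while waiting for the GPU to finish.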

Since the resources and threads are isolated and share nothing, I would expect the enqueue call in each thread to take around 1 millisecond, as in the 2-thread case, but it is about 20 times slower.
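For comparison, here is a small back-of-the-envelope model of the opposite assumption. It is an illustration, not a diagnosis: it supposes that all threads submit to a single GPU at the same moment and that the device runs the inferences strictly one after another with no overlap. The function names are hypothetical and nothing here is measured from the application.

```cpp
// Hypothetical model: `numThreads` threads each submit one inference at the
// same moment to a device that executes them strictly one at a time.

// Latency (ms) seen by the request that ends up last in the queue.
double worstCaseLatencyMs(int numThreads, double perInferenceMs) {
    return numThreads * perInferenceMs;
}

// Mean latency (ms) over all requests: the k-th request waits k * perInferenceMs,
// so the mean over k = 1..N is perInferenceMs * (N + 1) / 2.
double averageLatencyMs(int numThreads, double perInferenceMs) {
    return perInferenceMs * (numThreads + 1) / 2.0;
}
```

With 20 threads and about 1 ms per inference, this model gives a worst case of 20 ms and a mean of 10.5 ms. The worst case is in the same range as the 19 ms I observed, so the CPU threads being isolated may not be enough if the GPU itself is the shared resource.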

I would appreciate it if anyone could tell me whether these Blocked States are the root cause of the problem and, if they are, how I can resolve them.
