I know how to time the execution of a single CUDA kernel using CUDA events, which is great for simple cases. But in the real world, an algorithm is often made up of a series of kernels (the cub::DeviceRadixSort algorithms, for example, launch many kernels to get the job done).
If you're running your algorithm on a system with a lot of other streams and kernels in flight, the gaps between your individual kernel launches can vary widely depending on what other work gets scheduled in between launches on your stream. If I'm trying to make my algorithm faster, I don't care so much about how long it sits waiting for resources. I care about the time it spends actually executing.
So the question is: is there some way to do something like the event API, where I insert a marker in the stream before the first kernel launches, read it back after the last kernel finishes, and have it report the amount of time actually spent executing on the stream, rather than the total end-to-end wall-clock time? Maybe something in CUPTI can do this?
You can use Nsight Systems or Nsight Compute (https://developer.nvidia.com/tools-overview).
Nsight Systems shows a timeline for each stream, so you can see when each of your kernels actually ran and where the gaps between them are. Nsight Compute profiles individual CUDA kernels in detail. I would lean towards Nsight Compute if your goal is to optimize the kernels themselves, because it lets you inspect a range of GPU performance metrics and gives hints about how to optimize each kernel.
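Once you have per-kernel start and end timestamps for your stream (for example from a timeline exported by the profiler, or from kernel activity records collected through CUPTI), the "time actually executing" is just the busy time over those intervals, with the gaps left out. Here is a minimal host-side sketch of that calculation. The KernelInterval struct and both function names are made up for illustration and are not part of any NVIDIA API, and I'm assuming the timestamps are in nanoseconds on a common clock.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record: one entry per kernel launch on the stream of interest.
// Timestamps are assumed to be nanoseconds on a common clock.
struct KernelInterval {
    std::uint64_t start_ns;
    std::uint64_t end_ns;
};

// Sum of the individual kernel durations. Gaps between kernels are excluded,
// but kernels that overlap in time are counted once each.
std::uint64_t sum_of_durations(const std::vector<KernelInterval>& kernels) {
    std::uint64_t total = 0;
    for (const auto& k : kernels) {
        assert(k.end_ns >= k.start_ns);  // precondition: well-formed interval
        total += k.end_ns - k.start_ns;
    }
    return total;
}

// Busy time: length of the union of all intervals. This equals the sum of
// durations when kernels never overlap (the usual case on a single stream),
// and avoids double counting when they do.
std::uint64_t busy_time(std::vector<KernelInterval> kernels) {
    if (kernels.empty()) return 0;
    std::sort(kernels.begin(), kernels.end(),
              [](const KernelInterval& a, const KernelInterval& b) {
                  return a.start_ns < b.start_ns;
              });
    std::uint64_t total = 0;
    std::uint64_t cur_start = kernels[0].start_ns;
    std::uint64_t cur_end = kernels[0].end_ns;
    for (std::size_t i = 1; i < kernels.size(); ++i) {
        if (kernels[i].start_ns <= cur_end) {
            // Overlaps or touches the current run: extend it.
            cur_end = std::max(cur_end, kernels[i].end_ns);
        } else {
            // Gap found: close the current run and start a new one.
            total += cur_end - cur_start;
            cur_start = kernels[i].start_ns;
            cur_end = kernels[i].end_ns;
        }
    }
    total += cur_end - cur_start;
    return total;
}
```

For three kernels recorded at {100, 150}, {400, 460} and {900, 930}, both functions return 140 ns, whereas the end-to-end span from the first start to the last end is 830 ns. The difference is the time your stream spent waiting rather than executing.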