I want to measure the performance (IPC, throughput, duration) of an application that contains several kernels and CUDA API calls. How can I measure the 'real duration' of such an application using Nsight Compute or Nsight Systems?
As I understand it, Nsight Compute is not meant for profiling a whole application but for profiling individual kernels.
The NVIDIA documentation also says that Nsight Compute serializes all kernels so that each kernel can be profiled accurately.
However, I saw that NCU supports 'range replay' and 'application range replay' modes, which do not serialize kernels and can collect metrics for concurrent kernels. Do these modes also profile CUDA API activity such as cudaMemcpy()?
I just want to measure the overall performance and utilization of an application.
By overall performance I do not mean the performance of a single kernel or an average over kernels, such as gpc__cycles_elapsed.max. That value only reflects the duration of a single, serialized kernel.
I want to get the real duration when the application also has asynchronous kernels or CUDA API calls.
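
To make the distinction concrete, here is a minimal Python sketch of what I mean. The timestamps are made up for illustration (they are not measured data), and the function names are my own, not part of any NVIDIA tool. It assumes each kernel or API call is available as a (start, end) pair in microseconds on a common timeline, and shows that summing per-activity durations (what I would get from serialized kernels) differs from the time actually spent once overlap is accounted for.

```python
def summed_duration(intervals):
    """Sum of individual durations, as if every activity ran serialized."""
    return sum(end - start for start, end in intervals)


def busy_time(intervals):
    """Length of the union of all intervals, so overlap is counted once."""
    busy = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # A gap: close the previous merged interval, open a new one.
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlap with the current merged interval: extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        busy += cur_end - cur_start
    return busy


def end_to_end(intervals):
    """Span from the first start to the last end, idle gaps included."""
    return max(e for _, e in intervals) - min(s for s, _ in intervals)


# Hypothetical (start_us, end_us) pairs: two overlapping kernels,
# an async copy that overlaps the second kernel, then a late kernel.
activities = [(0, 40), (10, 50), (45, 70), (80, 100)]

print(summed_duration(activities))  # serialized-style total
print(busy_time(activities))        # time with at least one activity running
print(end_to_end(activities))       # first start to last end
```

The 'real duration' I am asking about corresponds to the last two numbers, not the first one.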