Imagine that I have two kernels like this:
__global__ void kernel_2(...) {
    // do something
}
__global__ void kernel_1(...) {
    kernel_2<<<n, m>>>(...);
    __syncthreads();
}
int main() {
    kernel_1<<<x, y>>>(...);
    cudaDeviceSynchronize();
}
Everything works without problems, but I'm trying to check the running time of these two kernels in Nsight Systems.
What I see there is only the running time of kernel_1, followed by an empty gap. I assume the gap is the running time of kernel_2, but it isn't shown.
Is there any way to make it appear there?
It is possible to use NVTX to manually mark the beginning and end of kernels launched from the device. This will not be as accurate as profiling a stand-alone kernel, but it can give you approximate running times.
The following code uses cuda::latch in managed memory to communicate the beginning and end of the kernel to the host. The host then inserts a custom range into the nsys profile. I used CUDA 12.3. Profiling this with Nsight Systems 2023.3.1 shows the custom marker for kernel_2.
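As a rough illustration of the latch-based approach, here is a minimal sketch; it is not necessarily the exact code that was profiled. It assumes a Linux system and a GPU that supports concurrent managed access (so the host can wait on a latch while a kernel is running), compilation with relocatable device code, and a single thread of kernel_1 launching kernel_2. The names SystemLatch, begin_latch, end_latch and the launch sizes are made up for the example.

```cuda
// Assumed build line: nvcc -std=c++17 -arch=sm_70 -rdc=true sketch.cu -o sketch
// (depending on the setup, -ldl may also be needed for NVTX)
#include <cstdio>
#include <new>
#include <cuda/latch>
#include <nvtx3/nvToolsExt.h>

// System scope so that both host and device threads may use the latch.
using SystemLatch = cuda::latch<cuda::thread_scope_system>;

__global__ void kernel_2(SystemLatch* end_latch) {
    // do something

    // Wait until every thread of the block is done, then let one thread
    // per block report completion to the host.
    __syncthreads();
    if (threadIdx.x == 0) {
        end_latch->count_down();
    }
}

__global__ void kernel_1(SystemLatch* begin_latch, SystemLatch* end_latch,
                         int blocks, int threads) {
    // Only one thread launches the child kernel in this sketch.
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        // Signal "kernel_2 is about to start". This marks the launch,
        // not the actual start of execution, so the range is approximate.
        begin_latch->count_down();
        kernel_2<<<blocks, threads>>>(end_latch);
    }
}

int main() {
    // The host may only touch managed memory during kernel execution
    // if the device reports concurrent managed access.
    int concurrent = 0;
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, 0);
    if (!concurrent) {
        std::printf("concurrent managed access is not supported\n");
        return 1;
    }

    const int blocks = 4;     // hypothetical grid size for kernel_2
    const int threads = 128;  // hypothetical block size for kernel_2

    // Construct both latches in managed memory with placement new.
    SystemLatch* storage = nullptr;
    cudaMallocManaged(&storage, 2 * sizeof(SystemLatch));
    SystemLatch* begin_latch = new (&storage[0]) SystemLatch(1);
    SystemLatch* end_latch = new (&storage[1]) SystemLatch(blocks);

    kernel_1<<<1, 32>>>(begin_latch, end_latch, blocks, threads);

    // The host brackets kernel_2 with a custom NVTX range.
    begin_latch->wait();
    nvtxRangePushA("kernel_2 (approx.)");
    end_latch->wait();
    nvtxRangePop();

    cudaDeviceSynchronize();
    cudaFree(storage);
    return 0;
}
```

The end latch expects one count_down per block of kernel_2, so its initial value has to match the grid size used for the device-side launch. The measured range includes launch latency and the host's polling delay, which is why the numbers are only approximate.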