I have a CUDA program with multiple kernels that run in series (in the same stream, the default one). I want to do a performance analysis of the program as a whole, specifically of the GPU portion. I'm doing the analysis using metrics such as achieved_occupancy, inst_per_warp, gld_efficiency, and so on, with the nvprof tool.
But the profiler gives metric values separately for each kernel, while I want to compute them across all the kernels to see the total usage of the GPU by the program. Should I take the average, the largest value, or the total of all kernels for each metric?
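For reference, the kind of invocation I'm using looks something like this (with my_app standing in for the actual executable):

```
nvprof --metrics achieved_occupancy,inst_per_warp,gld_efficiency ./my_app
```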
One possible approach would be to use a weighted average method.
Suppose we had 3 non-overlapping kernels in our timeline. Let's say kernel 1 runs for 10 milliseconds, kernel 2 runs for 20 milliseconds, and kernel 3 runs for 30 milliseconds. Collectively, all 3 kernels occupy 60 milliseconds in our overall application timeline.
Let's also suppose that the profiler reports the gld_efficiency metric as follows:
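(The numbers below are made up purely to illustrate the arithmetic; substitute whatever values nvprof reports for your kernels.)

```
kernel   duration   gld_efficiency
1        10 ms      90%
2        20 ms      80%
3        30 ms      70%
```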
You could compute the weighted average as follows:
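With the made-up numbers above, the duration-weighted average would be:

```
weighted gld_efficiency = (10*90% + 20*80% + 30*70%) / (10 + 20 + 30)
                        = (900 + 1600 + 2100) / 60
                        = ~76.7%
```

That is, each kernel's metric contributes in proportion to the fraction of the GPU timeline it occupies.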
I'm sure there are other approaches that make sense as well. For example, a better approach might be to have the profiler report the total number of global load transactions for each kernel, and do your weighting based on that rather than on kernel duration:
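As a sketch of that idea (gld_transactions is the nvprof metric for global load transactions; the counts below are again hypothetical):

```
nvprof --metrics gld_efficiency,gld_transactions ./my_app

kernel   gld_transactions   gld_efficiency
1        1,000,000          90%
2        4,000,000          80%
3        5,000,000          70%

weighted gld_efficiency = (1e6*90% + 4e6*80% + 5e6*70%) / (1e6 + 4e6 + 5e6)
                        = 76%
```

This weights each kernel by how much global load traffic it actually generates, rather than by how long it runs, which may better reflect what the aggregate efficiency number is meant to capture.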