I have a CUDA program with multiple kernels that run in series (in the same stream, the default one). I want to do a performance analysis of the program as a whole, specifically of the GPU portion. I'm doing the analysis using metrics such as achieved_occupancy, inst_per_warp, gld_efficiency, and so on, with the nvprof tool.
But the profiler gives metric values separately for each kernel, while I want to compute them across all the kernels to see the total usage of the GPU by the program. Should I take the average, the largest value, or the total of all kernels for each metric?
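For reference, the kind of invocation I'm using looks something like this (with my_app standing in for the actual executable):

```
nvprof --metrics achieved_occupancy,inst_per_warp,gld_efficiency ./my_app
```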
One possible approach would be to use a weighted average method.
Suppose we had 3 non-overlapping kernels in our timeline. Let's say kernel 1 runs for 10 milliseconds, kernel 2 runs for 20 milliseconds, and kernel 3 runs for 30 milliseconds. Collectively, all 3 kernels occupy 60 milliseconds in our overall application timeline.
Let's also suppose that the profiler reports the gld_efficiency metric as follows:
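(The numbers below are made up purely to illustrate the arithmetic; substitute whatever values nvprof reports for your kernels.)

```
kernel   duration   gld_efficiency
1        10 ms      90%
2        20 ms      80%
3        30 ms      70%
```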
You could compute the weighted average as follows:
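With the made-up numbers above, the duration-weighted average would be:

```
weighted gld_efficiency = (10*90% + 20*80% + 30*70%) / (10 + 20 + 30)
                        = (900 + 1600 + 2100) / 60
                        = ~76.7%
```

That is, each kernel's metric contributes in proportion to the fraction of the GPU timeline it occupies.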
I'm sure there are other approaches that make sense as well. For example, a better approach might be to have the profiler report the total number of global load transactions for each kernel, and do your weighting based on that rather than on kernel duration:
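As a sketch of that idea (gld_transactions is the nvprof metric for global load transactions; the counts below are again hypothetical):

```
nvprof --metrics gld_efficiency,gld_transactions ./my_app

kernel   gld_transactions   gld_efficiency
1        1,000,000          90%
2        4,000,000          80%
3        5,000,000          70%

weighted gld_efficiency = (1e6*90% + 4e6*80% + 5e6*70%) / (1e6 + 4e6 + 5e6)
                        = 76%
```

This weights each kernel by how much global load traffic it actually generates, rather than by how long it runs, which may better reflect what the aggregate efficiency number is meant to capture.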