How to get the average execution time of a CUDA kernel using Nsight Systems or Nsight Compute


Suppose I have a simple CLI test app named "Foo". This app executes a kernel "Bar" 100 times in a loop. How can I obtain the average kernel execution time for Bar using Nsight Systems or Nsight Compute, with either the GUI or the CLI version of these tools?

The Nvidia Visual Profiler app provides this information in the Properties dialog, for each kernel, as "Duration (kernel)" and Invocations.

I would like to obtain the same information with Nsight Systems or Nsight Compute, because Visual Profiler is to be deprecated.

Following the example in this post, I ran:

nv-nsight-cu-cli -k Bar Foo

I get 100 printouts, one for each kernel execution. I want just the summary information for kernel Bar.


There are 2 answers

Anis Ladram On BEST ANSWER

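You can achieve this with the Nsight Compute CLI (ncu; older releases named it nv-nsight-cu-cli) using the option --print-summary per-gpu. It reports the minimum, maximum, and average of each metric across all launches of the kernel, including Duration, which is the execution time. Example below: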

$ ncu -k matrixMul --print-summary per-gpu ./test | grep -C8 Duration
      ----------------------- ------------- ---------- ---------- ----------
      Metric Name               Metric Unit    Minimum    Maximum    Average
      ----------------------- ------------- ---------- ---------- ----------
      DRAM Frequency          cycle/nsecond       6.72       6.90       6.79
      SM Frequency            cycle/nsecond       1.48       1.51       1.49
      Elapsed Cycles                  cycle 166,647.00 168,469.00 167,522.43
      Memory Throughput                   %      73.43      74.10      73.76
      DRAM Throughput                     %       2.50       2.57       2.53
      Duration                      usecond     111.20     112.90     112.18
      L1/TEX Cache Throughput             %      84.50      85.35      84.99
      L2 Cache Throughput                 %      10.40      10.64      10.54
      SM Active Cycles                cycle 144,432.91 145,882.70 145,043.22
      Compute (SM) Throughput             %      73.43      74.10      73.76
      ----------------------- ------------- ---------- ---------- ----------

      Section: Launch Statistics
      -------------------------------- --------------- ---------- ---------- ----------
Zois Tasoulas On

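With Nsight Systems, first collect a report (for example with nsys profile ./Foo), then summarize the kernel execution times with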

nsys stats -r cuda_kern_exec_sum <nsys-rep report>

See also the :base and :mangled options for the report, which affect how kernel names are shown.

For more information on the report output you can use

nsys stats --help-reports=cuda_kern_exec_sum
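
If you want to check the numbers yourself or post-process per-launch timings in a script, the summary boils down to grouping durations by kernel name and taking the count, average, minimum, and maximum. Below is a minimal Python sketch of that aggregation. The `summarize` function, the `(kernel_name, duration_ns)` input layout, and the sample values are assumptions made for illustration; they are not the actual nsys or ncu export schema, so adapt the parsing to whatever format you export.

```python
from collections import defaultdict


def summarize(launches):
    """Group (kernel_name, duration_ns) pairs and compute per-kernel stats.

    The input layout is hypothetical; it does not mirror any specific
    nsys or ncu export format.
    """
    by_kernel = defaultdict(list)
    for name, duration_ns in launches:
        by_kernel[name].append(duration_ns)

    return {
        name: {
            "instances": len(durations),
            "avg_ns": sum(durations) / len(durations),
            "min_ns": min(durations),
            "max_ns": max(durations),
        }
        for name, durations in by_kernel.items()
    }


# Made-up sample data: three launches of kernel "Bar", durations in ns.
sample = [("Bar", 111_200), ("Bar", 112_900), ("Bar", 112_200)]
summary = summarize(sample)
print(summary["Bar"])
```

For a real run you would replace `sample` with durations parsed from your own profiler export.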