Interpreting PGI_ACC_TIME output

938 views Asked by At

I have some OpenACC-accelerated C++ code that I've compiled using the PGI compiler. Things seem to be working, so now I want to play efficiency whack-a-mole with profiling information.

I generate some timing info by setting:

export PGI_ACC_TIME=1

And then running the program.

The following output results:

-bash-4.2$ ./a.out 
libcupti.so not found

Accelerator Kernel Timing data
    PGI_ACC_SYNCHRONOUS was set, disabling async() clauses
/home/myuser/myprogram.cpp
  _MyProgram  NVIDIA  devicenum=1
    time(us): 97,667
    75: data region reached 2 times
        75: data copyin transfers: 3
             device time(us): total=101 max=82 min=9 avg=33
    76: compute region reached 1000 times
        76: kernel launched 1000 times
            grid: [1938]  block: [128]
            elapsed time(us): total=680,216 max=1,043 min=654 avg=680
    95: compute region reached 1000 times
        95: kernel launched 1000 times
            grid: [1938]  block: [128]
            elapsed time(us): total=487,365 max=801 min=476 avg=487
    110: data region reached 2000 times
        110: data copyin transfers: 1000
             device time(us): total=6,783 max=140 min=3 avg=6
        125: data copyout transfers: 1000
             device time(us): total=7,445 max=190 min=6 avg=7

real    0m3.864s
user    0m3.499s
sys     0m0.348s

It raises some questions:

  1. I see time(us): 97,667 at the top. This seems like a total time, but, at the bottom, I see real 0m3.864s. Why is there such a difference?

  2. If time(us): 97,667 is the total, why is it so much smaller than values lower down, such as elapsed time(us): total=680,216?

  3. This kernel including the line (elapsed time(us): total=680,216 max=1,043 min=654 avg=680) was run 1000 times. Are max, min, and avg values based on per-run values of the kernel?

  4. Since the [grid] and [block] values may vary, are the elapsed total values still a good indicator of hotspots?

  5. For data regions (device time(us): total=6,783) is the measurement transfer time or the entire time spent dealing with the data (preparing to transfer, post-receipt operations)?

  6. The line numbering is weird. For instance, Line 76 in my program is clearly a for loop, Line 95 in is a close-brace, and Line 110 is a variable definition. Should line numbers be interpreted as "the loop most closely following the indicated line number", or in some other way?

  7. The kernel at 76 contains the kernel at 95. Are the times calculated for 76 inclusive of time spent in 95? If so, is there a convenient way to find the time spent in a kernel minus the times of all the subkernels?

(Some of these questions are a bit anal retentive, but I haven't found documentation for this, so I thought I'd be thorough.)

1

There are 1 answers

2
Mat Colgrove On

Part of the issue here is that the runtime can't find the CUDA Profiling library (libcupti.so), hence you're only seeing the PGI CPU side profiling not the device profiling. PGI ships libcupti.so library with the compilers (under $PGI/[linux86-64|linuxpower]/2017/cuda/[7.5|8.0]/lib64) but this is an optional install so you may not have it install on the system you're running. CUPTI also ships with the CUDA SDK, so if the system has CUDA install, you can try setting you're LD_LIBRARY_PATH there instead. On my system it's installed in "/opt/cuda-8.0/extras/CUPTI/lib64/".

The missing CUPTI library is why you're seeing the bad time, 97,667, for the file time. Also since you're missing CUPTI, the time you're seeing is being measured from the host. With CUPTI, in addition to the elapsed time, you'd see the device time for each kernel. The difference between the elapsed time and the device time is the launch overhead per kernel.

Are max, min, and avg values based on per-run values of the kernel?

Yes.

4.Since the [grid] and [block] values may vary, are the elapsed total values still a good indicator of hotspots?

I tend to first look at the avg time since there's typically more opportunities to tune these loops. If you are varying the amount of work per kernel iteration (i.e the grid size changes), then it might not be as useful, but a good starting point.

Now if you had a low average but many calls, then the elapsed time may be dominated by kernel launch overhead. In which case, I'd look to see if you can combine loops or push more work into each loop.

5.For data regions (device time(us): total=6,783) is the measurement transfer time or the entire time spent dealing with the data (preparing to transfer, post-receipt operations)?

Just the data transfer time. For the overhead, you would need to use PGPROF/NVPROF.

6.The line numbering is weird. For instance, Line 76 in my program is clearly a for loop, Line 95 in is a close-brace, and Line 110 is a variable definition. Should line numbers be interpreted as "the loop most closely following the indicated line number", or in some other way?

It's because the code's been optimized so the line number may be a bit off though it should correspond to the line numbers from compiler feedback messages (-Minfo=accel). So "the loop most closely..." option should be correct.