A simple question, really: I have a kernel which runs with the maximum possible number of blocks per Streaming Multiprocessor (SM), and I'd like to know how much more performance I could theoretically extract from it. Ideally, I'd like to know the percentage of SM cycles that are idle, i.e., cycles in which all warps are blocked on memory access.
I'm really just interested in finding that number. What I'm not looking for is
- General tips on increasing occupancy. I'm using all the occupancy I can get, and even if I manage to get more performance, it won't tell me how much more would theoretically be possible.
- How to compute the theoretical peak GFlops. My computations are not FP-centric; there's a lot of integer arithmetic and logic going on too.
The Nsight Visual Studio Edition 2.1 and 2.2 Issue Efficiency experiments provide the information you are requesting. These counters/metrics should be added to the Visual Profiler in a release after CUDA 5.0.
Nsight Visual Studio Edition
From Nsight Visual Studio Edition 2.2 User Guide | Analysis Tools | Other Analysis Reports | Profiler CUDA Settings | Issue Efficiency Section
KEPLER UPDATE: For compute capability 3.x, a multiprocessor has four warp schedulers. Each warp scheduler manages at most 16 warps, for a total of 64 warps per multiprocessor.
KEPLER UPDATE: The range is 0-64 warps per cycle.
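As a quick sanity check (this is not part of the Issue Efficiency experiment itself), you can derive the per-SM warp limit from the CUDA runtime. The sketch below just divides maxThreadsPerMultiProcessor by warpSize, which gives 64 on compute capability 3.x and 48 on 2.x:

```c
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   /* device 0 */

    /* Maximum resident warps per multiprocessor, e.g. 2048 / 32 = 64
       on compute capability 3.x, 1536 / 32 = 48 on 2.x. */
    int maxWarpsPerSM = prop.maxThreadsPerMultiProcessor / prop.warpSize;

    printf("compute capability : %d.%d\n", prop.major, prop.minor);
    printf("max warps per SM   : %d\n", maxWarpsPerSM);
    return 0;
}
```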
UPDATE: On Fermi the Issue Stall Reason counters are updated only on cycles in which the warp scheduler has no eligible warps. On Kepler the Issue Stall Reason counters are updated every cycle, even if the warp scheduler issues an instruction.
KEPLER UPDATE: On Kepler the counters are per warp scheduler, so One Eligible Warp means that the warp scheduler could issue an instruction. On Fermi there is a single counter for both schedulers, so on Fermi you want the One Eligible Warp counter to be as small as possible.
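If all you want is the single number from the question — the percentage of cycles on which no warp could issue — you can compute it once you have the eligible-warp counters. This is only a sketch; the parameter names are placeholders I made up, not the exact Nsight counter identifiers:

```c
#include <stdio.h>

/* Sketch only: percentage of active cycles on which the warp scheduler had
   no eligible warp, i.e. every resident warp was stalled. The parameter
   names are placeholders for whatever the Issue Efficiency experiment
   reports, not exact Nsight counter identifiers. */
static double pct_no_eligible_warp(unsigned long long active_cycles,
                                   unsigned long long cycles_with_eligible_warp)
{
    if (active_cycles == 0)
        return 0.0;
    return 100.0 * (double)(active_cycles - cycles_with_eligible_warp)
                 / (double)active_cycles;
}

int main(void)
{
    /* Illustrative numbers only. */
    printf("%.1f%% of cycles had no eligible warp\n",
           pct_no_eligible_warp(1000000ULL, 650000ULL));
    return 0;
}
```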
Visual Profiler 5.0
The Visual Profiler does not have counters that address your question. Until those counters are added, you can use the following counters:
The target and max IPC values for each compute capability are:
The target IPC is for ALU-limited computation; the target IPC for memory-bound kernels will be lower. For compute capability 2.1 devices and higher, it is harder to use IPC because each warp scheduler can dual-issue.
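To put an achieved IPC number in context, you can compare it against the peak issue rate implied by the scheduler layout. The peak values in the sketch below are my own derivation from the scheduler counts (2 single-issue schedulers on cc 2.0, 2 dual-issue on cc 2.1, 4 dual-issue on cc 3.x), not values taken from the profiler documentation, and the counter numbers are illustrative only:

```c
#include <stdio.h>

/* Assumed peak issue IPC per SM, derived from scheduler counts:
   cc 2.0: 2 schedulers x single issue = 2
   cc 2.1: 2 schedulers x dual issue   = 4
   cc 3.x: 4 schedulers x dual issue   = 8
   These are assumptions, not values from the profiler documentation. */
static double assumed_peak_issue_ipc(int major, int minor)
{
    if (major == 2 && minor == 0) return 2.0;
    if (major == 2 && minor >= 1) return 4.0;
    if (major == 3)               return 8.0;
    return 0.0; /* unknown architecture */
}

int main(void)
{
    /* Illustrative counter values from a hypothetical profiler run. */
    unsigned long long instructions_issued = 1400000ULL;
    unsigned long long active_cycles       = 1000000ULL;

    double ipc  = (double)instructions_issued / (double)active_cycles;
    double peak = assumed_peak_issue_ipc(3, 0);   /* e.g. a cc 3.0 device */

    printf("achieved IPC %.2f = %.0f%% of assumed peak %.1f\n",
           ipc, 100.0 * ipc / peak, peak);
    return 0;
}
```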