How to measure Streaming Multiprocessor use/idle times in CUDA?

A simple question, really: I have a kernel which runs with the maximum number of blocks per Streaming Multiprocessor (SM) possible and would like to know how much more performance I could theoretically extract from it. Ideally, I'd like to know the percentage of SM cycles that are idle, i.e. all warps are blocked on memory access.

I'm really just interested in finding that number. What I'm not looking for is

  • General tips on increasing occupancy. I'm using all the occupancy I can get, and even if I manage to get more performance, it won't tell me how much more would theoretically be possible.
  • How to compute the theoretical peak GFlops. My computations are not FP-centric, there's a lot of integer arithmetic and logic going on too.

Answer by Greg Smith (best answer)

The Issue Efficiency experiments in Nsight Visual Studio Edition 2.1 and 2.2 provide the information you are requesting. These counters/metrics should be added to the Visual Profiler in a release after CUDA 5.0.

Nsight Visual Studio Edition

From Nsight Visual Studio Edition 2.2 User Guide | Analysis Tools | Other Analysis Reports | Profiler CUDA Settings | Issue Efficiency Section

Issue Efficiency provides information about the device's ability to issue the instructions of the kernel. The data reported includes execution dependencies, eligible warps, and SM stall reasons.

For devices of compute capability 2.x, a multiprocessor has two warp schedulers. Each warp scheduler manages at most 24 warps, for a total of 48 warps per multiprocessor. The kernel execution configuration may reduce the runtime limit. For information on occupancy, see the Achieved Occupancy experiment. The first scheduler is in charge of the warps with an odd ID, and the second scheduler is in charge of warps with an even ID.

KEPLER UPDATE: For compute capability 3.x, a multiprocessor has four warp schedulers. Each warp scheduler manages at most 16 warps, for a total of 64 warps per multiprocessor.
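The scheduler limits quoted above reduce to a small lookup. The following is a minimal sketch, assuming only the scheduler counts and per-scheduler warp limits stated in the text; the table and function names are illustrative and not part of any NVIDIA tool or API.

```python
# Scheduler limits as quoted in the text, keyed by compute capability major version.
# These names are illustrative assumptions, not an NVIDIA API.
SCHEDULER_LIMITS = {
    2: {"schedulers": 2, "warps_per_scheduler": 24},  # compute capability 2.x
    3: {"schedulers": 4, "warps_per_scheduler": 16},  # compute capability 3.x
}


def max_warps_per_sm(cc_major):
    """Return the maximum number of resident warps per multiprocessor."""
    limits = SCHEDULER_LIMITS[cc_major]
    return limits["schedulers"] * limits["warps_per_scheduler"]


def warp_limit_fraction(resident_warps, cc_major):
    """Fraction of the hardware warp limit that a launch keeps resident."""
    return resident_warps / max_warps_per_sm(cc_major)
```

For example, a launch configuration that can only keep 32 warps resident on a 3.x device would use half of the 64-warp limit.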

At every instruction issue time, each scheduler will pick an eligible warp from its list of active warps and issue an instruction. A warp is eligible if the instruction has been fetched, the execution unit required by the instruction is available, and the instruction has no dependencies that have not been met.
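The eligibility rule can be written as a simple predicate. This is only a toy model of the three conditions listed above: the `Warp` class and `pick_warp` function are hypothetical, and picking the first eligible warp is an assumption made for illustration, since the text does not describe the real selection policy.

```python
from dataclasses import dataclass


@dataclass
class Warp:
    """Hypothetical snapshot of one active warp at an issue cycle."""
    warp_id: int
    instruction_fetched: bool
    execution_unit_available: bool
    dependencies_met: bool

    def is_eligible(self):
        # All three conditions from the text must hold at issue time.
        return (self.instruction_fetched
                and self.execution_unit_available
                and self.dependencies_met)


def pick_warp(active_warps):
    """Return the first eligible warp, or None if nothing can be issued.

    First-eligible selection is an assumption; the hardware policy may differ.
    """
    for warp in active_warps:
        if warp.is_eligible():
            return warp
    return None
```

A cycle in which `pick_warp` returns `None` corresponds to a scheduler that has active warps but nothing to issue.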

The schedulers report the following statistics on the warps in the multiprocessor:

Active Warps – A warp is active from the time it is scheduled on a multiprocessor until it completes the last instruction. The active warps counter increments by 0-48 per cycle. The maximum increment per cycle is defined by the theoretical occupancy.

KEPLER UPDATE: The range is 0-64 per cycle.

Eligible Warps – An active warp is eligible if it is able to issue the next instruction. Warps that are not eligible will report an Issue Stall Reason. This counter will increment by 0-ActiveWarps per cycle.

UPDATE: On Fermi, the Issue Stall Reason counters are updated only on cycles in which the warp scheduler had no eligible warps. On Kepler, the Issue Stall Reason counters are updated every cycle, even if the warp scheduler issues an instruction.

Zero Eligible Warps – This counter increments each cycle by 1 if neither scheduler has a warp that can be issued.

One Eligible Warp – This counter increments each cycle by 1 if only one of the two schedulers has a warp that can be issued.

KEPLER UPDATE: On Kepler the counters are per scheduler, so One Eligible Warp means that the warp scheduler could issue an instruction. On Fermi there is a single counter for both schedulers, so you want the One Eligible Warp counter to be as small as possible.
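To turn the Zero Eligible Warps and One Eligible Warp counters into the idle percentage asked about in the question, divide by the number of cycles measured. The sketch below assumes the Fermi semantics described above (one counter shared by two schedulers) and assumes you have already copied the counter values and a cycle count out of the Nsight report. The function and parameter names are hypothetical.

```python
def issue_slot_breakdown(zero_eligible, one_eligible, cycles):
    """Split measured cycles by how many of the two schedulers could issue.

    Assumes Fermi-style counters: each increments by at most 1 per cycle.
    """
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    if zero_eligible + one_eligible > cycles:
        raise ValueError("counters exceed the cycle count")
    # Whatever is left over are cycles in which both schedulers had work.
    both_eligible = cycles - zero_eligible - one_eligible
    return {
        "zero_eligible": zero_eligible / cycles,
        "one_eligible": one_eligible / cycles,
        "both_eligible": both_eligible / cycles,
    }
```

The `zero_eligible` fraction is the share of cycles in which neither scheduler could issue, which is probably the closest match to "all warps are blocked" in the question.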

Warp Issue Holes – This counter increments each cycle by the number of active warps that are not eligible. This is the same as Active Warps minus Eligible Warps.

Long Warp Issue Holes – This counter increments each cycle by the number of active warps that have not been eligible to issue an instruction for more than 32 clock cycles. Long holes indicate that warps are stalled on long latency reasons such as barriers and memory operations.
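To make the relationship between these counters concrete, here is a small simulation over a made-up eligibility trace. It is a sketch of the definitions above, not of how the hardware actually implements the counters. The trace format (one dict per cycle mapping a warp ID to whether that active warp is eligible) and the function name are assumptions.

```python
def accumulate_issue_counters(trace, long_threshold=32):
    """Accumulate the counters defined above from a per-cycle eligibility trace.

    trace: list of dicts, one per cycle, mapping active warp id -> eligible (bool).
    """
    totals = {"active_warps": 0, "eligible_warps": 0,
              "warp_issue_holes": 0, "long_warp_issue_holes": 0}
    stalled_for = {}  # warp id -> consecutive cycles without being eligible
    for cycle in trace:
        # Forget warps that are no longer active.
        stalled_for = {w: n for w, n in stalled_for.items() if w in cycle}
        for warp_id, eligible in cycle.items():
            totals["active_warps"] += 1
            if eligible:
                totals["eligible_warps"] += 1
                stalled_for[warp_id] = 0
            else:
                totals["warp_issue_holes"] += 1
                stalled_for[warp_id] = stalled_for.get(warp_id, 0) + 1
                # A hole becomes "long" once it exceeds the threshold.
                if stalled_for[warp_id] > long_threshold:
                    totals["long_warp_issue_holes"] += 1
    return totals
```

With this model, Warp Issue Holes always equals Active Warps minus Eligible Warps, and a warp that stays ineligible for 40 consecutive cycles contributes 8 long holes (cycles 33 through 40).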

Issue Stall Reasons – Each cycle, each ineligible warp increments one of the issue stall reason counters. The sum of all issue stall reason counters is equal to Warp Issue Holes. An ineligible warp increments:

  • Instruction Fetch, if the next assembly instruction has not yet been fetched.
  • Execution Dependency, if an input dependency is not yet available. This can be reduced by increasing the number of independent instructions.
  • Data Requests, if the request cannot currently be made because the required resources are unavailable or fully utilized, or because too many operations of that type are already outstanding. If data requests make up a large portion of the stall reasons, also run the memory experiments to determine whether you can optimize existing transactions per request or need to revisit your algorithm.
  • Texture, if the texture sub-system is already fully utilized and cannot currently accept further operations.
  • Synchronization, if the warp is blocked at a __syncthreads(). If this reason is large and the kernel execution configuration is limited to a small number of blocks, consider dividing the kernel grid into more thread blocks.
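Since the stall reason counters sum to Warp Issue Holes, each one can be expressed as a share of the total. The following sketch assumes you have typed the counter values from the report into a dict. The key names and helper functions are hypothetical, and the sample numbers in the usage note are made up.

```python
def stall_breakdown(stall_counters):
    """Return each stall reason as a fraction of all warp issue holes."""
    total = sum(stall_counters.values())  # equals Warp Issue Holes per the text
    if total == 0:
        return {reason: 0.0 for reason in stall_counters}
    return {reason: count / total for reason, count in stall_counters.items()}


def dominant_stall(stall_counters):
    """Return the stall reason with the largest counter value."""
    return max(stall_counters, key=stall_counters.get)
```

For instance, with made-up counters of 100 instruction fetch, 300 execution dependency, 500 data requests, 0 texture, and 100 synchronization, data requests account for half of the holes, which would point toward the memory experiments.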

Visual Profiler 5.0

The Visual Profiler does not have counters that address your question. Until the counters are added you can use the following counters:

  • sm_efficiency[_instance]
  • ipc[_instance]
  • achieved_occupancy

The target and max IPCs for compute capabilities are:

Compute     Target   Max
Capability  IPC      IPC
2.0         1.7      2.0
2.x         2.3      4.0
3.x         4.4      7.0

The target IPC is for ALU limited computation. The target IPC for memory bound kernels will be less. For compute capability 2.1 devices and higher it is harder to use IPC as each warp scheduler can dual-issue.
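As a rough way to use the table, you can compare a measured ipc value against the target and max for your device. This is only a sketch: the row labels are copied from the table as written, the function name is hypothetical, and for memory bound kernels the target is lower than the figure used here.

```python
# Target and max IPC per compute capability, copied from the table above.
IPC_TABLE = {
    "2.0": {"target": 1.7, "max": 2.0},
    "2.x": {"target": 2.3, "max": 4.0},
    "3.x": {"target": 4.4, "max": 7.0},
}


def ipc_headroom(measured_ipc, compute_capability):
    """Express a measured IPC as a fraction of the tabulated target and max."""
    row = IPC_TABLE[compute_capability]
    return {
        "fraction_of_target": measured_ipc / row["target"],
        "fraction_of_max": measured_ipc / row["max"],
    }
```

A kernel measuring an ipc of 2.2 on a 3.x device would be at about half of the ALU-limited target, though on dual-issue devices this number should be read with the caveat above in mind.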