What are the "long" and "short" scoreboards w.r.t. MIO/L1TEX?


With recent NVIDIA micro-architectures, there's a new (?) taxonomy of warp stall reasons / warp scheduler states.

Two of the items in this taxonomy are:

  • Short scoreboard - scoreboard dependency on an MIO queue operation.
  • Long scoreboard - scoreboard dependency on an L1TEX operation.

where, I presume, "scoreboard" is used in the sense of out-of-order execution data dependency tracking (see e.g. here).

My questions:

  • What do the adjectives "short" or "long" describe? Is it the length of a single scoreboard? Two different scoreboards for the two different kinds of operations?
  • What's the meaning of this somewhat non-intuitive dichotomy between MIO operations, only some of which are memory operations, and L1TEX operations, which are all memory operations? Is it a dichotomy w.r.t. stall reasons only, or does it reflect actual hardware?

There is 1 answer

Greg Smith On

The NVIDIA GPU has two classes of instructions:

  1. Fixed latency - math, bitwise, register movement
  2. Variable latency - ld/st to shared, local, global, and texture as well as slow math operations

The Short Scoreboard and Long Scoreboard stall reasons are reported on instructions that depend on data returned by a variable latency instruction. Short scoreboard is reported for dependencies on variable latency instructions that will not leave the SM, such as slow math (e.g. reciprocal sqrt) or shared memory accesses. Long scoreboard is reported for dependencies on instructions that may leave the SM, such as global/local memory accesses and texture fetches.
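To make the distinction concrete, here is a minimal sketch of two kernels whose dependent instructions would be expected to fall into the two categories. The kernel names, block size, and input values are illustrative assumptions, not anything from the profiling guide. Which stall reason a profiler actually reports depends on the compiler's instruction scheduling and on the target architecture, so treat this as something to profile rather than as a guaranteed reproduction.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // block size assumed by the shared-memory tile

// The multiply-add consumes the result of a global load. A global load goes
// through L1TEX and may leave the SM, so a wait here would be expected to be
// attributed to "long scoreboard".
__global__ void dependent_on_global(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[i];           // variable latency, may leave the SM
        out[i] = v * 2.0f + 1.0f;  // first use of v depends on the load
    }
}

// The final add consumes a shared-memory load and an rsqrtf result. Both are
// variable latency but stay inside the SM, so waits on them would be expected
// to be attributed to "short scoreboard". (The initial fill of the tile still
// depends on a global load.)
__global__ void dependent_on_shared_and_rsqrt(const float* in, float* out, int n) {
    __shared__ float tile[kThreads];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 1.0f;
    __syncthreads();
    float s = tile[(threadIdx.x + 1) % kThreads];  // shared-memory load
    float r = rsqrtf(s);                           // special-function math
    if (i < n) out[i] = s + r;                     // depends on both results
}

int main() {
    const int n = 1 << 20;
    float *in, *out_a, *out_b;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out_a, n * sizeof(float));
    cudaMallocManaged(&out_b, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 4.0f;

    const int blocks = (n + kThreads - 1) / kThreads;
    dependent_on_global<<<blocks, kThreads>>>(in, out_a, n);
    dependent_on_shared_and_rsqrt<<<blocks, kThreads>>>(in, out_b, n);
    if (cudaDeviceSynchronize() != cudaSuccess) {
        printf("CUDA error\n");
        return 1;
    }

    // 4*2+1 is exactly 9; rsqrtf(4) is approximately 0.5, so compare with a tolerance.
    bool ok = true;
    for (int i = 0; i < n; ++i) {
        ok = ok && out_a[i] == 9.0f && fabsf(out_b[i] - 4.5f) < 1e-5f;
    }
    printf(ok ? "results ok\n" : "mismatch\n");

    cudaFree(in);
    cudaFree(out_a);
    cudaFree(out_b);
    return ok ? 0 : 1;
}
```

Profiling the two kernels separately and comparing the per-line stall breakdown in the source view is one way to check whether the attribution matches the description above on a given GPU.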

Detailed descriptions from the Nsight Compute v2020.3.1 Kernel Profiling Guide:

Long Scoreboard

Warp was stalled waiting for a scoreboard dependency on a L1TEX (local, global, surface, tex) operation. To reduce the number of cycles waiting on L1TEX data accesses verify the memory access patterns are optimal for the target architecture, attempt to increase cache hit rates by increasing data locality, or by changing the cache configuration, and consider moving frequently used data to shared memory.

Short Scoreboard

Warp was stalled waiting for a scoreboard dependency on a MIO (memory input/output) operation (not to L1TEX). The primary reason for a high number of stalls due to short scoreboards is typically memory operations to shared memory. Other reasons include frequent execution of special math instructions (e.g. MUFU) or dynamic branching (e.g. BRX, JMX). Verify if there are shared memory operations and reduce bank conflicts, if applicable.
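As an illustration of the bank-conflict advice, the sketch below transposes a matrix through a shared-memory tile, once without padding and once with one padding column. It assumes shared memory organized as 32 banks of 4-byte words; the kernel name, the `PAD` template parameter, and the matrix size are hypothetical choices for this example. With `PAD = 0` the column-wise read makes all 32 threads of a warp hit the same bank; with `PAD = 1` consecutive rows start in different banks, which is a common way to avoid that conflict.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int TILE = 32;

// Transposes an n x n matrix through a shared-memory tile.
// PAD = 0: tile rows are 32 floats wide, so reading a column hits one bank.
// PAD = 1: tile rows are 33 floats wide, so a column spreads over 32 banks.
template <int PAD>
__global__ void transpose_tile(const float* in, float* out, int n) {
    __shared__ float tile[TILE][TILE + PAD];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n) tile[threadIdx.y][threadIdx.x] = in[y * n + x];  // row-wise write
    __syncthreads();

    // Coordinates of this thread's element in the transposed output block.
    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    if (tx < n && ty < n) out[ty * n + tx] = tile[threadIdx.x][threadIdx.y];  // column-wise read
}

int main() {
    const int n = 64;
    float *in, *out_plain, *out_padded;
    cudaMallocManaged(&in, n * n * sizeof(float));
    cudaMallocManaged(&out_plain, n * n * sizeof(float));
    cudaMallocManaged(&out_padded, n * n * sizeof(float));
    for (int i = 0; i < n * n; ++i) in[i] = static_cast<float>(i);

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    transpose_tile<0><<<grid, block>>>(in, out_plain, n);
    transpose_tile<1><<<grid, block>>>(in, out_padded, n);
    if (cudaDeviceSynchronize() != cudaSuccess) {
        printf("CUDA error\n");
        return 1;
    }

    // Both variants must produce the same, correct transpose.
    bool ok = true;
    for (int y = 0; y < n; ++y) {
        for (int x = 0; x < n; ++x) {
            ok = ok && out_plain[y * n + x] == in[x * n + y]
                    && out_padded[y * n + x] == in[x * n + y];
        }
    }
    printf(ok ? "transpose ok\n" : "mismatch\n");

    cudaFree(in);
    cudaFree(out_plain);
    cudaFree(out_padded);
    return ok ? 0 : 1;
}
```

Both variants compute the same result; the difference should only show up in the profiler, as fewer shared-memory bank conflicts and, plausibly, fewer short scoreboard stalls for the padded version.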

MIO vs. L1TEX

MIO and L1TEX are partitions in the NVIDIA SM. The MIO unit is responsible for shared execution units (shared by one or more SM sub-partitions), including lower-rate math units (e.g. double precision on a GeForce chip) and memory input/output. The memory subsystem contains the L1, the TEX unit, the shared memory unit, and other domain-specific (e.g. graphics) interfaces to the SM. The implementation of the MIO subsystem, including L1, TEX, and shared memory, varies greatly between Kepler, Maxwell-Pascal, and Volta-Ampere. SM sub-partitions (warp schedulers) issue instructions to shared execution units through instruction queues rather than by direct dispatch. For SM 7.0+ there are stall reasons (mio_throttle, lg_throttle, and tex_throttle) that occur if the instruction queues for those units are full.

What is included in the definition of MIO varies by architecture. L1TEX is technically in the MIO partition. L1TEX is complicated because it has two input interfaces:

  1. LSU interface is for shared memory, local/global memory (tagged), and special operations such as shuffle and special purpose registers.
  2. TEX interface is for texture fetches and, on 7.0-8.x, a subset of the slow math operations (e.g. FP64 on a GeForce card). The latter is a little confusing. The slow math units exist for binary compatibility and are not expected to be used at the same time as texture fetches.

The term MIO can be confusing. The term L1TEX can also be confusing, given its two different interfaces. Although there are two interfaces, local/global and texture/surface accesses share the same cache lookup stages, the same cache RAM, and the same SM-to-L2 interface, so for many metrics the term L1TEX is used to refer to the whole unit.