I have a simple matrix multiplication kernel running on CUDA.
When compiling with the -lineinfo option along with --ptxas-options -v, the register count is reported as 28, whereas without the -lineinfo option the register count is 20.
Exact commands used:
nvcc -g -G --ptxas-options -v -arch=sm_86 -o mmul_dbg mmul.cu
and
nvcc -lineinfo --ptxas-options -v -arch=sm_86 -o mmul_ncu mmul.cu
I also checked with
nvcc --ptxas-options -v -arch=sm_86 -o mmul_dbg mmul.cu
and it yields 20 registers.
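(Side note: as far as I know, -Xptxas -v is just the shorthand spelling of --ptxas-options -v, so the following should be an equivalent way to get the same register report:)
nvcc -Xptxas -v -lineinfo -arch=sm_86 -o mmul_ncu mmul.cu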
__global__ void matrixMul(const int *a, const int *b, int *c, int N) {
  // Compute each thread's global row and column index
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;

  // Iterate over row, and down column
  c[row * N + col] = 0;
  for (int k = 0; k < N; k++) {
    // Accumulate results for a single element
    c[row * N + col] += a[row * N + k] * b[k * N + col];
  }
}
What could be the reason for the increased register count?
Edit: nvcc is 12.3
Edit (2): removed image and added textual output
$ nvcc --ptxas-options -v -lineinfo -o wlineinfo -arch=sm_86 mmul.cu
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z9matrixMulPKiS0_Pii' for 'sm_86'
ptxas info : Function properties for _Z9matrixMulPKiS0_Pii
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 28 registers, 380 bytes cmem[0]
$ nvcc --ptxas-options -v -g -G -o wlineinfo -arch=sm_86 mmul.cu
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z9matrixMulPKiS0_Pii' for 'sm_86'
ptxas info : Function properties for _Z9matrixMulPKiS0_Pii
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 20 registers, 380 bytes cmem[0]
The reason for the difference is the use of the -G switch, which selects compilation in debug mode; the observation has nothing to do with the use of -lineinfo. In debug mode, many/most optimizations are disabled. One optimization the compiler may otherwise use, but which is disabled in debug mode, is loop unrolling.
In the case without -G, the compiler implements loop unrolling. The overall number of instructions in the kernel is substantially higher, and a possible side effect of loop unrolling for performance is increased register pressure. So, due to loop unrolling in the non-debug case, the compiler has chosen a different register footprint to carry data. The character limits in the answer prevent me from providing the full machine-code (SASS) output for both cases, but you can get it yourself with the cuobjdump tool.
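For instance, a command sketch along these lines (the binary names are just the ones from the compile commands in the question) dumps the SASS for the two builds so they can be compared side by side:
cuobjdump -sass mmul_ncu
cuobjdump -sass mmul_dbg
The first command gives the optimized (-lineinfo) build, the second the -G debug build.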
At the tail end of the listing for the unrolled (no -G) case, you will find a sequence of instructions that repeats. That repeating sequence represents the unrolled loop body. If you use cuobjdump to study the -G code, you will find: 1. fewer instructions overall, and 2. no such repeating sequence. I acknowledge this answer does not provide a detailed, precise description of the reason for the increased register use in the optimized case. That would require more careful analysis and counting, as well as probably some conjecture about compiler behavior.
Loop unrolling by itself does not necessarily/automatically imply increased register usage, but the two are often related.
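If you want to probe the unrolling/register relationship yourself, one cheap experiment (a sketch, not a guaranteed outcome; the exact numbers depend on the compiler version) is to forbid unrolling with #pragma unroll 1, rebuild without -G, and compare the ptxas report:

__global__ void matrixMul(const int *a, const int *b, int *c, int N) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;

  c[row * N + col] = 0;
  // Ask the compiler not to unroll this loop, even in optimized builds
  #pragma unroll 1
  for (int k = 0; k < N; k++) {
    c[row * N + col] += a[row * N + k] * b[k * N + col];
  }
}

If the register count reported by --ptxas-options -v then moves back toward the debug figure, that supports the unrolling explanation; if it does not, other optimizations are contributing as well.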