Meaning of the "flop_count_sp" and "inst_fp_32" metrics in the CUDA profiler

According to the profiler user guide:

flop_count_sp: Number of single-precision floating-point operations executed by non-predicated threads (add, multiply and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count. The count does not include special operations.

inst_fp_32: Number of single-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)

I have a kernel whose profiler output adds up to something like:

flop_count_sp = flop_count_sp_add + flop_count_sp_mul + 2 * flop_count_sp_fma
inst_fp_32 = flop_count_sp_add + flop_count_sp_mul + flop_count_sp_fma

Given the numbers in these metrics, I am wondering what counts as an operation and what counts as an instruction here. It seems that an FMA is one instruction but two operations, whereas an add or a mul is one instruction and one operation. Since the profiler counts SASS instructions, are there any instructions that are not counted as operations, or vice versa? I only want to know in the context of nvprof and nvvp metrics.

Also, when we talk about peak performance in TFLOP/s, I guess the OP corresponds to operations? If I want to estimate something like the compute to global memory access (CGMA) ratio, should I use flop_count_sp rather than inst_fp_32 for the compute part? Thanks in advance.

Answer by Robert Crovella (best answer)

I am wondering what counts as an operation and what counts as an instruction here. It seems that an FMA is one instruction but two operations, whereas an add or a mul is one instruction and one operation.

Yes, correct. Fused-Multiply-Add instructions count as 2 operations (a multiply, plus an add). A multiply or add instruction counts as one operation.
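The counting rule can be sketched in a few lines of Python. The function name and the sample counts are hypothetical, and the instruction total follows the relation observed in the question, which only holds when a kernel's single-precision instructions are all adds, multiplies, and FMAs (compares and similar instructions would add to inst_fp_32 without adding operations).

```python
def fp32_totals(n_add, n_mul, n_fma):
    """Return (operations, instructions) from per-type FP32 counts.

    Hypothetical helper: the arguments stand in for the profiler's
    flop_count_sp_add, flop_count_sp_mul and flop_count_sp_fma values.
    """
    # Each FMA contributes two operations (a multiply plus an add).
    flop_count_sp = n_add + n_mul + 2 * n_fma
    # Each FMA is still a single instruction.
    # Assumes no other FP32 instructions (compare, etc.) are present.
    inst_fp_32 = n_add + n_mul + n_fma
    return flop_count_sp, inst_fp_32


# Made-up counts, for illustration only.
ops, insts = fp32_totals(n_add=1000, n_mul=500, n_fma=2000)
print(ops, insts)  # 5500 3500
```

The gap between the two totals equals the FMA count, so a kernel dominated by FMAs approaches a 2:1 ratio of operations to instructions.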

Are there any instructions that are not counted as operations?

Yes. Any instruction that does not use the single-precision functional units inside the SM (or the double-precision units, for e.g. flop_count_dp) will not contribute to these metrics, either the instruction count or the operation count. For example, integer instructions and load or store instructions do not affect them. I believe an instruction that has some floating-point nature (e.g. conversion to or from floating point) but does not consist of add or multiply operations would not contribute to the op metric either.

Also, when we talk about peak performance in TFLOP/s, I guess the OP corresponds to operations?

Yes
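As a rough illustration of why peak figures are quoted in operations, here is a minimal sketch of the usual peak-throughput arithmetic. It assumes the common convention that every FP32 lane retires one FMA per clock cycle, which counts as two operations. The function name, lane count, and clock rate are hypothetical and do not describe any particular GPU.

```python
def peak_fp32_tflops(fp32_lanes, clock_ghz):
    """Theoretical peak in TFLOP/s, assuming one FMA per lane per cycle."""
    ops_per_lane_per_cycle = 2  # one FMA = multiply + add
    gflops = ops_per_lane_per_cycle * fp32_lanes * clock_ghz
    return gflops / 1000.0


# Made-up device: 2560 FP32 lanes at 1.5 GHz.
print(peak_fp32_tflops(2560, 1.5))  # 7.68
```

Under this convention, a kernel that issues only adds or only multiplies can reach at most half of the quoted peak, because each of its instructions contributes one operation instead of two.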

If I want to estimate something like compute to global memory access (CGMA), should I use flop_count_sp instead of the inst_fp_32 for the compute part?

I think this might be a matter of opinion. I would use instructions. A fused-multiply-add instruction, as already mentioned, counts as 2 operations, but it does not "double" the pressure on the floating point units. Therefore, when examining a code for the balance between global memory load/store activity and compute "pressure", I would use instructions. Again, this is possibly a matter of opinion.
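To see how much the choice matters, the two candidate ratios can be computed side by side. This is only a sketch: the function name is hypothetical, the counts are made up, and `global_accesses` is a placeholder for whichever global load/store count you choose to divide by (it should be measured at the same granularity as the compute metric).

```python
def compute_to_memory_ratios(flop_count_sp, inst_fp_32, global_accesses):
    """Return (operation-based ratio, instruction-based ratio)."""
    if global_accesses <= 0:
        raise ValueError("global_accesses must be positive")
    by_operations = flop_count_sp / global_accesses
    by_instructions = inst_fp_32 / global_accesses
    return by_operations, by_instructions


# Made-up counts: 5500 operations, 3500 instructions, 700 global accesses.
by_ops, by_insts = compute_to_memory_ratios(5500, 3500, 700)
print(round(by_ops, 2), round(by_insts, 2))  # 7.86 5.0
```

For an FMA-heavy kernel the operation-based ratio can be nearly twice the instruction-based one, which is why the instruction count gives the more conservative picture of functional-unit pressure.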