Why is the Compute Throughput’s value different from the actual Performance / Peak Performance?

Question

824 views Asked by TherLF At 11 September 2022 at 14:15

I want to build a roofline model for my kernels, so I launch ncu with the following command:

ncu --csv --target-processes all --set roofline mpirun -n 1 ./run_pselinv_linux_release_v2.0 -H H3600.csc -file ./tmpfile

The roofline set collects enough data to build the roofline model, but I can't work out clearly what each metric means.

The Compute (SM) Throughput is collected through the metric sm__throughput.avg.pct_of_peak_sustained_elapsed, which reports 0.64%. I assumed this was the percentage of Peak Performance. But when I divide the Performance (6855693348.37) by the Peak Work (5080428410372), I get 0.13%, which is much lower than 0.64%.

Besides, I want to collect the FLOPS and memory usage in my kernel, not their throughput.

So my questions are:

1. What is the real meaning of SM Throughput and Memory Throughput? Are they the percentages of Peak Work and Peak Traffic? By the way, Peak Work and Peak Traffic are the Peak Performance and the Peak Bandwidth of DRAM respectively, right?
2. To get the real FLOPS and memory usage of my kernel, I want to multiply the Compute (SM) Throughput by the Peak Work to get the real-time Performance, and then multiply that Performance by the elapsed time to get the FLOPS. I would do the same for memory usage. Is my method correct?

I have been searching for an answer to this for two days but still can't find a clear one.

There is 1 answer

**TherLF** · Accepted Answer · 2022-09-15T07:21:39+00:00

I found the answer in this question: Terminology used in Nsight Compute. In short, SM Throughput and Memory Throughput are each the maximum of a series of metrics. I had tried to infer their meanings from their names alone, which was totally wrong.
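To illustrate the "maximum of a series of metrics" idea, here is a minimal Python sketch. The breakdown labels and all the numbers are made up for illustration (they are not real profiler output, and the real breakdown contains different and more numerous sub-metrics); only the max-over-sub-metrics rollup reflects what is described above.

```python
def unit_throughput(breakdown_pct):
    """Roll a breakdown of sub-metric percentages up into one value
    by taking the largest entry, as described for SM/Memory Throughput."""
    if not breakdown_pct:
        raise ValueError("breakdown must not be empty")
    return max(breakdown_pct.values())


# Hypothetical sub-metrics, each as a percentage of its own peak.
sm_breakdown = {
    "instruction issue": 0.64,
    "ALU pipe": 0.41,
    "FP64 pipe": 0.13,
}

print(unit_throughput(sm_breakdown))
```

Under this reading, the headline percentage follows whichever sub-metric is busiest, so it does not have to match achieved FLOP/s divided by peak FLOP/s.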

By the way, the correct way to collect the FLOPS and memory usage of your model is described in this lab: Roofline Model on NVIDIA GPUs. The methodology this lab uses:

Time:

sm__cycles_elapsed.avg / sm__cycles_elapsed.avg.per_second

FLOPs:

DP: sm__sass_thread_inst_executed_op_dadd_pred_on.sum + 2 x sm__sass_thread_inst_executed_op_dfma_pred_on.sum + sm__sass_thread_inst_executed_op_dmul_pred_on.sum

SP: sm__sass_thread_inst_executed_op_fadd_pred_on.sum + 2 x sm__sass_thread_inst_executed_op_ffma_pred_on.sum + sm__sass_thread_inst_executed_op_fmul_pred_on.sum

HP: sm__sass_thread_inst_executed_op_hadd_pred_on.sum + 2 x sm__sass_thread_inst_executed_op_hfma_pred_on.sum + sm__sass_thread_inst_executed_op_hmul_pred_on.sum

Tensor Core: 512 x sm__inst_executed_pipe_tensor.sum

Bytes:

DRAM: dram__bytes.sum

L2: lts__t_bytes.sum

L1: l1tex__t_bytes.sum
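The formulas above can be combined into one roofline point per kernel. The following is a small Python sketch, assuming the metric values have already been read out of the ncu report by hand; the function names and the sample numbers are hypothetical and only serve to show the arithmetic.

```python
def kernel_time_s(cycles_elapsed_avg, cycles_per_second_avg):
    # Time = sm__cycles_elapsed.avg / sm__cycles_elapsed.avg.per_second
    return cycles_elapsed_avg / cycles_per_second_avg


def flop_count(add, mul, fma):
    # add + mul + 2 x fma: each fused multiply-add counts as two FLOPs.
    return add + mul + 2 * fma


def roofline_point(flops, bytes_moved, time_s):
    # Returns (arithmetic intensity in FLOP/byte, performance in FLOP/s).
    return flops / bytes_moved, flops / time_s


# Hypothetical values standing in for numbers taken from a report.
time_s = kernel_time_s(500_000_000, 1_000_000_000)   # cycles, cycles/s
dp_flops = flop_count(add=1000, mul=2000, fma=3000)   # dadd, dmul, dfma sums
dram_bytes = 4500                                     # dram__bytes.sum

intensity, performance = roofline_point(dp_flops, dram_bytes, time_s)
print(f"time = {time_s} s, FLOPs = {dp_flops}")
print(f"intensity = {intensity} FLOP/byte, performance = {performance} FLOP/s")
```

Swapping `dram_bytes` for the L2 or L1 byte count gives the point for that memory level, and the same `flop_count` applies to the SP and HP instruction sums.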
