Is there a way to profile an MPI program with detailed cache/CPU efficiency information?

OS: Ubuntu 18.04

Question: How can I profile a multi-process program?

I usually profile a program with the Linux perf tool by running perf stat -d ./main [args], which prints detailed performance counters like these:

         47,455.09 msec task-clock                #    8.602 CPUs utilized          
           129,199      context-switches          #    0.003 M/sec                  
                92      cpu-migrations            #    0.002 K/sec                  
            16,228      page-faults               #    0.342 K/sec                  
   117,757,409,457      cycles                    #    2.481 GHz                      (49.84%)
   236,496,093,412      instructions              #    2.01  insn per cycle           (62.31%)
     1,454,901,353      branches                  #   30.658 M/sec                    (62.18%)
         6,168,091      branch-misses             #    0.42% of all branches          (62.30%)
   183,462,410,176      L1-dcache-loads           # 3866.021 M/sec                    (62.55%)
       189,736,991      L1-dcache-load-misses     #    0.10% of all L1-dcache hits    (62.75%)
         8,330,520      LLC-loads                 #    0.176 M/sec                    (50.14%)
           628,142      LLC-load-misses           #    7.54% of all LL-cache hits     (50.25%)

       5.516529249 seconds time elapsed

      46.947476000 seconds user
       0.989185000 seconds sys

The metrics I care about are CPU efficiency (line 1 of the output, "CPUs utilized"), IPC (line 6, "insn per cycle"), and L1 and LLC bandwidth (lines 9 and 11).
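For reference, the derived figures in the right-hand column follow from the raw counts, which helps when recomputing them per process. Below is a minimal Python sketch using the numbers from the output above. The function name `derived_metrics` and the dictionary keys are made up for illustration, and it assumes perf normalizes its rate columns by task-clock rather than by elapsed time, which is consistent with the figures shown.

```python
def derived_metrics(c, elapsed_s):
    """Recompute perf stat's derived columns from raw counter values.

    `c` maps event names to raw counts; task-clock is given in milliseconds.
    """
    task_s = c["task-clock"] / 1000.0  # CPU time summed over all threads, in seconds
    return {
        "cpus_utilized": task_s / elapsed_s,
        "ipc": c["instructions"] / c["cycles"],
        "l1_loads_M_per_s": c["L1-dcache-loads"] / task_s / 1e6,
        "llc_loads_M_per_s": c["LLC-loads"] / task_s / 1e6,
        "llc_miss_pct": 100.0 * c["LLC-load-misses"] / c["LLC-loads"],
    }


# Raw values copied from the perf stat output above.
counters = {
    "task-clock": 47455.09,
    "cycles": 117_757_409_457,
    "instructions": 236_496_093_412,
    "L1-dcache-loads": 183_462_410_176,
    "LLC-loads": 8_330_520,
    "LLC-load-misses": 628_142,
}

metrics = derived_metrics(counters, elapsed_s=5.516529249)
for name, value in metrics.items():
    print(f"{name:>18}: {value:.3f}")
```

Note that the "bandwidth" figures here are loads per second, not bytes per second; converting to bytes would require an assumption about the access width that the counters do not provide.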

Now I need to profile each process of an MPI program separately. Assuming the program runs as 3 processes via mpiexec -np 3 ./main [args], how can I get the CPU efficiency, IPC, L1, and LLC figures for each process? (Running the whole job under perf stat -d only gives me totals aggregated over all 3 processes, which is not enough for my purposes.)

The output I want is like this:

PID: 1
LLC Band.: xxx

PID: 2
LLC Band.: xxx

PID: 3
LLC Band.: xxx

How can I do this? Can GNU gperf do it, or is there a way to do it from within C++?

There is 1 answer

Answer by dabo42 (best answer)

Basic profilers like gperf or gprof don't work well with MPI programs, but there are many profiling tools specifically designed to work with MPI that collect and report data for each MPI rank. Virtually all of them can collect hardware performance counters for cache misses. Here are a few options:

  • HPCToolkit for sampling-based profiling. Works on unmodified binaries.
  • TAU and Score-P provide instrumentation-based profiling. They usually require recompiling.
  • TiMemory and Caliper let you mark code regions to measure. TiMemory also has scripts for roofline analysis etc.

Decent HPC centers typically have one or more of them installed. Refer to the manuals to learn how to gather hardware counters.
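If none of these tools is available, one lightweight workaround is to let the MPI launcher start a separate perf stat for each rank and write each report to its own file. The sketch below is a hypothetical wrapper, not part of any of the tools above: the script name `perf_rank.py` and its helper functions are made up for illustration, and it assumes the launcher exports the rank in one of `OMPI_COMM_WORLD_RANK` (Open MPI), `PMI_RANK` (MPICH or Intel MPI with Hydra), or `SLURM_PROCID` (Slurm's srun).

```python
#!/usr/bin/env python3
"""perf_rank.py: run one MPI rank under its own `perf stat -d` (hypothetical helper)."""
import os
import sys

# Rank variables set by common launchers: Open MPI, MPICH/Intel MPI (Hydra), Slurm srun.
RANK_VARS = ("OMPI_COMM_WORLD_RANK", "PMI_RANK", "SLURM_PROCID")


def find_rank(env):
    """Return this process's MPI rank as a string, or 'unknown' if no variable is set."""
    for var in RANK_VARS:
        if var in env:
            return env[var]
    return "unknown"


def build_command(env, program_argv):
    """Build the perf command line; -o sends each rank's report to its own file."""
    out_file = f"perf.rank{find_rank(env)}.txt"
    return ["perf", "stat", "-d", "-o", out_file, "--"] + list(program_argv)


# Only launch perf when a program was actually given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    cmd = build_command(os.environ, sys.argv[1:])
    os.execvp(cmd[0], cmd)  # replace this wrapper with perf, which then runs the program
```

Under those assumptions, mpiexec -np 3 python3 perf_rank.py ./main [args] should leave one report per rank (perf.rank0.txt, perf.rank1.txt, perf.rank2.txt), each in the same format as a single-process perf stat -d run. MPI ranks are numbered from 0, so the files are keyed by rank rather than by operating-system PID. If no rank variable is found, every process writes to perf.rankunknown.txt and the reports overwrite each other, so check which variable your launcher sets.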