How to interpret Intel VTune Amplifier's Locks&Waits


I am trying to parallelize one hotspot of my C++ program with OpenMP, but it does not scale. It takes 25 seconds on 1 thread, and I only get down to 21 seconds with 2 threads. I ran a Locks & Waits analysis with Intel VTune Amplifier, but it does not really help me. It looks like:

Result of the VTune Amplifier

In particular, I do not understand where the mkl_blas_dcopy comes from and what is calling it (even if I remove my parallel region, I still have this call and a second thread in the timeline).

I tried to get more information out of the Top-Down Tree, but it is not really helpful for me.


An Advanced Hotspots analysis did not give me more information either. How should I approach this issue in order to identify the problem?

Additional information: Previously the overall runtime was much worse. I then made many optimisations in the serial code, which improved performance, but since then my code no longer scales.

Many thanks in advance!

Edit: Here is the timeline as well. No Transitions are shown, no matter how far I zoom in. In this case I used another test case with 8 threads.


There are 2 answers

Kirill Rogozhin On
  1. Which version of VTune are you using? It does not look like the latest one: the frame rate for OpenMP regions shown in your screenshot has been removed in the current version. It is worth trying the new 2015 Update 1, which includes some fixes and improvements for OpenMP analysis.
  2. Which compiler and OpenMP runtime are you using? With the Intel compiler and Intel OpenMP runtime, the VTune analysis is much more informative for OpenMP regions. Just change the grouping in Bottom-up from "Function/callstack" to "OpenMP region/..." and you will find a lot of interesting information.
  3. You see mkl_blas_dcopy because you seem to use MKL functions in your code; mkl_blas_dcopy is just an internal MKL function. To find the actual MKL call in your code, select the "mkl_blas_dcopy" hotspot in Bottom-up and look at the stack panel on the right - you should be able to see the call chain up to main().
  4. MKL is already parallelized with OpenMP. It is possible that you put an MKL call inside your own OpenMP region. If that is the case, it is not optimal - OpenMP does not handle nesting well. You should choose one of two options: use the parallel version of MKL without your own OpenMP parallelization around it, or use the serial MKL library inside your OpenMP parallel region. You can control the serial/parallel MKL setting via linking, see the MKL Link Line Advisor: https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor
  5. Each frame in the timeline on your screenshot is likely an OpenMP region from MKL. There seem to be many parallel regions of short duration, which may indicate that MKL is called from a loop. In that case every iteration starts, executes and stops an OpenMP parallel region. The start and stop actions have some overhead, which contributes to your large waiting time. So it may be worth trying the serial MKL version inside an outer OpenMP loop, to avoid re-entering a parallel region many times.
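
To make points 4 and 5 more concrete, here is a minimal sketch of the two patterns. It does not use MKL: `copy_threaded` and `copy_serial` are hypothetical stand-ins for a threaded and a serial build of a small library kernel, and the row-wise layout is an assumption made only for illustration. Compile with OpenMP enabled (for example `-fopenmp` with GCC) to see the difference in a profiler; without it the pragmas are ignored and both variants run serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a threaded library kernel (NOT an MKL function).
// It opens its own parallel region on every call, as a threaded library
// may do internally.
void copy_threaded(const double* src, double* dst, std::size_t n) {
    // Signed loop index keeps this valid for older OpenMP implementations.
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(n); ++i) {
        dst[i] = src[i];
    }
}

// The same kernel without threading, comparable to linking a serial library.
void copy_serial(const double* src, double* dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = src[i];
    }
}

// Pattern A: serial outer loop, one short parallel region per iteration.
// Each call pays the start/stop overhead of a parallel region.
std::vector<double> copy_rows_many_regions(const std::vector<double>& in,
                                           std::size_t rows, std::size_t cols) {
    assert(in.size() == rows * cols);
    std::vector<double> out(in.size());
    for (std::size_t r = 0; r < rows; ++r) {
        copy_threaded(in.data() + r * cols, out.data() + r * cols, cols);
    }
    return out;
}

// Pattern B: one outer parallel region over the rows, serial kernel inside.
// The parallel region is entered once, and the rows are independent.
std::vector<double> copy_rows_one_region(const std::vector<double>& in,
                                         std::size_t rows, std::size_t cols) {
    assert(in.size() == rows * cols);
    std::vector<double> out(in.size());
    #pragma omp parallel for
    for (long r = 0; r < static_cast<long>(rows); ++r) {
        const std::size_t offset = static_cast<std::size_t>(r) * cols;
        copy_serial(in.data() + offset, out.data() + offset, cols);
    }
    return out;
}
```

Both functions produce the same output; only the number of parallel regions differs. With real MKL, the equivalent of `copy_serial` is the serial library chosen at link time; restricting MKL to one thread at run time (for example through the `MKL_NUM_THREADS` environment variable or `mkl_set_num_threads(1)`) is another way to try pattern B without relinking.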
Kirill Rogozhin On

Transitions are shown for synchronization objects. In this case the waiting time most likely comes from the OpenMP runtime inside the MKL library. More recent versions of VTune report this time as overhead and spin time.