I am trying to parallelize one hotspot of my C++ program with OpenMP, but it does not scale. It needs 25 seconds on 1 thread, and I only get down to 21 seconds with 2 threads. I ran a Locks & Waits analysis with Intel VTune Amplifier, but it does not really help me. It looks like:
In particular, I do not understand where the mkl_blas_dcopy comes from and what is calling it (even if I remove my parallel region, I still see this call and a second thread in the timeline).
I tried to get more information out of the Top-Down Tree, but it is not really helpful for me.
An Advanced Hotspots analysis did not give me more information either. How should I approach this issue in order to identify the problem?
Additional information: Previously the overall runtime was much worse. I then made a lot of optimisations to the serial code, which improved performance, but since then my code no longer scales.
Many thanks in advance!
Edit: Here is the timeline as well. No transitions are shown, no matter how far I zoom in. In this case I used another test case with 8 threads.