I'm trying to optimise a part of code that is called within an OpenMP parallel region. I did a memory access analysis with Intel VTune Amplifier 2015 and am a bit confused about the result. I repeated the analysis at optimisation levels O1, O2 and O3 with Intel Composer 2015, but the outcome is the same: VTune claims that most LLC misses occur in the following three lines:
__attribute__ ((aligned(64))) double x[4] = {1.e0,-1.e0, 0.e0, 0.e0};
__attribute__ ((aligned(64))) double y[4] = {0.e0,-1.e0, 1.e0, 0.e0};
__attribute__ ((aligned(64))) double z[4] = {0.e0, 0.e0,-1.e0, 1.e0};
The arrays are aligned because they are accessed later in vectorised code. I can't publish the whole code here because it is copyrighted. These three lines account for about 75% of the total cache misses within this function, even though there are lots of calculations and accesses to other arrays later in the code. With O0 optimisation I get much more plausible results, because the misses are attributed to lines like
res[a] += tempres[start + b] * fact;
But at O0 the whole execution takes much longer, which is to be expected. So which results can I trust? Or is there alternative software I could use to cross-check?
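To give an idea of the structure without posting the copyrighted code, the function looks roughly like this (everything except the three tables is a placeholder I made up for this question):

void compute(double *res, const double *tempres, int start, double fact)
{
    /* placeholder sketch, not the real code; called from inside an
       OpenMP parallel region */
    __attribute__ ((aligned(64))) double x[4] = {1.e0,-1.e0, 0.e0, 0.e0};
    __attribute__ ((aligned(64))) double y[4] = {0.e0,-1.e0, 1.e0, 0.e0};
    __attribute__ ((aligned(64))) double z[4] = {0.e0, 0.e0,-1.e0, 1.e0};

    /* ... many vectorised calculations using x, y, z and other arrays ... */

    for (int a = 0; a < 4; ++a)
        for (int b = 0; b < 4; ++b)
            res[a] += tempres[start + b] * fact;  /* the line flagged at O0 */
}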
Thanks in advance!
Looking only at percentages can be misleading: 75% of 100 misses is fewer than 10% of 1000 misses. You'll need to compare the absolute number of misses between the builds.
Cache behaviour is also difficult to intuit, particularly in combination with compiler optimisations and CPU pipelines.
It looks like the optimised builds mostly miss the cache on the initialisation of those tables (not too surprising: those stores are the first touch of that memory) but manage to keep almost the entire computation in-cache, so I don't see a problem here. Bear in mind, too, that with optimisation enabled the hardware events are often attributed to a nearby source line rather than the exact instruction, so the line-level numbers are approximate.
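If the absolute numbers show that those initialisation misses do matter, one cheap experiment (a sketch under the assumption that x, y and z really are compile-time constants, as your snippet suggests) is to make the tables static const, so they live in read-only data and are initialised once at program start instead of being written to the stack on every call:

/* sketch: read-only tables are materialised once in .rodata rather than
   stored to the stack on each call, and constant data is safe to share
   between OpenMP threads without synchronisation */
__attribute__ ((aligned(64))) static const double x[4] = {1.e0,-1.e0, 0.e0, 0.e0};
__attribute__ ((aligned(64))) static const double y[4] = {0.e0,-1.e0, 1.e0, 0.e0};
__attribute__ ((aligned(64))) static const double z[4] = {0.e0, 0.e0,-1.e0, 1.e0};

Whether that actually moves the numbers is exactly the kind of thing the absolute counts will tell you.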
If you want to be sure, you'll need to study the generated assembly (icc's -S option will emit it) and the reference manuals for your hardware.
Searching for a tool that confirms your expectations is largely a waste of time, since you can't be sure that the confirming tool isn't the one that's in error.