I am working on a project where we were asked to write a simple OpenMP code to parallelize a program that solves differential equations. We were also asked to test the performance of the code with and without compiler optimizations. I'm working with the Sun CC compiler, so for the optimized version I used the options
-xopenmp -fast
and for the non-optimized version
-xopenmp=noopt
Not surprisingly, the running time with compiler optimization on was much lower than in the other case. What surprises me is that the scaling performance is much better for the non-optimized version. Here, by performance I mean the speed-up coefficient, that is, the ratio of the running time of the program run on 1 processor to the running time of the program run on M processors.
It was hinted that this could be because the optimized version is memory-bound, while the non-optimized version is CPU-bound. I am not sure how being memory- or CPU-bound would influence the scaling capability of my code. Do you have any suggestions?
On most multi-processor systems, multiple CPU cores share a single path to memory. A given binary has a certain inherent computational intensity (calculations per byte of memory accessed) per thread. Once the aggregate operation rate across the cores demands more memory bandwidth than that shared path can supply, the code stops scaling with additional cores. For a good framework for reasoning about this kind of issue, look up the 'roofline model'.
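As a minimal sketch of the effect (a hypothetical kernel, not your actual code): an explicit Euler-style update does only a couple of floating-point operations per array element but moves three doubles through memory for each of them, so its arithmetic intensity is roughly 0.08 flop/byte. Once a few threads together saturate the shared bandwidth, extra threads add nothing.

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 20000000   /* large enough that the arrays do not fit in cache */

int main(void)
{
    double *u    = malloc(N * sizeof *u);
    double *dudt = malloc(N * sizeof *dudt);
    const double dt = 1e-3;

    for (long i = 0; i < N; i++) { u[i] = 1.0; dudt[i] = -u[i]; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        u[i] += dt * dudt[i];   /* 2 flops, ~24 bytes of memory traffic per element */
    double t1 = omp_get_wtime();

    printf("threads=%d  time=%.4f s\n", omp_get_max_threads(), t1 - t0);
    free(u);
    free(dudt);
    return 0;
}
```

If you time something like this with cc -xopenmp -fast for 1, 2, 4, ... threads, you would typically see the speed-up flatten out well before the core count does, because the loop is limited by bandwidth rather than by arithmetic.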
There are two changes I'd expect to see from enabling optimization. One is that the computational intensity should increase somewhat, if the optimizer performs any sort of loop blocking to reduce memory traffic; that would help scaling, since each core then asks less of the shared memory path. The other is that the raw operation rate should increase through better identification of vectorization opportunities and the subsequent instruction selection and scheduling; that hurts scaling, because the memory-bandwidth ceiling is reached with fewer cores. These two effects pull scaling efficiency in opposite directions, and the latter clearly dominates in your case.
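You can see why with a back-of-the-envelope roofline calculation. The numbers below (per-core peak rates, shared bandwidth, kernel intensity) are purely illustrative assumptions, not measurements of your machine: attainable performance is min(cores × per-core peak, bandwidth × intensity), so a higher per-core peak just means the memory roof is hit at a lower core count.

```c
#include <stdio.h>

int main(void)
{
    const double peak_noopt = 1.0;   /* GFLOP/s per core, assumed (no optimization)  */
    const double peak_opt   = 4.0;   /* GFLOP/s per core, assumed (-fast, vectorized) */
    const double bandwidth  = 10.0;  /* GB/s of memory bandwidth shared by all cores  */
    const double intensity  = 0.25;  /* flop/byte of the kernel, assumed              */
    const double roof       = bandwidth * intensity;   /* memory-imposed ceiling */

    for (int cores = 1; cores <= 8; cores *= 2) {
        double noopt = cores * peak_noopt;
        double opt   = cores * peak_opt;
        printf("%d cores: noopt %.2f GFLOP/s, opt %.2f GFLOP/s (roof %.2f)\n",
               cores,
               noopt < roof ? noopt : roof,
               opt   < roof ? opt   : roof,
               roof);
    }
    return 0;
}
```

With these illustrative numbers, the non-optimized binary climbs from 1 to about 2.5 GFLOP/s (a 2.5x speed-up) before hitting the roof, while the optimized one sits at the roof already on a single core and shows no speed-up at all: faster in absolute terms, but with much worse scaling.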