Why doesn't the optimization flag (-O3) speed up quadruple precision calculations?


I have a high-precision ODE (ordinary differential equations) solver written in C++. I do all calculations with a user-defined type real_type, declared by a typedef in the header:

typedef long double real_type;

I decided to change the long double type to __float128 for more accuracy. In addition, I included quadmath.h and replaced all standard math functions with their counterparts from libquadmath.

If the "long double" version is built without any optimization flags, a reference ODE is solved in 77 seconds. If it is built with the -O3 flag, the same ODE is solved in 25 seconds. Thus -O3 speeds up the calculation roughly threefold.

But the "__float128" version built without flags solves a similar ODE in 190 seconds, and with -O3 in 160 seconds (~15% difference). Why does -O3 have such a weak effect on quadruple precision calculations? Maybe I should use other compiler flags or include other libraries?


There are 3 answers

Sebastian Redl (BEST ANSWER)

Compiler optimizations work like this: the compiler recognizes certain patterns in your code and replaces them with equivalent but faster versions. Without knowing exactly what your code looks like and what optimizations the compiler performs, we can't say what the compiler is missing.

It's likely that several optimizations the compiler knows how to perform on native floating-point types and their operations don't carry over to __float128 and the library implementations of its operations. It might not recognize these operations for what they are, and it may be unable to look into the library implementations (you could try compiling the library together with your program and enabling link-time optimization).

David Schwartz

The same optimizations provided substantially the same benefit. The percentage went down just because the math itself took longer.

To believe the optimizations should be the same percentage, you'd have to believe that making the math take longer would somehow make the optimizer find more savings. Why would you think that?

Theodoros Chatzigiannakis

If your target is the x86 architecture, then in GCC __float128 is an actual quadruple precision FP type, while long double is the x87 80-bit FP type (double extended).

It is reasonable that math with smaller precision types can be faster than math with larger precision types. It is also reasonable that math with native hardware types can be faster than math with non-native types.