I am evaluating the use (clearing and querying) of floating-point exceptions in performance-critical ("hot") code. Looking at the generated binary, I noticed that neither GCC nor Clang expands the call into the inline instruction sequence I would expect; instead, both emit a call to the runtime library. This is prohibitively expensive for my application.
Consider the following minimal example:
#include <fenv.h>
#pragma STDC FENV_ACCESS ON
inline int fetestexcept_inline(int e)
{
unsigned int mxcsr;
asm volatile ("vstmxcsr" " %0" : "=m" (*&mxcsr));
return mxcsr & e & FE_ALL_EXCEPT;
}
double f1(double a)
{
double r = a * a;
if(r == 0 || fetestexcept_inline(FE_OVERFLOW)) return -1;
else return r;
}
double f2(double a)
{
double r = a * a;
if(r == 0 || fetestexcept(FE_OVERFLOW)) return -1;
else return r;
}
And the output as produced by GCC: https://godbolt.org/z/jxjzYY
The compiler evidently knows that it can use the AVX instructions available on the target CPU family (it uses "vmulsd" for the multiplication). However, no matter which optimization flags I try, it always emits the much more expensive call into glibc rather than the inline assembly that, as far as I understand, should do what the corresponding glibc function does.
This is not intended as a complaint; I am OK with adding the inline assembly. I just wonder whether there is a subtle difference I am overlooking that could be a bug in the inline-assembly version.
It's required to support long double arithmetic. fetestexcept needs to merge the SSE and x87 FPU states because long double operations only update the FPU status word, not the MXCSR register. Therefore, the benefit from inlining is somewhat reduced.
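To illustrate the merge described above, here is a minimal sketch of an inline variant that reads both the x87 status word and MXCSR. It assumes x86-64 with GCC or Clang extended inline assembly; the name fetestexcept_merged is hypothetical and not part of any library. It relies on the exception flags occupying the same low bits in both registers, so the two values can simply be OR-ed together before masking.

```c
#include <fenv.h>

/* Hypothetical helper (not a library function): tests exception flags
 * raised by either x87 (long double) or SSE/AVX (float, double) code.
 * Assumes x86-64 and GCC/Clang extended asm syntax. */
static inline int fetestexcept_merged(int excepts)
{
    unsigned short x87_status; /* x87 FPU status word, 16 bits */
    unsigned int mxcsr;        /* SSE control/status register  */

    /* Store the x87 status word without checking pending exceptions. */
    __asm__ volatile ("fnstsw %0" : "=m" (x87_status));
    /* Store MXCSR (the legacy encoding also works without AVX). */
    __asm__ volatile ("stmxcsr %0" : "=m" (mxcsr));

    /* The exception flags sit in the same low bits of both registers. */
    return (x87_status | mxcsr) & excepts & FE_ALL_EXCEPT;
}
```

If the hot path never touches long double, the MXCSR-only version from the question should report the same flags for that code; the extra fnstsw only matters once x87 arithmetic is involved.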