Where is the source of imprecise calculation in the assembler code of gcc -Ofast compared with -O3?


The following 3 lines give imprecise results with "gcc -Ofast -march=skylake":

int32_t  i = -5;
const double  sqr_N_min_1 = (double)i * i;
1. - ((double)i * i) / sqr_N_min_1

Obviously, sqr_N_min_1 becomes 25., and in the 3rd line (-5 * -5) / 25 should evaluate to 1., so that the overall result of the 3rd line is exactly 0.. Indeed, this is the case with the compiler options "gcc -O3 -march=skylake".

But with "-Ofast" the last line yields -2.081668e-17 instead of 0. and with other i than -5 (e.g. 6 or 7) it gets other very small positive or negative random deviations from 0.. My question is: Where exactly is the source of this imprecision?

To investigate this, I wrote a small test program in C:

#include <stdint.h>      /* int32_t */
#include <stdio.h>
#define MAX_SIZE 10

double W[MAX_SIZE];

int main( int argc, char *argv[] )
{
  volatile int32_t n = 6; /* try 6 7 or argv[1][0]-'0' */
  double           *w = W;
  int32_t          i = 1 - n;
  const int32_t    end = n - 1;
  const double     sqr_N_min_1 = (double)i * i;

  /* Here is the crucial part. The loop avoids the compiler replacing it with constants: */
  do {
    *w++ = 1. - ((double)i * i) / sqr_N_min_1;
  } while ( (i+=2) <= end );

  /* Then, show the results (only the 1st and last output lines matter): */
  w = W;
  i = 1 - n;
  do {
    fprintf( stderr, "%e\n", *w++ );
  } while ( (i+=2) <= end );

  return( 0 );
}

Godbolt shows me the assembly produced by "x86-64 gcc 9.3" with the options "-Ofast -march=skylake" vs. "-O3 -march=skylake". Please inspect the five columns of the website (1. source code, 2. assembly with "-Ofast", 3. assembly with "-O3", 4. output of the 1st assembly, 5. output of the 2nd assembly):

Godbolt site with five columns

As you can see, the differences between the two assemblies are obvious, but I can't figure out exactly where the imprecision comes from. So the question is: which assembler instruction(s) are responsible for it?

A follow-up question is: Is there a possibility to avoid this imprecision with "-Ofast -march=skylake" by reformulating the C-program?


There are 2 answers

Eric Postpischil

From the comments, it seems that, for -O3, the compiler computes 1. - ((double)i * i) / sqr_N_min_1:

  1. Convert i to double and square it.
  2. Divide that by sqr_N_min_1.
  3. Subtract that from 1.

and, for -Ofast, computes it:

  1. Prior to the loop, calculate the reciprocal of sqr_N_min_1.
  2. Convert i to double and square it.
  3. Compute the fused multiply-subtract of 1 minus the square times the reciprocal.

The latter improves speed because it performs the division only once, and multiplication is much faster than division on the target processors. On top of that, the fused operation is faster than a separate multiplication and subtraction.

The error occurs because the reciprocal operation introduces a rounding error that is not present in the original expression (1/25 is not exactly representable in a binary format, while 25/25 of course is). This is why the compiler does not make this optimization when it is attempting to provide strict floating-point semantics.

Additionally, simply multiplying the reciprocal by 25 would erase the error. (This is somewhat by “chance,” as rounding errors vary in complicated ways. 1./25*25 produces 1, but 1./49*49 does not.) But the fused operation produces a more accurate result (it produces the result as if the product were computed exactly, with rounding occurring only after the subtraction), so it preserves the error.
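
To make the difference visible in plain C, here is a small sketch (my own illustration, not the exact instruction sequence gcc emits; compile it without fast-math, e.g. plain -O2, and link with -lm if needed). fma() from <math.h> computes a*b + c with a single rounding, like the fused operation described above, so the second result reproduces the kind of tiny deviation the question reports:

#include <math.h>     /* fma */
#include <stdio.h>

int main( void )
{
  double i = -5.0;
  double sqr_N_min_1 = i * i;                 /* 25., exactly representable        */
  double recip       = 1. / sqr_N_min_1;      /* 1/25 is NOT exactly representable */

  double strict = 1. - (i * i) / sqr_N_min_1; /* the -O3 shape: 25/25 = 1 exactly  */
  double fast   = fma( -(i * i), recip, 1. ); /* the -Ofast shape: 1 - 25*recip
                                                 with a single rounding, so the
                                                 tiny error in recip survives      */
  fprintf( stderr, "%e\n%e\n", strict, fast ); /* 0.000000e+00 vs. a tiny nonzero value */
  return( 0 );
}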

Peter Cordes

Comments and another answer have pointed out the specific transformation that's happening in your case, with a reciprocal and an FMA instead of a division.

Is there a possibility to avoid this imprecision with "-Ofast -march=skylake" by reformulating the C-program?

Not in general.

-Ofast is (currently) a synonym for -O3 -ffast-math.
See https://gcc.gnu.org/wiki/FloatingPointMath

Part of -ffast-math is -funsafe-math-optimizations, which as the name implies, can change numerical results. (With the goal of allowing more optimizations, like treating FP math as associative to allow auto-vectorizing the sum of an array with SIMD, and/or unrolling with multiple accumulators, or even just rearranging a sequence of operations within one expression to combine two separate constants.)
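
As a hedged illustration of that re-association (my own example, not taken from the question's code): the first function below sums an array strictly left to right, while the second uses two accumulators the way an unroller or vectorizer might; the additions are re-associated, so the roundings happen in a different order and the results can differ slightly:

/* Strict: one fixed left-to-right evaluation order */
double sum_strict( const double *a, int n )
{
  double s = 0.;
  for ( int i = 0; i < n; i++ )
    s += a[i];
  return s;
}

/* What -ffast-math may effectively allow: two accumulators, as an
   unrolled or SIMD loop would use; the rounding errors accumulate
   in a different order, so the result may differ from sum_strict */
double sum_reassociated( const double *a, int n )
{
  double s0 = 0., s1 = 0.;
  int i;
  for ( i = 0; i + 1 < n; i += 2 ) {
    s0 += a[i];
    s1 += a[i+1];
  }
  if ( i < n )
    s0 += a[i];
  return s0 + s1;
}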

This is exactly the kind of speed-over-accuracy optimization you're asking for by using that option. If you don't want that, don't enable all of the -ffast-math sub-options, only the safe ones like -fno-math-errno / -fno-trapping-math. (See How to force GCC to assume that a floating-point expression is non-negative?)


There's no way of formulating your source to avoid all possible problems.

Possibly you could use volatile tmp vars all over the place to defeat optimization between statements, but that would make your code slower than regular -O3 with the default -fno-fast-math. And even then, calls to library functions like sin or log may resolve to versions that assume the args are finite, not NaN or infinity, because of -ffinite-math-only.
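
For completeness, a sketch of that volatile-temporary idea as a hypothetical drop-in replacement for the inner loop of the test program above (not a recommendation): the volatile forces the quotient to be stored and reloaded, so the subtraction can no longer be fused with the multiply across the statement boundary. Note that -freciprocal-math may still compute the quotient itself as a multiply by a precomputed reciprocal, and the extra store/reload makes the loop slower than plain -O3:

  do {
    /* volatile blocks optimization across the statement boundary,
       at the cost of a store/reload on every iteration */
    volatile double q = ((double)i * i) / sqr_N_min_1;
    *w++ = 1. - q;
  } while ( (i+=2) <= end );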

GCC issue with -Ofast? points out another effect: isnan() is optimized into a compile-time 0.
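
To see that last effect in isolation, here is a small sketch (my own example): with -Ofast, which implies -ffinite-math-only, gcc may fold the isnan() call to a compile-time 0, so the program can claim a genuine NaN is not a NaN:

#include <math.h>
#include <stdio.h>

int main( void )
{
  volatile double zero = 0.;               /* volatile, so 0./0. happens at run time */
  double x = zero / zero;                   /* NaN */
  fprintf( stderr, "%d\n", isnan( x ) );    /* 1 with -O3; typically 0 with -Ofast */
  return( 0 );
}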