Why this AVX2 slowdown with FMA x86 MS C Compiler?

84 views Asked by At

These are three classical starters for solving Kepler's equation and S3 provides an interesting example of deceptively short code fragments with odd timing behaviour. This problem concerns compiling them on the Microsoft C compiler x86 mode where anomalous behaviour is observed - namely that the AVX2 code at maximum optimisation for the notionally simpler code S3 takes much longer to execute than S9. No such problem with Intel (can't tell with GCC since x87 sin problem in Mingw prevents me from testing it). All starters are functions of eccentricity 0 <= e <= 1 and Mean anomaly -pi <= M <= pi These classical ones are by Gooding & Odel (1986)

double S1(double e, double M)
{ 
  if(M>0) return M + e; else return M - e;
  // simplest starter guaranteed to converge (eventually)
}
// convergence is not guaranteed with either S3 or S9 when e -> 1

double Gooding_S3(double e, double M)
{
  return M + e*sin(M)*(1+e*cos(M));  
// slow and fairly useless S9 is much more accurate!
}

double Gooding_S9(double e, double M)
{
 // originally written in their paper as sin(M)/sqrt(1-2*e*cos(M)+e*e)
 // it is one form of Halley's method but derived as the root of a
 // simple quadratic approximation for E-M. Rather good and fast e<0.7!
  double y;
  y = 1-e;
  if (M == 0.0) return M; // defend against divide by zero
  s = sin(M/2);
  return M+e*sin(M)/sqrt(y*y+4*e*s*s);  // more accurate form (very wrong when e -> 1)
}

A quick inspection of the number of operations and their nature suggests that they are in order of increasing execution time, but this isn't always the case! S9 is sometimes faster. Their operation counts subject to my counting errors are

  1. S1 +
  2. S3 ++ *** sincos
  3. S9 -++ ***** / sin sin sqrt

Using standard _cdecl calling convention x86 code generation and MSC 2022 compiler today I get timings as follows for S3 and S9 (S1 is ~8 cycles from function call overheads). The 74 for AVX2 is not a typo! Problem goes away if code generation is x64.

Starter x87 SSE2 AVX AVX2 *_64
S3 199 23 24 74 23
S9 206 50 48 47 30

I have isolated the small code fragments that differ from S3 between AVX and AVX2 on x86 codegen but cannot see why one should be so much worse than the other. I'm also a bit amazed that S9 is quite so fast (and timing stable). x64 code is consistent and faster so I remain puzzled by it. I ran into it again through generating x87 trig code for the other question.

This is the much faster AVX code for S3 from sincos (no FMA)

call        ___libm_sse2_sincos_ (0BF7D50h)    
vshufpd     xmm1,xmm0,xmm0,1  
vmulsd      xmm1,xmm1,mmword ptr [e]  
vmulsd      xmm0,xmm0,mmword ptr [e]  
vaddsd      xmm1,xmm1,mmword ptr [__real@3ff0000000000000 (0BFBFB8h)]  
vmulsd      xmm0,xmm1,xmm0  
vaddsd      xmm0,xmm0,mmword ptr [M]  
vmovsd      qword ptr [esp],xmm0  
fld         qword ptr [esp]  

And this is the slow S3 code from AVX2 using FMA apparently badly. I don't understand why.

 call        ___libm_sse2_sincos_ (0672600h)  
 vmovsd      xmm2,qword ptr [e]  
 vpermilpd   xmm1,xmm0,1  
 vfmadd213sd xmm2,xmm1,mmword ptr [__real@3ff0000000000000 (0676F50h)]  
 vmulsd      xmm0,xmm0,mmword ptr [e]  
 vfmadd213sd xmm0,xmm2,mmword ptr [M]  
 vmovsd      qword ptr [esp],xmm0  
 fld         qword ptr [esp] 2022 

Why is the AVX2 code so much slower than AVX/SSE2 and why does it only happen for x86 code?

I'd be interested in suggestions for avoiding this sort of odd slowdown where shorter smarter code from an improved instruction set might result in worse performance. It seems to be much less of a problem with x64 code generation which might hint at it possibly being related to the old x86 linkage convention of returning FP results in ST(0).

It seems to be associated with transcendental library function calls but that may just be a sampling bias on my part.

Thanks for any enlightenment.

0

There are 0 answers