These are three classical starters for solving Kepler's equation and S3 provides an interesting example of deceptively short code fragments with odd timing behaviour. This problem concerns compiling them on the Microsoft C compiler x86 mode where anomalous behaviour is observed - namely that the AVX2 code at maximum optimisation for the notionally simpler code S3 takes much longer to execute than S9. No such problem with Intel (can't tell with GCC since x87 sin problem in Mingw prevents me from testing it). All starters are functions of eccentricity 0 <= e <= 1 and Mean anomaly -pi <= M <= pi These classical ones are by Gooding & Odel (1986)
double S1(double e, double M)
{
if(M>0) return M + e; else return M - e;
// simplest starter guaranteed to converge (eventually)
}
// convergence is not guaranteed with either S3 or S9 when e -> 1
double Gooding_S3(double e, double M)
{
return M + e*sin(M)*(1+e*cos(M));
// slow and fairly useless S9 is much more accurate!
}
double Gooding_S9(double e, double M)
{
// originally written in their paper as sin(M)/sqrt(1-2*e*cos(M)+e*e)
// it is one form of Halley's method but derived as the root of a
// simple quadratic approximation for E-M. Rather good and fast e<0.7!
double y;
y = 1-e;
if (M == 0.0) return M; // defend against divide by zero
s = sin(M/2);
return M+e*sin(M)/sqrt(y*y+4*e*s*s); // more accurate form (very wrong when e -> 1)
}
A quick inspection of the number of operations and their nature suggests that they are in order of increasing execution time, but this isn't always the case! S9 is sometimes faster. Their operation counts subject to my counting errors are
- S1 +
- S3 ++ *** sincos
- S9 -++ ***** / sin sin sqrt
Using standard _cdecl calling convention x86 code generation and MSC 2022 compiler today I get timings as follows for S3 and S9 (S1 is ~8 cycles from function call overheads). The 74 for AVX2 is not a typo! Problem goes away if code generation is x64.
Starter | x87 | SSE2 | AVX | AVX2 | *_64 |
---|---|---|---|---|---|
S3 | 199 | 23 | 24 | 74 | 23 |
S9 | 206 | 50 | 48 | 47 | 30 |
I have isolated the small code fragments that differ from S3 between AVX and AVX2 on x86 codegen but cannot see why one should be so much worse than the other. I'm also a bit amazed that S9 is quite so fast (and timing stable). x64 code is consistent and faster so I remain puzzled by it. I ran into it again through generating x87 trig code for the other question.
This is the much faster AVX code for S3 from sincos (no FMA)
call ___libm_sse2_sincos_ (0BF7D50h)
vshufpd xmm1,xmm0,xmm0,1
vmulsd xmm1,xmm1,mmword ptr [e]
vmulsd xmm0,xmm0,mmword ptr [e]
vaddsd xmm1,xmm1,mmword ptr [__real@3ff0000000000000 (0BFBFB8h)]
vmulsd xmm0,xmm1,xmm0
vaddsd xmm0,xmm0,mmword ptr [M]
vmovsd qword ptr [esp],xmm0
fld qword ptr [esp]
And this is the slow S3 code from AVX2 using FMA apparently badly. I don't understand why.
call ___libm_sse2_sincos_ (0672600h)
vmovsd xmm2,qword ptr [e]
vpermilpd xmm1,xmm0,1
vfmadd213sd xmm2,xmm1,mmword ptr [__real@3ff0000000000000 (0676F50h)]
vmulsd xmm0,xmm0,mmword ptr [e]
vfmadd213sd xmm0,xmm2,mmword ptr [M]
vmovsd qword ptr [esp],xmm0
fld qword ptr [esp] 2022
Why is the AVX2 code so much slower than AVX/SSE2 and why does it only happen for x86 code?
I'd be interested in suggestions for avoiding this sort of odd slowdown where shorter smarter code from an improved instruction set might result in worse performance. It seems to be much less of a problem with x64 code generation which might hint at it possibly being related to the old x86 linkage convention of returning FP results in ST(0).
It seems to be associated with transcendental library function calls but that may just be a sampling bias on my part.
Thanks for any enlightenment.