typedef float float4 __attribute__((vector_size(16)));
float4 divvs(float4 vector, float scalar) {
    return vector / scalar;
}
compiles to
// x86 gcc/clang -O3
shufps xmm1, xmm1, 0
divps xmm0, xmm1
// arm gcc/clang -O3
dup v1.4s, v1.s[0]
fdiv v0.4s, v0.4s, v1.4s
// x86 gcc -O3 -ffast-math
shufps xmm1, xmm1, 0
rcpps xmm2, xmm1
mulps xmm1, xmm2
mulps xmm1, xmm2
addps xmm2, xmm2
subps xmm2, xmm1
mulps xmm0, xmm2
// x86 clang -O3 -ffast-math
movss xmm2, dword ptr [rip + .LCPI0_0] # 1.0f
divss xmm2, xmm1
shufps xmm2, xmm2, 0
mulps xmm0, xmm2
// arm gcc -O3 -ffast-math (same code as without it though)
dup v1.4s, v1.s[0]
fdiv v0.4s, v0.4s, v1.4s
// arm clang -O3 -ffast-math
fmov v2.4s, #1.00000000
dup v1.4s, v1.s[0]
fdiv v1.4s, v2.4s, v1.4s
fmul v0.4s, v0.4s, v1.4s
My understanding is that -ffast-math enables reciprocal approximation instead of division. My other understanding is that scalar division and reciprocal instructions have at most a 1 cycle latency difference from their respective vector counterparts (Intel's intrinsics guide says so; on arm it's a bit harder to find numbers, but this and my own benchmarks agree on Apple Silicon at least).
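To make the clang case concrete: as far as I can tell, its -ffast-math output corresponds roughly to the source-level rewrite below, where the division by the scalar is replaced by one reciprocal and a vector multiply. The name divvs_recip is made up for illustration, and this is only a sketch of the transformation, not something taken from the compiler.

```c
typedef float float4 __attribute__((vector_size(16)));

/* Hypothetical hand-written equivalent of what clang -ffast-math appears
   to emit: compute the reciprocal once with a real division, then
   multiply every lane by it. No estimate instruction is involved. */
float4 divvs_recip(float4 vector, float scalar) {
    float r = 1.0f / scalar;   /* divss on x86; arm clang does this after the dup */
    return vector * r;         /* broadcast + mulps / fmul */
}
```

Note that the result can differ from a true division in the last bit, because the reciprocal is rounded before the multiply. That is why this rewrite is only allowed under the relaxed floating-point flags.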
My questions are:
- What's going on with the x86 gcc version? What I'm thinking is that rcpps on its own has too much error for -ffast-math, and whatever gcc does here gets it down below that threshold. I can't quite wrap my head around why it multiplies by its own reciprocal though, nor the code that follows. I'm pretty curious about the math.
- Aren't both clang versions just unconditionally worse than the one without -ffast-math? Unfortunately I don't have an Intel machine handy to benchmark. The arm clang version takes 25% longer on my M1 Mac though (putting fmov outside the loop naturally makes little difference; fdiv s0 and fdiv v0.4s are exactly the same on their own). Intel's guide says divss and divps have the same latency too, so even if reciprocal estimate instructions didn't exist, how could it not be better in all respects to just shufps and divps? Just in principle, why does -ffast-math put clang through all this pain to multiply by a reciprocal rather than divide, when the reciprocal comes from a division?
- Why doesn't either arm version use frecpe (arm's rcpps)? It's 10x faster. I checked -march=armv9.3-a, it isn't that. The only thing I can think of is that there's some standard for accuracy -ffast-math still has to meet, and frecpe doesn't meet it. But even then, assuming my theory in question #1 is correct, the only way gcc could justify the inconsistency between x86 and arm is if it somehow took fewer resources to fdiv than it does to get the error from frecpe down to the error in xmm2 by the end of the x86 version, presumably because rcpps is more accurate. That definitely doesn't sound right. Unfortunately I can't find error numbers for frecpe anywhere.
- If arm clang must insist on fdiv then fmul, then can't it skip the dup and just fmul v0.4s, v0.4s, v1.s[0]? What's wrong with that?
- To be intentionally vague, what's the "best" code to do this?
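
For what it's worth, the instruction order in the x86 gcc -O3 -ffast-math listing above matches one Newton-Raphson step on the reciprocal: with x0 the estimate of 1/d, it computes x1 = 2*x0 - d*x0*x0, which is x0*(2 - d*x0) expanded. If x0 = (1 + e)/d, then x1 = (1 - e*e)/d, so the relative error is squared. The scalar sketch below follows the same order of operations. crude_recip is a hypothetical stand-in for a hardware estimate (an exact reciprocal truncated to 12 mantissa bits, which is an assumption chosen to be in the same ballpark as rcpps, not its real behaviour), and refine_recip is likewise a made-up name.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a hardware reciprocal estimate: take the
   exact reciprocal and clear the low 11 mantissa bits, leaving 12. */
static float crude_recip(float d) {
    float r = 1.0f / d;
    uint32_t bits;
    memcpy(&bits, &r, sizeof bits);
    bits &= 0xFFFFF800u;           /* keep sign, exponent, top 12 mantissa bits */
    memcpy(&r, &bits, sizeof r);
    return r;
}

/* One Newton-Raphson step, in the order the gcc listing performs it. */
static float refine_recip(float d, float x0) {
    float t = d * x0;              /* mulps xmm1, xmm2 : d * x0        */
    t = t * x0;                    /* mulps xmm1, xmm2 : d * x0 * x0   */
    float two_x0 = x0 + x0;        /* addps xmm2, xmm2 : 2 * x0        */
    return two_x0 - t;             /* subps xmm2, xmm1 : 2*x0 - d*x0^2 */
}
```

The final mulps xmm0, xmm2 in the listing is then just vector * x1. Calling refine_recip(d, crude_recip(d)) and comparing d * x against 1 before and after the step shows the error dropping from roughly the 12-bit level to near single-precision rounding.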