I'm working on an OpenCL benchmark written in C. Currently, it measures the fused multiply-accumulate performance of both a CL device and the system's processor (using plain C code), and the two results are then cross-checked for accuracy.
I wrote the native code to take advantage of GCC's auto-vectorizer, and it works. However, I've noticed that GCC has some odd behavior with the "-march=native" flag.
This is my loop:
#include <stdlib.h> /* aligned_alloc (C11) */

#define BUFFER_SIZE_SQRT     4096
#define SQUARE(n)            ((n) * (n)) /* argument parenthesized so any expression expands safely */
#define ROUNDS_PER_ITERATION 48

static float* cpu_result_matrix(const float* a, const float* b, const float* c)
{
    float* res = aligned_alloc(16, SQUARE(BUFFER_SIZE_SQRT) * sizeof(float));

    const unsigned buff_size = SQUARE(BUFFER_SIZE_SQRT);
    const unsigned round_cnt = ROUNDS_PER_ITERATION;

    float lres;

    for(unsigned i = 0; i < buff_size; i++)
    {
        lres = 0;

        for(unsigned j = 0; j < round_cnt; j++)
        {
            lres += a[i] * ((b[i] * c[i]) + b[i]);
            lres += b[i] * ((c[i] * a[i]) + c[i]);
            lres += c[i] * ((a[i] * b[i]) + a[i]);
        }

        res[i] = lres;
    }

    return res;
}
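For context, the function is driven by a harness along these lines. This is a simplified sketch, not the project's actual code: the helper name, random fill, and timing details are illustrative, and it assumes the function and macros above are in scope.

#define _POSIX_C_SOURCE 199309L /* clock_gettime; may be needed under strict -std=c11 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Fill one operand buffer with random values in [0, 1]. */
static float* filled_buffer(void)
{
    float* buf = aligned_alloc(16, SQUARE(BUFFER_SIZE_SQRT) * sizeof(float));

    for(unsigned i = 0; i < SQUARE(BUFFER_SIZE_SQRT); i++)
        buf[i] = (float)rand() / (float)RAND_MAX;

    return buf;
}

int main(void)
{
    float* a = filled_buffer();
    float* b = filled_buffer();
    float* c = filled_buffer();

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    float* res = cpu_result_matrix(a, b, c);

    clock_gettime(CLOCK_MONOTONIC, &end);

    double secs = (end.tv_sec - start.tv_sec)
                + (end.tv_nsec - start.tv_nsec) / 1e9;

    /* Print a result element so the work can't be optimized away. */
    printf("CPU pass: %f s (res[0] = %f)\n", secs, res[0]);

    free(a); free(b); free(c); free(res);
    return 0;
}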
When I compile with "-march=native -Ofast" on a Broadwell system, I get nicely vectorized AVX code:
.L19:
vmovups ymm0, YMMWORD PTR [rcx+rdx]
mov eax, 48
vmovups ymm2, YMMWORD PTR [rdi+rdx]
vaddps ymm1, ymm0, ymm5
vmovups ymm3, YMMWORD PTR [rsi+rdx]
vaddps ymm4, ymm2, ymm5
vmulps ymm1, ymm1, ymm2
vfmadd132ps ymm4, ymm1, ymm0
vaddps ymm1, ymm3, ymm5
vmulps ymm0, ymm2, ymm0
vmulps ymm0, ymm0, ymm1
vfmadd132ps ymm4, ymm0, ymm3
vmovaps ymm1, ymm4
vxorps xmm0, xmm0, xmm0
.p2align 4,,10
.p2align 3
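(Both this listing and the one below are Intel-syntax assembly, dumped with something like "gcc -march=native -Ofast -S -masm=intel bench.c", where bench.c stands in for the actual source file.)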
Compiling with the same flags on a Piledriver system emits SSE2 instructions, but no AVX instructions, even though the architecture supports AVX. (To clarify the title: Broadwell and Piledriver are nothing alike as microarchitectures, but they support similar vector instruction set extensions, so the emitted code should be similar.)
.L19:
mov eax, 48
movups xmm0, XMMWORD PTR [rcx+rdx]
movups xmm2, XMMWORD PTR [r13+0+rdx]
movaps xmm4, xmm0
movaps xmm1, xmm2
movups xmm3, XMMWORD PTR [rsi+rdx]
addps xmm4, xmm5
addps xmm1, xmm5
mulps xmm4, xmm2
mulps xmm1, xmm0
mulps xmm0, xmm2
addps xmm1, xmm4
movaps xmm4, xmm1
mulps xmm4, xmm3
addps xmm3, xmm5
mulps xmm0, xmm3
addps xmm4, xmm0
pxor xmm0, xmm0
movaps xmm1, xmm4
.p2align 4,,10
.p2align 3
I can even compile the whole project with -march=broadwell and run it on the Piledriver system, and it works, with roughly a 100% performance gain over the -march=native build.
I'm compiling with GCC 5.1.0, and "-ftree-vectorizer-verbose" doesn't seem to work anymore, so the compiler's behavior is quite opaque. I haven't found any announcement of the flag being deprecated, so I'm not sure why it stopped working, and I'd really like to see what GCC's vectorizer is doing.
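Digging a little further, it looks like GCC 4.9 folded those reports into the "-fopt-info" options, so something like "gcc -march=native -Ofast -fopt-info-vec-all -c bench.c" may be the modern replacement (it should print both the loops that were vectorized and the ones that were missed), though I haven't verified that it covers everything the old flag did.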
The whole project is here: https://github.com/jakogut/clperf/tree/v0.1
AVX is disabled because the entire AMD Bulldozer family does not handle 256-bit AVX instructions efficiently: internally, the execution units are only 128 bits wide, so every 256-bit operation is split into two 128-bit operations, providing no benefit over 128-bit code.
To add insult to injury, on Piledriver there's a bug in the 256-bit store that reduces its throughput to about one every 17 cycles.
Your test case seems to be an anomaly: there are no 256-bit stores in that critical loop, which avoids the bug and (theoretically) leaves SSE on par with AVX for Piledriver.
The tie-breaker comes from the FMA3 instructions, which Piledriver supports. This is probably why the AVX loop does become faster on Piledriver.
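You can see that contraction explicitly by rewriting the inner-loop body with fmaf() from <math.h>. This is just a scalar sketch of the same dependency chain, not code from the project:

#include <math.h>

/* lres += a * ((b * c) + b) is one fused multiply-add feeding another:
 * inner = fmaf(b, c, b), then lres = fmaf(a, inner, lres). */
static inline float fma_round(float a, float b, float c, float lres)
{
    lres = fmaf(a, fmaf(b, c, b), lres);
    lres = fmaf(b, fmaf(c, a, c), lres);
    lres = fmaf(c, fmaf(a, b, a), lres);
    return lres;
}

Written this way, each round is six fused operations instead of six multiplies plus six adds, which is exactly the saving FMA contraction buys the AVX build.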
One thing you can try is
-mfma4
-mtune=bdver2
and see what happens.
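On a full compile line, that would be something like

gcc -Ofast -march=native -mfma4 -mtune=bdver2 -S -masm=intel bench.c

(bench.c again standing in for the real source file); you can then compare the emitted inner loop against the SSE2 listing above.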