I have a simple loop which takes the product of n complex numbers. As I perform this loop millions of times I want it to be as fast as possible. I understand that it's possible to do this quickly using SSE3 and gcc intrinsics like _mm_addsub_ps, but I'm interested in whether it's possible to get gcc to auto-vectorize code like this, a product of complex numbers:
#include <complex.h>

complex float f(complex float x[], int n) {
    complex float p = 1.0;
    for (int i = 0; i < n; i++)
        p *= x[i];
    return p;
}
The assembly you get from gcc -S -O3 -ffast-math is:
.file "test.c"
.section .text.unlikely,"ax",@progbits
.LCOLDB2:
.text
.LHOTB2:
.p2align 4,,15
.globl f
.type f, @function
f:
.LFB0:
.cfi_startproc
testl %esi, %esi
jle .L4
leal -1(%rsi), %eax
pxor %xmm2, %xmm2
movss .LC1(%rip), %xmm3
leaq 8(%rdi,%rax,8), %rax
.p2align 4,,10
.p2align 3
.L3:
movaps %xmm3, %xmm5
movaps %xmm3, %xmm4
movss (%rdi), %xmm0
addq $8, %rdi
movss -4(%rdi), %xmm1
mulss %xmm0, %xmm5
mulss %xmm1, %xmm4
cmpq %rdi, %rax
mulss %xmm2, %xmm0
mulss %xmm2, %xmm1
movaps %xmm5, %xmm3
movaps %xmm4, %xmm2
subss %xmm1, %xmm3
addss %xmm0, %xmm2
jne .L3
movaps %xmm2, %xmm1
.L2:
movss %xmm3, -8(%rsp)
movss %xmm1, -4(%rsp)
movq -8(%rsp), %xmm0
ret
.L4:
movss .LC1(%rip), %xmm3
pxor %xmm1, %xmm1
jmp .L2
.cfi_endproc
.LFE0:
.size f, .-f
.section .text.unlikely
.LCOLDE2:
.text
.LHOTE2:
.section .rodata.cst4,"aM",@progbits,4
.align 4
.LC1:
.long 1065353216
.ident "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609"
.section .note.GNU-stack,"",@progbits
The problem is that the complex type is not SIMD friendly. I have never been a fan of the complex type because it's a composite object that usually does not map to a primitive type or a single operation in hardware (certainly not on x86 hardware).

In order to make complex arithmetic SIMD friendly you need to operate on multiple complex numbers simultaneously. For SSE you need to operate on four complex numbers at once.
We can use GCC's vector extensions to make the syntax easier.
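For example, a vector of four floats (the typedef name v4sf is just the convention used here):

typedef float v4sf __attribute__ ((vector_size (16)));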
Then we can declare a union of an array and the vector extension:
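For instance (the union lets us address individual elements through e while still getting vector arithmetic through v):

typedef union {
    v4sf  v;
    float e[4];
} float4;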
And lastly we define a block of four complex numbers like this
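A sketch of such a block, keeping the real and imaginary parts in separate vectors:

typedef struct {
    float4 x;   // four real parts
    float4 y;   // four imaginary parts
} complex4;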
where x is four real parts and y is four imaginary components. Once we have this we can multiply four complex numbers at once like this:
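A possible implementation, applying the usual (ac - bd, ad + bc) rule to all four lanes at once:

static inline complex4 complex4_mul(complex4 a, complex4 b) {
    complex4 r;
    r.x.v = a.x.v * b.x.v - a.y.v * b.y.v;   // real parts
    r.y.v = a.x.v * b.y.v + a.y.v * b.x.v;   // imaginary parts
    return r;
}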
and finally we get to your function, modified to operate on four complex numbers at a time:
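A sketch of that function; it assumes the data has already been rearranged into the split complex4 layout, that n counts blocks of four, and that the caller combines the four per-lane partial products afterwards:

complex4 f4(complex4 x[], int n) {
    complex4 p;
    v4sf ones  = {1.0f, 1.0f, 1.0f, 1.0f};
    v4sf zeros = {0.0f, 0.0f, 0.0f, 0.0f};
    p.x.v = ones;    // four running products, each starting at 1 + 0i
    p.y.v = zeros;
    for (int i = 0; i < n; i++)
        p = complex4_mul(p, x[i]);
    return p;
}

After the loop you are left with four independent partial products, which three ordinary complex multiplications reduce to the final result.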
Looking at the assembly (Intel syntax) of the inner loop to check that it's optimal: it contains exactly four 4-wide multiplications, one 4-wide addition, and one 4-wide subtraction. The variable p stays in registers and only the array x is loaded from memory, just like we want.

Let's look at the algebra for the product of two complex numbers:
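For two factors a + b*i and c + d*i the product is

(a + b*i) * (c + d*i) = (a*c - b*d) + (a*d + b*c)*i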
That's exactly four multiplications, one addition, and one subtraction.
As I explained in this answer, efficient SIMD is often algebraically identical to the scalar arithmetic. So we have replaced four 1-wide multiplications, one addition, and one subtraction with four 4-wide multiplications, one addition, and one subtraction. That's the best you can do with 4-wide SIMD: four for the price of one.
Note that this needs no instructions beyond SSE, and no additional instruction set (except for FMA4) will do any better, so on a 64-bit system you can simply compile with -O3 (SSE is part of the x86-64 baseline).

It is trivial to extend this to 8-wide SIMD with AVX.
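A sketch of the 8-wide types following the same pattern (compile with -mavx; the names are again just illustrative):

typedef float v8sf __attribute__ ((vector_size (32)));

typedef union {
    v8sf  v;
    float e[8];
} float8;

typedef struct {
    float8 x;   // eight real parts
    float8 y;   // eight imaginary parts
} complex8;

static inline complex8 complex8_mul(complex8 a, complex8 b) {
    complex8 r;
    r.x.v = a.x.v * b.x.v - a.y.v * b.y.v;
    r.y.v = a.x.v * b.y.v + a.y.v * b.x.v;
    return r;
}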
One major advantage of using GCC's vector extensions is that you get FMA without any additional effort. E.g. if you compile with -O3 -mfma4, the compiler contracts the multiply/add and multiply/subtract pairs in the main loop into FMA4 instructions (vfmaddps and vfmsubps), cutting the arithmetic instruction count per iteration from six to four.