In my code I am computing the integral of
y = x^2 - 4x + 6
I use SSE, which lets me operate on 4 values at a time. I wrote a program that evaluates this function at sample points from 0 to 5, split into four 4-element vectors n1, n2, n3, n4.
.data
n1: .float 0.3125,0.625,0.9375,1.25
n2: .float 1.5625,1.875,2.1875,2.5
n3: .float 2.8125,3.12500,3.4375,3.75
n4: .float 4.0625,4.37500,4.6875,5
szostka: .float 6,6,6,6
czworka: .float 4,4,4,4
.text
.global main
main:
movups (n1),%xmm0
mulps %xmm0,%xmm0
movups (szostka),%xmm2
addps %xmm2,%xmm0
movups (n1),%xmm1
movups (czworka),%xmm2
mulps %xmm2,%xmm1
subps %xmm1,%xmm0
movups %xmm0,%xmm7
movups (n2),%xmm0
mulps %xmm0,%xmm0
movups (szostka),%xmm2
addps %xmm2,%xmm0
movups (n2),%xmm1
movups (czworka),%xmm2
mulps %xmm2,%xmm1
subps %xmm1,%xmm0
movups %xmm0,%xmm6
movups (n3),%xmm0
mulps %xmm0,%xmm0
movups (szostka),%xmm2
addps %xmm2,%xmm0
movups (n3),%xmm1
movups (czworka),%xmm2
mulps %xmm2,%xmm1
subps %xmm1,%xmm0
movups %xmm0,%xmm5
movups (n4),%xmm0
mulps %xmm0,%xmm0
movups (szostka),%xmm2
addps %xmm2,%xmm0
movups (n4),%xmm1
movups (czworka),%xmm2
mulps %xmm2,%xmm1
subps %xmm1,%xmm0
movups %xmm0,%xmm4
mov $1,%eax
mov $0,%ebx
int $0x80
At the end, I have 4 vectors in registers xmm7, xmm6, xmm5, xmm4. To compute the integral, I need to add the vectors to each other (which is easy) and then add the values within the resulting vector to each other.
How should I do this?
As Paul R said in a comment, you can use haddps for horizontal ops within a vector, at the end.

Your code looks inefficient. If you're going to fully unroll, instead of using a loop and an accumulator, you can use a different register in the first place for each copy, instead of having a movups %xmm0,%xmmX at the end of every block.

Also, keep (szostka) and (czworka) in a register across iterations. Don't reload them every time. Similarly, replace movups (n1),%xmm1 with movups %xmm0,%xmm1 (before you square %xmm0). On IvyBridge and later, the register-renaming stage handles reg-reg moves, and they happen with zero latency.

If you did need to load (szostka) every time, it would be better to use addps with a memory operand, instead of a separate move and add. Micro-fusion could keep that operation as a single uop.

Check out http://agner.org/optimize/ for docs on how to optimize assembly. You might find it more useful to use intrinsics, to let the compiler take care of small details like register allocation, instead of writing in asm directly.