How to add the values within a vector to each other

In my code I compute the integral of

y=x^2-4x+6

I used SSE, which lets me operate on 4 values at a time. I wrote a program that evaluates this integral for values from 0 to 5, divided into four 4-element vectors n1, n2, n3, n4.

.data
n1: .float 0.3125,0.625,0.9375,1.25
n2: .float 1.5625,1.875,2.1875,2.5
n3: .float 2.8125,3.12500,3.4375,3.75
n4: .float 4.0625,4.37500,4.6875,5
szostka: .float 6,6,6,6
czworka: .float 4,4,4,4
.text
.global main
main:  
        movups (n1),%xmm0

        mulps %xmm0,%xmm0
        movups (szostka),%xmm2
        addps %xmm2,%xmm0
        movups (n1),%xmm1
        movups (czworka),%xmm2
        mulps %xmm2,%xmm1
        subps %xmm1,%xmm0
        movups %xmm0,%xmm7

        movups (n2),%xmm0

        mulps %xmm0,%xmm0
        movups (szostka),%xmm2
        addps %xmm2,%xmm0
        movups (n2),%xmm1
        movups (czworka),%xmm2
        mulps %xmm2,%xmm1
        subps %xmm1,%xmm0
        movups %xmm0,%xmm6

        movups (n3),%xmm0

        mulps %xmm0,%xmm0
        movups (szostka),%xmm2
        addps %xmm2,%xmm0
        movups (n3),%xmm1
        movups (czworka),%xmm2
        mulps %xmm2,%xmm1
        subps %xmm1,%xmm0
        movups %xmm0,%xmm5

        movups (n4),%xmm0

        mulps %xmm0,%xmm0
        movups (szostka),%xmm2
        addps %xmm2,%xmm0
        movups (n4),%xmm1
        movups (czworka),%xmm2
        mulps %xmm2,%xmm1
        subps %xmm1,%xmm0
        movups %xmm0,%xmm4

        mov $1,%eax
        mov $0,%ebx
        int $0x80 

In the end, I have 4 vectors in registers xmm7, xmm6, xmm5, xmm4. To compute the integral, I need to add the vectors to each other (which is easy) and then also add the values within the resulting vector to each other.
How should I do this?

There is 1 answer

Peter Cordes

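As Paul R said in a comment, you can use haddps (SSE3) for horizontal ops within a vector, at the end. After adding the four result vectors together with addps, two haddps instructions with the same register as source and destination leave the sum of all four elements in every element of that register.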
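
A minimal sketch of that reduction, written with C intrinsics rather than assembly (a choice made for this sketch, not something from the question). _mm_add_ps and _mm_hadd_ps compile to addps and haddps. The helper names hsum_ps and sum4_ps are hypothetical, and the target attribute is a GCC/Clang extension that is assumed here so the code builds without -msse3.

```c
#include <pmmintrin.h>  /* SSE3: _mm_hadd_ps (also pulls in the SSE1/SSE2 headers) */

/* Sum the four float lanes of v and return the scalar total.
   Hypothetical helper; the target attribute (GCC/Clang) enables SSE3
   for this function only. */
__attribute__((target("sse3")))
static float hsum_ps(__m128 v)
{
    v = _mm_hadd_ps(v, v);    /* lanes: v0+v1, v2+v3, v0+v1, v2+v3 */
    v = _mm_hadd_ps(v, v);    /* every lane: v0+v1+v2+v3 */
    return _mm_cvtss_f32(v);  /* extract lane 0 */
}

/* Add four vectors element-wise (vertical adds), then reduce the
   result horizontally. Corresponds to combining xmm7, xmm6, xmm5, xmm4. */
__attribute__((target("sse3")))
static float sum4_ps(__m128 a, __m128 b, __m128 c, __m128 d)
{
    __m128 t = _mm_add_ps(_mm_add_ps(a, b), _mm_add_ps(c, d));
    return hsum_ps(t);
}
```

Doing the vertical adds first means only one horizontal reduction is needed, which matters because haddps is slower than a plain addps.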

Your code looks inefficient. If you're going to fully unroll, instead of using a loop and an accumulator, you can use a different register in the first place for each copy, instead of having a movups %xmm0,%xmmX at the end of every block.

Also, keep (szostka) and (czworka) in a register across iterations. Don't reload them every time. Similarly, replace movups (n1),%xmm1 with movups %xmm0, %xmm1 (before you square %xmm0). On IvyBridge and later, the register-renaming stage handles reg-reg moves, and they happen with zero latency.

If you did need to load (szostka) every time, it would be better to use addps with a memory operand, instead of a separate move and add. Micro-fusion could keep that operation as a single uop.

Check out http://agner.org/optimize/ for docs on how to optimize assembly. You might find it more useful to use intrinsics, to let the compiler take care of small details like register allocation, instead of writing in asm directly.
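
A rough sketch of what the whole computation could look like with intrinsics, assuming the 16 sample points from the question are stored in one array. The names xs, sum_poly and integrate are hypothetical. The constants are set up once outside the loop, each input vector is loaded once and reused, and a single accumulator replaces the four result registers. The final reduction here uses SSE1 shuffles instead of haddps, so that the file builds without SSE3 enabled; that is a substitution made for this sketch.

```c
#include <xmmintrin.h>  /* SSE1: __m128 and the _mm_*_ps intrinsics */

/* Sample points from the question: 16 values with step 0.3125. */
static const float xs[16] = {
    0.3125f, 0.625f, 0.9375f, 1.25f,
    1.5625f, 1.875f, 2.1875f, 2.5f,
    2.8125f, 3.125f, 3.4375f, 3.75f,
    4.0625f, 4.375f, 4.6875f, 5.0f
};

/* Sum of y = x^2 - 4x + 6 over n floats. Assumes n is a multiple of 4. */
static float sum_poly(const float *x, int n)
{
    const __m128 four = _mm_set1_ps(4.0f);  /* kept in a register across iterations */
    const __m128 six  = _mm_set1_ps(6.0f);
    __m128 acc = _mm_setzero_ps();          /* running vector sum */

    for (int i = 0; i < n; i += 4) {
        __m128 v  = _mm_loadu_ps(x + i);    /* one load per input vector */
        __m128 sq = _mm_mul_ps(v, v);       /* x^2 */
        __m128 y  = _mm_add_ps(_mm_sub_ps(sq, _mm_mul_ps(four, v)), six);
        acc = _mm_add_ps(acc, y);
    }

    /* Horizontal sum of acc using SSE1 shuffles. */
    __m128 hi = _mm_movehl_ps(acc, acc);    /* lanes 2,3 copied into lanes 0,1 */
    __m128 s  = _mm_add_ps(acc, hi);        /* lane 0 = a0+a2, lane 1 = a1+a3 */
    __m128 s1 = _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 1, 1, 1));  /* broadcast lane 1 */
    s = _mm_add_ss(s, s1);                  /* lane 0 = a0+a1+a2+a3 */
    return _mm_cvtss_f32(s);
}

/* Right-endpoint Riemann estimate: step width times the sum of samples. */
static float integrate(void)
{
    return 0.3125f * sum_poly(xs, 16);
}
```

Note that the sum of the function values still has to be multiplied by the step width (0.3125 here) to approximate the integral. The compiler picks the registers, so there is no trailing register-to-register copy per block.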