Why does Visual Studio increment the loop pointer before dereferencing it?


I checked out Visual Studio 2012's assembly output from the following SIMD code:

    float *end = arr + sz;
    float *b = other.arr;
    for (float *a = arr; a < end; a += 4, b += 4)
    {
        __m128 ax = _mm_load_ps(a);
        __m128 bx = _mm_load_ps(b);
        ax = _mm_add_ps(ax, bx);
        _mm_store_ps(a, ax);
    }

The loop body is:

$LL11@main:
    movaps  xmm1, XMMWORD PTR [eax+ecx]
    addps   xmm1, XMMWORD PTR [ecx]
    add ecx, 16                 ; 00000010H
    movaps  XMMWORD PTR [ecx-16], xmm1
    cmp ecx, edx
    jb  SHORT $LL11@main

Why increment ecx by 16, only to subtract 16 again in the store address on the very next line?
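For anyone who wants to reproduce the listing, here is a minimal self-contained sketch of the same loop. The names `add_in_place` and `demo_last_element` are hypothetical (the original code is a member function operating on `arr` and `other.arr`), and the sketch assumes an x86 target with SSE, 16-byte-aligned buffers, and a length that is a multiple of 4.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE: __m128, _mm_load_ps, _mm_add_ps, _mm_store_ps

// Hypothetical free-function version of the loop from the question.
// Assumes both pointers are 16-byte aligned and sz is a multiple of 4.
void add_in_place(float* arr, const float* other, std::size_t sz) {
    assert(sz % 4 == 0);
    float* end = arr + sz;
    const float* b = other;
    for (float* a = arr; a < end; a += 4, b += 4) {
        __m128 ax = _mm_load_ps(a);   // aligned load of 4 floats from a
        __m128 bx = _mm_load_ps(b);   // aligned load of 4 floats from b
        ax = _mm_add_ps(ax, bx);      // lane-wise add
        _mm_store_ps(a, ax);          // aligned store back to a
    }
}

// Small driver: adds two 8-element arrays and returns the last sum.
float demo_last_element() {
    alignas(16) float x[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    alignas(16) float y[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    add_in_place(x, y, 8);
    return x[7];
}
```

Compiling with `/FA` under MSVC, or with `-S` under GCC or Clang, writes an assembly listing. The exact instruction order and register names will vary with the compiler version, the optimization flags, and whether the target is 32-bit or 64-bit, so the output may not match the listing above.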

There are 3 answers

Igor Skochinsky On BEST ANSWER

Well, there are basically two options here.

 add ecx, 16
 movaps XMMWORD PTR [ecx-16], xmm1 ; stall for ecx?
 cmp ecx, edx
 jb loop

or

 movaps XMMWORD PTR [ecx], xmm1
 add ecx, 16
 cmp ecx, edx ; stall for ecx?
 jb loop

In option 1 there is a potential stall between add and movaps, because the store address depends on the new value of ecx. In option 2 there is a potential stall between add and cmp. However, there is also the question of which execution unit is used: add and cmp (effectively a sub) use the ALU, while computing [ecx-16] uses the AGU (Address Generation Unit), I believe. So I suspect option 1 may be a slight win, because ALU use is interleaved with AGU use.
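The two orderings differ only in scheduling, not in what gets stored. A portable sketch in plain C++ (scalar code standing in for the SSE instructions; the function names are hypothetical) makes the equivalence concrete. Note that a compiler is free to pick either schedule no matter which form the source uses, so this illustrates the addressing only, not a way to force one ordering.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Option 1: bump the pointer first, then store through (pointer - 4),
// mirroring "add ecx, 16" followed by "movaps [ecx-16], xmm1".
void add_increment_first(float* a, const float* b, std::size_t n) {
    assert(n % 4 == 0);
    float* end = a + n;
    while (a < end) {
        float sum[4];
        for (int i = 0; i < 4; ++i) sum[i] = a[i] + b[i];
        a += 4;                                            // increment first
        b += 4;
        for (int i = 0; i < 4; ++i) (a - 4)[i] = sum[i];   // store at old address
    }
}

// Option 2: store through the pointer, then bump it,
// mirroring "movaps [ecx], xmm1" followed by "add ecx, 16".
void add_store_first(float* a, const float* b, std::size_t n) {
    assert(n % 4 == 0);
    float* end = a + n;
    while (a < end) {
        float sum[4];
        for (int i = 0; i < 4; ++i) sum[i] = a[i] + b[i];
        for (int i = 0; i < 4; ++i) a[i] = sum[i];         // store first
        a += 4;                                            // then increment
        b += 4;
    }
}

// Runs both variants on the same input and checks that they agree.
bool orderings_agree() {
    std::vector<float> b = {1, 2, 3, 4, 5, 6, 7, 8};
    std::vector<float> x = {10, 20, 30, 40, 50, 60, 70, 80};
    std::vector<float> y = x;
    add_increment_first(x.data(), b.data(), x.size());
    add_store_first(y.data(), b.data(), y.size());
    return x == y && x[0] == 11.0f && x[7] == 88.0f;
}
```

Since both variants write the same values to the same addresses, any difference between them is purely a matter of how the instructions are scheduled on the hardware.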

Martin Rosenau On

Indeed this is a bit strange.

Many compilers avoid reading a register in the instruction immediately after the one that modifies it, because such code runs slower on some processors. Example:

    ; Code that runs fast:
    add ecx, 16
    mov esi, edi
    cmp ecx, edx

    ; Code doing the same that may run slower:
    mov esi, edi
    add ecx, 16
    cmp ecx, edx

For this reason, compilers often reorder assembly instructions. However, in your case this is definitely not the reason, since the movaps that reads ecx comes immediately after the add that modifies it.

Maybe the compiler's optimizer is not implemented 100% correctly and therefore performs this kind of "optimization".

Stefano Tommesani On

ADDPS has a latency of 3 cycles, plus the memory load, so the following ADD, which is much quicker, will complete before the MOVAPS store, which needs the result of ADDPS in the xmm1 register, can start.