Why does Visual Studio increment the loop pointer before dereferencing it?

Question

300 views Asked by japreiss At 11 September 2013 at 04:31

I checked out Visual Studio 2012's assembly output from the following SIMD code:

    float *end = arr + sz;
    float *b = other.arr;
    for (float *a = arr; a < end; a += 4, b += 4)
    {
        __m128 ax = _mm_load_ps(a);
        __m128 bx = _mm_load_ps(b);
        ax = _mm_add_ps(ax, bx);
        _mm_store_ps(a, ax);
    }

The loop body is:

    $LL11@main:
        movaps  xmm1, XMMWORD PTR [eax+ecx]
        addps   xmm1, XMMWORD PTR [ecx]
        add     ecx, 16                 ; 00000010H
        movaps  XMMWORD PTR [ecx-16], xmm1
        cmp     ecx, edx
        jb      SHORT $LL11@main

Why increment ecx by 16, only to subtract 16 again when storing through it on the very next line?

There are 3 answers

Martin Rosenau On 11 September 2013 at 04:50

Indeed this is a bit strange.

Many compilers avoid reading a register in the instruction immediately after the one that modifies it, because such code runs slower on some processors. Example:

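    ; Code that runs fast: an unrelated instruction sits between the write and the read of ecx
    add ecx, 16
    mov esi, edi
    cmp ecx, edx

    ; Code doing the same that may run slower: cmp reads ecx right after add writes it
    mov esi, edi
    add ecx, 16
    cmp ecx, edx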

For this reason, compilers often change the order of the assembly instructions. However, in your case this is definitely not the reason: the store reads ecx in the instruction directly after the add that writes it.

Maybe the compiler's optimizer is simply not written 100% correctly here, and it therefore performs this kind of "optimization".

Stefano Tommesani On 11 September 2013 at 12:25

ADDPS has a latency of 3 cycles, plus the memory load, so the following ADD, which is much quicker, will complete before the next MOVAPS, which needs the result of ADDPS in the xmm1 register, can start.

Igor Skochinsky On 11 September 2013 at 11:06 (accepted answer)

Well, there are basically two options here.

    add ecx, 16
    movaps XMMWORD PTR [ecx-16], xmm1 ; stall for ecx?
    cmp ecx, edx
    jb loop

or

    movaps XMMWORD PTR [ecx], xmm1
    add ecx, 16
    cmp ecx, edx ; stall for ecx?
    jb loop

In option 1 you have a potential stall between add and movaps, because the store address depends on the freshly written ecx. In option 2 you have a potential stall between add and cmp. However, there is also the issue of which execution unit is used. add and cmp (which is a sub that discards its result) use the ALU, while computing the address [ecx-16] uses the AGU (Address Generation Unit), I believe. So I suspect there might be a slight win in option 1, because ALU use is interleaved with AGU use.
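For readers who find the two orderings easier to compare at the source level, here is a minimal C++ sketch of what each schedule does. The function names `add_store_then_increment` and `add_increment_then_store` are hypothetical, and plain scalar loops over four floats stand in for `movaps`/`addps` so that the sketch compiles without SSE headers. It assumes the element count is a multiple of 4, as the original loop does. It only shows that both orderings write the same values to the same addresses; it says nothing about which one is faster on a given CPU.

```cpp
#include <cassert>
#include <cstddef>

// Option 2: store through the pointer first, then advance it.
// Mirrors: movaps [ecx], xmm1 ; add ecx, 16
void add_store_then_increment(float* a, const float* b, std::size_t n) {
    assert(n % 4 == 0);  // assumption: whole 4-float groups only
    float* end = a + n;
    while (a < end) {
        float tmp[4];
        for (int i = 0; i < 4; ++i) tmp[i] = a[i] + b[i];  // stand-in for load + addps
        for (int i = 0; i < 4; ++i) a[i] = tmp[i];         // store at the current pointer
        a += 4;                                            // 4 floats = 16 bytes
        b += 4;  // the compiled loop instead addresses b as a fixed offset from a
    }
}

// Option 1: advance the pointer first, then store 4 floats (16 bytes) behind it.
// Mirrors: add ecx, 16 ; movaps [ecx-16], xmm1
void add_increment_then_store(float* a, const float* b, std::size_t n) {
    assert(n % 4 == 0);  // assumption: whole 4-float groups only
    float* end = a + n;
    while (a < end) {
        float tmp[4];
        for (int i = 0; i < 4; ++i) tmp[i] = a[i] + b[i];  // stand-in for load + addps
        a += 4;                                            // pointer already advanced
        b += 4;
        for (int i = 0; i < 4; ++i) a[i - 4] = tmp[i];     // same address as before the add
    }
}
```

Calling both functions on identical copies of the same input leaves the two arrays identical, which is why the compiler is free to pick either ordering purely on scheduling grounds.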
