I checked out Visual Studio 2012's assembly output from the following SIMD code:
float *end = arr + sz;
float *b = other.arr;
for (float *a = arr; a < end; a += 4, b += 4)
{
    __m128 ax = _mm_load_ps(a);
    __m128 bx = _mm_load_ps(b);
    ax = _mm_add_ps(ax, bx);
    _mm_store_ps(a, ax);
}
The loop body is:
$LL11@main:
    movaps xmm1, XMMWORD PTR [eax+ecx]
    addps xmm1, XMMWORD PTR [ecx]
    add ecx, 16 ; 00000010H
    movaps XMMWORD PTR [ecx-16], xmm1
    cmp ecx, edx
    jb SHORT $LL11@main
Why increment ecx by 16, only to subtract 16 again when storing through it on the very next line?
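Well, there are basically two options here. Option 1, which is what the compiler emitted:
    add ecx, 16
    movaps XMMWORD PTR [ecx-16], xmm1
    cmp ecx, edx
    jb SHORT $LL11@main
or option 2, which stores first and increments afterwards:
    movaps XMMWORD PTR [ecx], xmm1
    add ecx, 16
    cmp ecx, edx
    jb SHORT $LL11@main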
In option 1 you have a potential stall between add and movaps. In option 2 you have a potential stall between add and cmp. However, there is also the issue of which execution unit is used. add and cmp (= sub) use the ALU, while the [ecx-16] uses the AGU (Address Generation Unit), I believe. So I suspect there might be a slight win in option 1 because ALU use is interleaved with AGU use.
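To make the two orderings concrete, here is a rough source-level sketch of both. The function names are made up for illustration, and the comments mapping each statement to an instruction are only approximate: the optimizer is free to emit the same code for both, so this shows that the two orderings are equivalent, not how to force either one.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE: _mm_load_ps, _mm_add_ps, _mm_store_ps

// Hypothetical helper, roughly option 2: store through the pointer, then advance it.
// Assumes both arrays are 16-byte aligned and n is a multiple of 4.
void add_store_then_advance(float *a, const float *b, std::size_t n)
{
    assert(n % 4 == 0);
    float *end = a + n;
    while (a < end)
    {
        __m128 ax = _mm_add_ps(_mm_load_ps(a), _mm_load_ps(b));
        _mm_store_ps(a, ax);      // store at the current address
        a += 4;                   // then advance by 4 floats (16 bytes)
        b += 4;
    }
}

// Hypothetical helper, roughly option 1: advance the pointer first, then store
// at an offset of -4 floats (-16 bytes), as in the compiler's [ecx-16].
void add_advance_then_store(float *a, const float *b, std::size_t n)
{
    assert(n % 4 == 0);
    float *end = a + n;
    while (a < end)
    {
        __m128 ax = _mm_add_ps(_mm_load_ps(a), _mm_load_ps(b));
        a += 4;                   // advance first
        b += 4;
        _mm_store_ps(a - 4, ax);  // store to the element we just left
    }
}
```

Both functions write the same values to the same addresses. The only difference is whether the pointer increment comes before or after the store, and that is exactly the freedom the compiler uses when it schedules the add ahead of the movaps.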