SIMD performance does not look right


I have been playing around with performance improvements to basic loops on my local machine. In summary, I have two big slices of float32s and want the biggest speedup for multiplying them together element-wise, using any means possible. For reference, I have a 3.7GHz AMD 12-core, running at roughly 4.1GHz.

First, the basic implementation with a single multiply inside the loop yields 4.2B ops/second.

Basic loop unrolling yielded the same result (the Go compiler's standard optimisations appear to be unrolling for me):

for i := 0; i < len(a); i += 4 {
    s0 := a[i] * b[i]
    s1 := a[i+1] * b[i+1]
    s2 := a[i+2] * b[i+2]
    s3 := a[i+3] * b[i+3]
    sum += s0 + s1 + s2 + s3
}

If I disable out-of-bounds checks in the compiler (`-gcflags=-B`), I see a large improvement, to 8.2B ops/second. The issue is that these checks are a safety measure the compiler applies by default, so I needed a way to tell the compiler they were unnecessary. This can be done with three-index (capacity-limited) sub-slices inside the loop, which gave 7.6B ops/second:

for i := 0; i < len(a) && i < len(b); i += 4 {
    aTmp := a[i : i+4 : i+4]
    bTmp := b[i : i+4 : i+4]
    s0 := aTmp[0] * bTmp[0]
    s1 := aTmp[1] * bTmp[1]
    s2 := aTmp[2] * bTmp[2]
    s3 := aTmp[3] * bTmp[3]
    sum += s0 + s1 + s2 + s3
}
I next wanted to go the SIMD route, and first implemented it via the "github.com/bjwbell/gensimd/simd" lib:

for i := 0; i < len(a); i += 4 {
    v := simd.MulF32x4(simd.F32x4{a[i], a[i+1], a[i+2], a[i+3]}, simd.F32x4{b[i], b[i+1], b[i+2], b[i+3]})
    sum += v[0] + v[1] + v[2] + v[3]
}

This should in theory be hitting the 128-bit registers to perform 4 multiplies per instruction. The results show only 1.1B ops/second, so clearly something is wrong.

I also did the same thing using cgo with C intrinsics:

go file:

C.add_arrays((*C.float)(unsafe.Pointer(&a[0])), (*C.float)(unsafe.Pointer(&b[0])), C.int(len(a)))

c file:

#include <immintrin.h>

// Multiplies a[i] * b[i] and stores the result back into a.
// len is assumed to be a multiple of 8.
void add_arrays(float* a, float* b, int len) {
    __m256 va, vb, vsum;
    for (int i = 0; i < len; i += 8) {
        va = _mm256_loadu_ps(a + i);  // loadu: Go slices are not guaranteed 32-byte aligned
        vb = _mm256_loadu_ps(b + i);
        vsum = _mm256_mul_ps(va, vb);
        _mm256_storeu_ps(a + i, vsum);
    }
}

which yielded 2.9B ops/second.

I would expect SIMD to be a multiple faster than the unrolled version. Am I coding the Go implementations wrong, or missing something? I am relatively new to this, so any advice would be great.

For Reference Code is here: https://github.com/Spotnag/go_array_performance/blob/main/performance.go

1 Answer

Answered by spotnag:

It looks like the compiler was optimising away the multiplications, since the resulting sum was never stored or used. After fixing all four tests to store and use the sum, I am now seeing results from the SIMD intrinsics that are more in line with expectations. The gensimd library just does not seem to be working, even when using the tool to generate and build.

Unrolled - Bil ops/second: 2.840884878822056

Unrolled no bound checking - Bil ops/second: 2.963691811615894

SIMD gensimd - Bil ops/second: 0.27859654205971995

SIMD Intrinsics - Bil ops/second: 4.534921160395626