For some reason serial code runs faster than SIMD code


For some reason running the simple serial code

for (i = 0; i < 1152*1152; i++) {
    MatrixA3[i] = MatrixA1[i] + z*MatrixA2[i];
}

runs faster than, or at the same speed as, the vectorized equivalent:

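for (int i = 0; i < 1152*1152; i += 4) {
    /* aligned loads of 4 doubles each; pointers must be 32-byte aligned */
    load_data1 = _mm256_load_pd(MatrixA1 + i);
    load_data2 = _mm256_load_pd(MatrixA2 + i);
    /* fused multiply-add: load_z * load_data2 + load_data1 */
    _mm256_store_pd(MatrixA3 + i,
                    _mm256_fmadd_pd(load_z, load_data2, load_data1));
}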

On my Intel i7-4578U with Intel compiler XE 15.0, the former runs in 1.507 milliseconds while the latter finishes in 1.513 milliseconds, with 10000 runs.

In my experience AVX2 intrinsics give a significant speedup, but for some reason this loop fails to benefit. What am I doing wrong, please?


There is 1 answer

RamblingMad On

What are you doing wrong? Not trusting your compiler.

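This is not a case for manual optimization; any respectable compiler can vectorize that loop on its own, which is why the serial version and the intrinsics version end up running at the same speed.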
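
To illustrate the point, here is a minimal sketch of the serial loop written in a form that compilers can usually auto-vectorize when optimization is enabled (for example `-O3 -march=native` with GCC or Clang). The function name `scaled_add` and its parameters are hypothetical, not from the question, and the `restrict` qualifiers are an assumption that the three arrays never overlap.

```c
#include <stddef.h>

/* Hypothetical helper (name not from the question):
   computes a3[i] = a1[i] + z * a2[i] for i in [0, n).
   `restrict` promises the compiler that the arrays do not overlap,
   so it does not need a runtime aliasing check before vectorizing. */
void scaled_add(double *restrict a3, const double *restrict a1,
                const double *restrict a2, double z, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a3[i] = a1[i] + z * a2[i];
    }
}
```

Calling `scaled_add(MatrixA3, MatrixA1, MatrixA2, z, 1152*1152)` would stand in for the original loop. Whether a given build actually emits vector instructions is best confirmed from the compiler's vectorization report or the generated assembly rather than assumed.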