Consider the following loop, in which I set every element of an (aligned) array of complex numbers to zero. I want to use SIMD to speed it up:
constexpr auto alignment = 16u;
struct alignas(alignment) Complex { double re; double im; };
// ...
constexpr auto size = 32u;
auto* cv1 = static_cast<Complex*>(aligned_alloc(alignment, size * sizeof(Complex)));
#pragma omp simd
#pragma vector aligned
for (auto i = 0u; i < size; ++i) {
    cv1[i] = Complex{0.0, 0.0}; // THIS IS THE PROBLEMATIC LINE
}
I am using #pragma omp simd to generate SIMD instructions, plus Intel's #pragma vector aligned to tell the compiler that my memory is aligned.
If I enable vectorization reports, the compiler displays the following message (see here on godbolt):
remark #15328: vectorization support: non-unit strided load was emulated for the variable <U5_V>, stride is 16
...
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 108
remark #15477: vector cost: 119.500
remark #15478: estimated potential speedup: 0.900
remark #15485: serialized function calls: 1
remark #15488: --- end vector cost summary ---
...
remark #15489: --- begin vector function matching report ---
remark #15490: Function call: ?1memset with simdlen=4, actual parameter types: (vector,uniform,uniform) [ <source>(26,9) ]
remark #26037: Library function call [ <source>(26,9) ]
remark #15493: --- end vector function matching report ---
Apparently the non-unit strided load hampers proper vectorization and the estimated speedup is less than 1. Now let's write the loop like this:
constexpr auto zero = Complex{0.0, 0.0};
#pragma omp simd
#pragma vector aligned
for (auto i = 0u; i < size; ++i) {
    cv1[i] = zero;
}
Instead of assigning Complex{...} within the loop, I create a constant first and then assign it within the loop (see it on godbolt).
Now the compiler reports:
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 6
remark #15477: vector cost: 1.500
remark #15478: estimated potential speedup: 3.910
remark #15488: --- end vector cost summary ---
which is what I would expect for such a simple loop.
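For reference, here is a minimal self-contained sketch that puts the two variants side by side. The function names and the `allocate` helper are hypothetical and only serve the illustration. The Intel-specific `#pragma vector aligned` is left out so the sketch compiles with other compilers too, and the buffer is sized as `n * sizeof(Complex)` bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

constexpr std::size_t alignment = 16;
struct alignas(alignment) Complex { double re; double im; };

// Variant 1: a temporary Complex is constructed inside the loop body.
void zero_with_temporary(Complex* data, std::size_t n) {
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = Complex{0.0, 0.0};
    }
}

// Variant 2: a constant is created once and copied inside the loop.
void zero_with_constant(Complex* data, std::size_t n) {
    constexpr auto zero = Complex{0.0, 0.0};
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = zero;
    }
}

// Hypothetical helper: allocates n aligned elements.
// std::aligned_alloc expects the byte count to be a multiple of the alignment.
Complex* allocate(std::size_t n) {
    assert((n * sizeof(Complex)) % alignment == 0);
    return static_cast<Complex*>(std::aligned_alloc(alignment, n * sizeof(Complex)));
}
```

Both functions leave the same bytes in memory, so any difference between them lies purely in the generated code, not in the result.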
Can anyone explain why this happens? Shouldn't the results be identical for both cases?
What I have understood so far is that the compiler tries to be smart and sees that the assignment to cv1[i] could be replaced by a call to memset, which seems to impair optimization (quick verification: replace 0.0 by some other number). Is there a way to disable this "optimization"?
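The quick verification mentioned above could look like the following sketch. The function name `fill_with_one` is hypothetical. The only change from the problematic loop is the nonzero real part: the bit pattern of 1.0 is not a single repeated byte, so the store cannot be expressed as a plain memset. Whether the vectorization report actually changes depends on the compiler version and flags. I am assuming it does, based on the observation above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

constexpr std::size_t alignment = 16;
struct alignas(alignment) Complex { double re; double im; };

// Same loop shape as the problematic one, but the assigned value is nonzero,
// so a byte-wise memset cannot reproduce it.
void fill_with_one(Complex* data, std::size_t n) {
    assert(data != nullptr);
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = Complex{1.0, 0.0};
    }
}
```

Comparing the vectorization report of this loop with the one for `Complex{0.0, 0.0}` should show whether the memset substitution is what triggers the slowdown.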