I have the following two versions of the same loop:
// version 1
for(int x = 0; x < size_x; ++x)
{
    for(int y = 0; y < size_y; ++y)
    {
        data[y*size_x + x] = value;
    }
}

// version 2
for(int y = 0; y < size_y; ++y)
{
    for(int x = 0; x < size_x; ++x)
    {
        data[y*size_x + x] = value;
    }
}
I compile the above code with two compilers:
- Intel (17.0.1), compiled with:
      icc -qopenmp -O3 -qopt-report main.cpp
  Both versions are vectorized successfully.
- GCC (5.1), compiled with:
      g++ -fopenmp -ftree-vectorize -fopt-info-vec -O3 main.cpp
  Only version 2 is vectorized.
Here are my questions:
- Why does GCC fail to vectorize version 1? Is it because the inner loop in version 1 doesn't access contiguous memory?
- If the answer to the above is 'yes': is it impossible for GCC to vectorize it, or does GCC choose not to because there would be no performance benefit? If it is the latter, can I somehow force GCC to vectorize it regardless?
- For version 1, the vectorization report of the Intel compiler includes these lines:
      Loopnest Interchanged: ( 1 2 ) --> ( 2 1 )
      PERMUTED LOOP WAS VECTORIZED
  while for version 2 I only get:
      LOOP WAS VECTORIZED
  So it appears that the Intel compiler interchanges the two loops in order to vectorize the nest. Do I understand this correctly?
- Can I achieve something similar to the above with GCC?
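To make clear what I mean by "interchanged", here is a minimal sketch of my reading of the report. I am only assuming that this is roughly the transformation icc applies; I have not confirmed what it does internally. The names fill_v1, fill_v1_interchanged and interchange_matches are hypothetical helpers made up for this illustration and are not part of my real code.

```cpp
#include <vector>

// Version 1 as written: x is the outer loop, y the inner loop,
// so consecutive inner iterations store size_x elements apart.
void fill_v1(int size_x, int size_y, float value, float* data)
{
    for (int x = 0; x < size_x; ++x)
        for (int y = 0; y < size_y; ++y)
            data[y*size_x + x] = value;
}

// Same body with the two loops swapped by hand (assumed equivalent of
// "( 1 2 ) --> ( 2 1 )"): the inner loop now stores to contiguous memory.
void fill_v1_interchanged(int size_x, int size_y, float value, float* data)
{
    for (int y = 0; y < size_y; ++y)
        for (int x = 0; x < size_x; ++x)
            data[y*size_x + x] = value;
}

// Hypothetical check: both orders must leave identical buffers, because
// every iteration writes a distinct element and none depends on another.
bool interchange_matches(int size_x, int size_y)
{
    std::vector<float> a(size_x*size_y, 0.0f);
    std::vector<float> b(size_x*size_y, 0.0f);
    fill_v1(size_x, size_y, 1.0f, a.data());
    fill_v1_interchanged(size_x, size_y, 1.0f, b.data());
    return a == b;
}
```

Since the iterations are independent, swapping the loops does not change the result, which is presumably why a compiler is allowed to do it. Written this way, version 1 is of course just version 2.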
EDIT 1:
Thanks to MarcGlisse, I investigated further by creating a simplified example of my code, and realized that different combinations of data size and compilation flags give different vectorization results with GCC. At this point I am more confused, and I think it is better to create a new post to first understand how GCC vectorization works. In case someone is curious, you can check the code below and try the values 1, 2, 3, 4, 5, 6, 7 for size_x and size_y. Also try them once with MarcGlisse's compilation flags and once without. Different combinations may give different vectorization results.
void foo1(int size_x, int size_y, float value, float* data)
{
    for(int x = 0; x < size_x; ++x)
    {
        for(int y = 0; y < size_y; ++y)
        {
            data[y*size_x + x] = value;
        }
    }
}

void foo2(int size_x, int size_y, float value, float* data)
{
    for(int y = 0; y < size_y; ++y)
    {
        for(int x = 0; x < size_x; ++x)
        {
            data[y*size_x + x] = value;
        }
    }
}

int main(int argc, char** argv)
{
    int size_x = 7;
    int size_y = 7;
    int size = size_x*size_y;

    float* data1 = new float[size];
    float* data2 = new float[size];

    foo1(size_x, size_y, 1, data1);
    foo2(size_x, size_y, 1, data2);

    delete [] data1;
    delete [] data2;

    return 0;
}