GCC fails to vectorize a simple 2-level nested loop while Intel compiler succeeds

I have the following two versions of the same loop:

// version 1
for(int x = 0; x < size_x; ++x)
{
    for(int y = 0; y < size_y; ++y)
    {
        data[y*size_x + x] = value;
    }
}

// version 2
for(int y = 0; y < size_y; ++y)
{
    for(int x = 0; x < size_x; ++x)
    {
        data[y*size_x + x] = value;
    }
}

I compile the code above with two compilers:

  • Intel (17.0.1): I compile the code using: icc -qopenmp -O3 -qopt-report main.cpp. Both are vectorized successfully.

  • GCC (5.1): I compile the code using: g++ -fopenmp -ftree-vectorize -fopt-info-vec -O3 main.cpp. Only version 2 is vectorized.

Here are my questions:

  • Why does GCC fail to vectorize version 1? Is it because the inner loop in version 1 doesn't access contiguous memory?
  • If the answer to the above is 'yes': is it impossible for GCC to vectorize it, or does GCC choose not to because it expects no performance benefit? If it is the latter, can I somehow force GCC to vectorize it regardless?
  • For version 1, the vectorization report of the Intel compiler includes these lines: Loopnest Interchanged: ( 1 2 ) --> ( 2 1 ) and PERMUTED LOOP WAS VECTORIZED; for version 2 I only get: LOOP WAS VECTORIZED. So it appears that the Intel compiler interchanges the two loops in order to vectorize the nest. Do I understand this correctly?
  • Can I achieve something similar to the above with GCC?
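
For reference, this is what I understand the interchange in the Intel report to mean, done by hand. It is only a minimal sketch: the function names fill_strided, fill_interchanged and same_result are hypothetical and not part of my real code, and I am assuming the two loop orders are interchangeable here because every iteration writes a different, independent element.

```cpp
#include <vector>

// Version 1 as written: the inner loop walks over y, so consecutive
// iterations touch elements that are size_x floats apart (strided access).
void fill_strided(int size_x, int size_y, float value, float* data)
{
    for(int x = 0; x < size_x; ++x)
    {
        for(int y = 0; y < size_y; ++y)
        {
            data[y*size_x + x] = value;
        }
    }
}

// Hand-interchanged loop nest (same as version 2): the inner loop walks
// over x, so consecutive iterations touch adjacent elements.
void fill_interchanged(int size_x, int size_y, float value, float* data)
{
    for(int y = 0; y < size_y; ++y)
    {
        for(int x = 0; x < size_x; ++x)
        {
            data[y*size_x + x] = value;
        }
    }
}

// Hypothetical helper: checks that both loop orders fill the buffer
// with identical contents for the given sizes.
bool same_result(int size_x, int size_y)
{
    std::vector<float> a(size_x*size_y, 0.0f);
    std::vector<float> b(size_x*size_y, 0.0f);
    fill_strided(size_x, size_y, 1.0f, a.data());
    fill_interchanged(size_x, size_y, 1.0f, b.data());
    return a == b;
}
```

Both functions write the same set of elements, only in a different order, so swapping the loops by hand should be safe for this particular loop body. It would not be safe in general if one iteration depended on the result of another.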

EDIT 1:

Thanks to MarcGlisse, I investigated further by creating a simplified example of my code and realized that different combinations of data size and GCC compilation flags lead to different vectorization results. At this point I am more confused, and I think it is better to create a new post to first understand how GCC vectorization works. In case someone is curious, you can check the code below and try the values 1, 2, 3, 4, 5, 6, 7 for size_x and size_y. Also try them once with MarcGlisse's compilation flags and once without. Different combinations might give different vectorization results.

void foo1(int size_x, int size_y, float value, float* data)
{
    for(int x = 0; x < size_x; ++x)
    {
        for(int y = 0; y < size_y; ++y)
        {
            data[y*size_x + x] = value;
        }
    }
}

void foo2(int size_x, int size_y, float value, float* data)
{
    for(int y = 0; y < size_y; ++y)
    {
        for(int x = 0; x < size_x; ++x)
        {
            data[y*size_x + x] = value;
        }
    }
}

int main(int argc, char** argv)
{
    int size_x = 7;
    int size_y = 7;
    int size = size_x*size_y;
    float* data1 = new float[size];
    float* data2 = new float[size];

    foo1(size_x, size_y, 1, data1);
    foo2(size_x, size_y, 1, data2);

    delete [] data1;
    delete [] data2;

    return 0;
}