Auto-vectorize a max function using Visual Studio 2012


I'm currently trying to run a simple "max function" scan over a large array of uint32_t values.

Using AVX2 intrinsics, it's rather straightforward :

const __m256i limit8 = _mm256_set1_epi32(limit);
for (i=0; i<TABLESIZE; i+=8)
{
    __m256i src = _mm256_loadu_si256((const __m256i*)(h+i));
            src = _mm256_max_epu32(src, limit8);
    _mm256_storeu_si256((__m256i*)(h+i), src);
}

The only important operation is _mm256_max_epu32 (vpmaxud), which efficiently does the requested work. All cells in the table are compared to a single constant.
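For reference, here is a minimal scalar sketch of what a single 32-bit lane of vpmaxud computes. The helper names are hypothetical and only serve to show that the comparison is unsigned; the signed counterpart (vpmaxsd) would pick a different winner for values with the top bit set.

```c
#include <stdint.h>

/* Hypothetical helper: one 32-bit lane of vpmaxud, i.e. an unsigned max. */
static uint32_t lane_max_u32(uint32_t a, uint32_t b)
{
    return a > b ? a : b;
}

/* For contrast only: the signed max that vpmaxsd would compute per lane. */
static int32_t lane_max_i32(int32_t a, int32_t b)
{
    return a > b ? a : b;
}
```

With a = 0x80000000 and b = 1, the unsigned version keeps a, whereas the signed version treats the same bit pattern as negative and keeps b.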

Now, using intrinsics is a bit limiting in terms of portability, and I would prefer to write an equivalent version in standard C that the compiler would vectorize automatically. After all, the inner loop seems simple enough for a cheap heuristic to recognize.

Alas, I'm failing this simple exercise, even though the VS2012 note on auto-vectorization clearly states that this function should be correctly detected :

http://blogs.msdn.com/cfs-filesystemfile.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-04-99/3007.Auto_2D00_Vectorizer_2D00_08_2D00_Cookbook.pdf

What I've tried :

for (i=0; i<TABLESIZE; ++i)
{
    if (h[i]<limit) h[i]=limit;
}

Doesn't work : contrary to the cookbook's statement, the "if" statement is the problem here : auto-vectorization fails with reason code 1100 (loop contains control flow).

for (i=0; i<TABLESIZE; ++i)
{
    h[i] = h[i] > limit ? h[i] : limit;
}

No better, although for a different reason : auto-vectorization fails with reason code 1304 (Loop includes assignments that are of different sizes), which is likely a bug, because all variables have the same type.

for (i=0; i<TABLESIZE; ++i)
{
    const U32 val = ((limit-h[i]) >> 31);
    h[i]-=limit; h[i]*=val; h[i]+=limit;
}

This one works, and is vectorized. But it's more complex, and therefore runs noticeably slower than the direct intrinsic version.
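To spell out why the arithmetic form behaves like a max, and where it stops doing so : (limit-h[i]) >> 31 is 1 when the subtraction wraps around (h[i] > limit) and 0 otherwise, but only as long as the two values differ by less than 2^31. Below is a minimal sketch comparing it with a mask-based select that has no such restriction. The function names and the U32 typedef are assumptions for illustration, and whether VS2012 actually vectorizes the mask form has not been checked here.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t U32;  /* assumed definition of the U32 type used above */

/* Arithmetic form from the question: equals max(h[i], limit) only while
   h[i] and limit differ by less than 2^31. */
static void max_arith(U32 *h, size_t n, U32 limit)
{
    size_t i;
    for (i = 0; i < n; ++i) {
        const U32 val = (limit - h[i]) >> 31;  /* 1 if the subtraction wrapped */
        h[i] -= limit; h[i] *= val; h[i] += limit;
    }
}

/* Hypothetical alternative: turn the comparison into an all-ones or
   all-zeros mask and select with bitwise operations. Valid over the
   full 32-bit range. */
static void max_mask(U32 *h, size_t n, U32 limit)
{
    size_t i;
    for (i = 0; i < n; ++i) {
        const U32 mask = (U32)0 - (U32)(h[i] < limit);  /* 0xFFFFFFFF if h[i] < limit */
        h[i] = (h[i] & ~mask) | (limit & mask);
    }
}
```

For example, with limit = 0 and h[i] = 0x90000000, the arithmetic form produces 0 because the difference exceeds 2^31, while the mask form leaves the value unchanged.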

I'm wondering if there is a way to get this simple "max" operation automatically vectorized by Visual Studio (GCC and Clang to follow).
