I do not understand why the following code is not vectorized by gcc 4.4.6:
int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
    for (int i = 0; i < iSize; i++)
        pfResult[i] = pfResult[i] + pfTab[iIndex];
}
note: not vectorized: unhandled data-ref
However, if I write the following code instead:
int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
    float fTab = pfTab[iIndex];
    for (int i = 0; i < iSize; i++)
        pfResult[i] = pfResult[i] + fTab;
}
gcc succeeds in auto-vectorizing the loop.
But if I add an OpenMP directive:
int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
    float fTab = pfTab[iIndex];
    #pragma omp parallel for
    for (int i = 0; i < iSize; i++)
        pfResult[i] = pfResult[i] + fTab;
}
I get the same note again: not vectorized: unhandled data-ref
Could you please help me understand why the first and third versions are not auto-vectorized?
Second question: math functions (exp, log, etc.) do not seem to be vectorized. This code, for example,
for (int i = 0; i < iSize; i++)
    pfResult[i] = exp(pfResult[i]);
is not vectorized. Is that due to my version of gcc?
Edit: with a newer gcc (4.8.1) and OpenMP 201107 (as reported by echo |cpp -fopenmp -dM |grep -i open), I get the following notes for every kind of loop, even one as basic as this:
for (iGID = 0; iGID < iSize; iGID++)
{
    pfResult[iGID] = fValue;
}
note: not consecutive access *_144 = 5.0e-1;
note: Failed to SLP the basic block.
note: not vectorized: failed to find SLP opportunities in basic block.
Edit2:
#include <stdio.h>
#include <sys/time.h>
#include <string.h>
#include <math.h>
#include <stdlib.h>
#include <unistd.h>   /* getpagesize() */
#include <omp.h>

int main()
{
    int szGlobalWorkSize = 131072;
    int iGID = 0;
    int j = 0;
    omp_set_dynamic(0);
    // warmup
#if WARMUP
    #pragma omp parallel
    {
        #pragma omp master
        {
            printf("%d threads\n", omp_get_num_threads());
        }
    }
#endif
    printf("Pagesize=%d\n", getpagesize());
    float *pfResult = (float *)malloc(szGlobalWorkSize * 100 * sizeof(float));
    float fValue = 0.5f;
    struct timeval tim;
    gettimeofday(&tim, NULL);
    double tLaunch1 = tim.tv_sec + (tim.tv_usec / 1000000.0);
    double time = omp_get_wtime();
    int iChunk = getpagesize();
    int iSize = ((int)szGlobalWorkSize * 100) / iChunk;
    //#pragma omp parallel for
    for (iGID = 0; iGID < iSize; iGID++)
    {
        pfResult[iGID] = fValue;
    }
    time = omp_get_wtime() - time;
    gettimeofday(&tim, NULL);
    double tLaunch2 = tim.tv_sec + (tim.tv_usec / 1000000.0);
    printf("%.6lf Time1\n", tLaunch2 - tLaunch1);
    printf("%.6lf Time2\n", time);
}
Result with:
#define _OPENMP 201107
gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-15)
gcc -march=native -fopenmp -O3 -ftree-vectorizer-verbose=2 test.c -lm
I get a lot of these notes:
note: Failed to SLP the basic block.
note: not vectorized: failed to find SLP opportunities in basic block.
and note: not consecutive access *_144 = 5.0e-1;
Thanks
GCC cannot vectorise the first version of your loop because it cannot prove that pfTab[iIndex] is not contained somewhere within the memory spanned by pfResult[0] ... pfResult[iSize-1] (pointer aliasing). Indeed, if pfTab[iIndex] is somewhere within that memory, then its value is overwritten by the assignment in the loop body, and the new value must be used in the iterations that follow. You should use the restrict keyword to tell the compiler that this can never happen, and then it should happily vectorise your code.

The second version vectorises since the value is transferred to a variable with automatic storage duration. The general assumption here is that pfResult does not span the stack memory where fTab is stored (a cursory read through the C99 language specification doesn't make it clear whether that assumption is weak or whether something in the standard allows it). In practice, the address of fTab is never taken in that function, so the compiler can treat it as a loop-invariant value.

The OpenMP version does not vectorise because of the way OpenMP is implemented in GCC. It uses code outlining for the parallel regions.
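Returning to the first version, the restrict suggestion can be sketched as follows. This is a minimal sketch and not the original listing from this answer: the name MyFuncRestrict is made up for illustration, the return type is changed to void because the original function never returns a value, and I have not checked the vectorizer report for it on gcc 4.4.6.

```c
/* Sketch only: MyFuncRestrict is a hypothetical name. The loop body is
   the first version from the question; the restrict qualifiers are the
   only functional change. They promise the compiler that pfTab and
   pfResult never refer to overlapping memory. */
void MyFuncRestrict(const float *restrict pfTab, float *restrict pfResult,
                    int iSize, int iIndex)
{
    for (int i = 0; i < iSize; i++)
        pfResult[i] = pfResult[i] + pfTab[iIndex];
}
```

restrict is a C99 keyword, so older gcc releases need -std=c99 (or -std=gnu99); when compiling as C++, the GNU spelling __restrict__ is needed instead. Calling the function with arrays that do overlap breaks the promise and is undefined behaviour.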
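With outlining, the body of the parallel region is moved into a separate function, and the original function effectively becomes a call into the OpenMP runtime that passes the shared variables to that function through a pointer to a structure.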
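A hand-written approximation of what outlining does to the third version might look like the sketch below. The structure layout and the name omp_data_s are assumptions made for illustration, the real generated code goes through the OpenMP runtime and splits the iteration range between threads, and none of that is reproduced here.

```c
/* Approximation only: struct omp_data_s and its layout are assumed,
   not taken from GCC's actual output. */
struct omp_data_s {
    float *pfResult;
    int iSize;
    float fTab;   /* shared variable, reached through the struct pointer */
};

/* Stand-in for the outlined parallel region. Every access goes through
   omp_data_i, so the compiler has to consider that a store to
   omp_data_i->pfResult[i] might modify omp_data_i->fTab. */
static void MyFunc_omp_fn0(struct omp_data_s *omp_data_i)
{
    for (int i = 0; i < omp_data_i->iSize; i++)
        omp_data_i->pfResult[i] = omp_data_i->pfResult[i] + omp_data_i->fTab;
}

int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
    struct omp_data_s omp_data_o;
    omp_data_o.pfResult = pfResult;
    omp_data_o.iSize = iSize;
    omp_data_o.fTab = pfTab[iIndex];
    /* The real code hands the function and &omp_data_o to the OpenMP
       runtime; this sketch simply calls it once on the whole range. */
    MyFunc_omp_fn0(&omp_data_o);
    return 0;
}
```

The point of the sketch is only the shape of the memory accesses: the loop no longer sees a plain local float, but a field behind a pointer.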
MyFunc_omp_fn0 contains the outlined function code. The compiler is not able to prove that omp_data_i->pfResult does not point to memory that aliases omp_data_i, and specifically its member fTab.

In order to vectorise that loop, you have to make fTab firstprivate. This will turn it into an automatic variable in the outlined code, which is equivalent to your second case.
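The firstprivate suggestion can be sketched as below. This is reconstructed from the description above rather than copied from the original answer; the only additions to the third version from the question are the firstprivate(fTab) clause and a return statement, since the function is declared int.

```c
int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
    float fTab = pfTab[iIndex];
    /* firstprivate gives every thread its own copy of fTab, initialised
       from the value above, so the loop reads a plain local variable. */
    #pragma omp parallel for firstprivate(fTab)
    for (int i = 0; i < iSize; i++)
        pfResult[i] = pfResult[i] + fTab;
    return 0;   /* added in this sketch; the original has no return */
}
```

Built without -fopenmp the pragma is ignored and the function behaves like the second version. With -fopenmp, the -ftree-vectorizer-verbose output used in the question should show whether the outlined loop is now vectorised on a given gcc version.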