OpenMP loop gives different result to the same serial loop

1.8k views Asked by At

I have some serial code:

double* a = malloc((1000000) * sizeof(double));
double* b = malloc((1000000) * sizeof(double));
double totalA = 0;

for (int i = 0; i < 1000000; i++) {

    if (i == 0) {
        a[i] = sin(i);
    }

    b[i] = sin(i+1);

    if (i < 1000000-1) {
        a[i+1] = b[i];
    }

    totalA += a[i];
}

The output of totalA after this serial loop is 0.232883978073.

I then have an OpenMP version of this (note: all variables are re-initialised):

double* a = malloc((1000000) * sizeof(double));
double* b = malloc((1000000) * sizeof(double));
double totalA = 0;

#pragma omp parallel for reduction(+:totalA)
for (int i = 0; i < 1000000; i++) {

    if (i == 0) {
        a[i] = sin(i);
    }

    b[i] = sin(i+1);

    if (i < 1000000-1) {
        a[i+1] = b[i];
    }

    totalA += a[i];
}

However, the output of totalA from this code is -0.733714826779.

I can't figure out for the life of me why it is different.

Thanks.


UPDATE

After some more playing around it seems that the if statements within the loop are ignored oddly. The actual statements within the if blocks are executed on all iterations of the loop (as if the if clause doesn't exist).

For example, changing the if block to:

if (i < 555555) {
    a[i+1] = b[i];
}

seems to make absolutely no difference.

I am still none the wiser as to what is going on here.

2

There are 2 answers

7
Stefan Holdermans On BEST ANSWER

Your code contains a race condition. The conflicting statements are the assignment a[i+1] = b[i]; that writes to the array a and the statement totalA += a[i]; that reads from a.

In your code there is no guarantee that the iteration that is responsible for writing to a particular location in the array is executed before the iteration that reads from that location.

To further demonstrate the issue, ordering the loop segment that contains the conflicting statements resolves the problem (but is most likely devastating for your performance):

#pragma omp parallel for ordered reduction(+:totalA)    
for (int i = 0; i < 1000000; i++) {

   if (i == 0) {
     a[i] = sin(i);
   }

  b[i] = sin(i+1);

  #pragma omp ordered
  {
    if (i < 1000000-1) {
      a[i+1] = b[i];
    }

    totalA += a[i];
  }
}

It is probably better to avoid the problem altogether and rewrite your program to get rid of the loop-carried dependency:

#define N 1000000

double* a = malloc(N * sizeof(double));
double* b = malloc(N * sizeof(double));
double totalA = 0;

a[0] = sin(0);
totalA += a[0];

#pragma omp parallel for reduction(+:totalA)
for (int i = 0; i < N - 1; i++) {
  b[i] = sin(i + 1);
  a[i + 1] = b[i];
  totalA += a[i + 1];
}

b[N - 1] = sin(N);

Finally, note that, as sin(0) ≈ 0.0, the statements

a[0] = sin(0);
totalA += a[0];

can simply be replaced by

a[0] = 0.0;
0
Z boson On

You can simplify your main loop to:

a[0] = sin(0);
for (int i = 0; i < N-1; i++) {
    b[i] = sin(i+1);
    a[i+1] = b[i];
    totalA += a[i];
}
totalA += a[N-1];
b[N-1] = sin(N);

Let's call the iterator for thread1 i1 and thread2 i2. Then when e.g. i2=i1+1 the write to a[i1+1] and the read at a[i2] will be the same address and the value read or written will depend on the order of which thread is first. This is a race condition.

A simple fix is to observe that totalA = a[0] + b[0] + b[1] + ... b[N-2].

a[0] = sin(0);
totalA += a[0];
#pragma omp parallel for reduction(+:totalA)
for (int i = 0; i < N-1; i++) {
    b[i] = sin(i+1);
    a[i+1] = b[i];
    totalA += b[i];
}
b[N-1] = sin(N);

This produces 0.232883978073 for N=1000000.