OpenMP parallelization is highly inefficient

I'm numerically solving a 2-D Poisson equation on roughly a 200x200 grid, and I'm trying to implement the diagonal (wavefront) method to enable parallelism:

#include <math.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int xf = -1;
int xl = 1;
int yf = -1;
int yl = 1;

double delta = 0.01;

double q(int i, int j) {
  return (4 - 2 * (pow(xf + j * delta, 2) + pow(yl - i * delta, 2))) *
         pow(delta, 2);
}

double actual(double x, double y, int n) {
  return (pow(x, 2) - 1) * (pow(y, 2) - 1);
}

double error(double **mk, double **new, int n) {
  double sum1 = 0;
  double sum2 = 0;
  double mf;
  for (int i = 0; i < n; ++i) {
    for (int j = 0; j < n; ++j) {
      mf = mk[i][j];
      sum1 += pow(mf, 2);
      sum2 += pow(mf - new[i][j], 2);
    }
  }
  return pow((double)sum2 / sum1, 0.5);
}

int main(void) {

  int n = (xl - xf) / delta + 1;

  double **phiactual = (double **)malloc(n * sizeof(double *));
  for (int i = 0; i < n; i++)
    phiactual[i] = (double *)malloc(n * sizeof(double));

  double **phi = (double **)malloc(n * sizeof(double *));
  for (int i = 0; i < n; i++)
    phi[i] = (double *)malloc(n * sizeof(double));

  int i, j, d;

  int iter;
#pragma omp parallel shared(phi, phiactual, delta)
  {
    iter = 0;
#pragma omp for collapse(2)
    for (int q = 0; q < n; ++q) {
      for (int w = 0; w < n; ++w) {
        phiactual[q][w] = actual(xf + w * delta, yl - q * delta, delta);
      }
    }

    while (error(phiactual, phi, n) > 0.01) {
      for (int g = 0; g < 2 * n - 5; ++g) {
        if (g < n - 3) {
          i = 1;
          j = g + 1;
          d = g + 1;
        } else {
          i = g - n + 4;
          j = n - 2;
          d = 2 * n - 5 - g;
        }

#pragma omp for
        for (int k = 0; k < d; ++k) {
          phi[i][j] = 0.25 * (((double)(phi[i + 1][j] + phi[i][j + 1] +
                                        phi[i - 1][j] + phi[i][j - 1])) +
                              q(i, j));
          ++i;
          --j;
        }
      }
      iter++;
    }
#pragma omp single
    printf("%i\n", iter);
  }

  for (int i = 0; i < n; i++)
    free(phi[i]);
  free(phi);

  for (int i = 0; i < n; i++)
    free(phiactual[i]);
  free(phiactual);

  return 0;
}

The serial code takes 1m12s, while the parallel version just runs indefinitely. I used the schedule clause to try to tune for cache behavior, but that does not help either.

A smaller version (20x20 grid) with schedule(dynamic,1024) works, but with chunks that large it essentially runs serially again.

I'm not sure where this inefficiency arises. Comments on the efficiency of the serial code are also welcome.

There are 3 answers

Answer by Marcus Müller

I'm actually surprised it's not worse. Look at what you ask OpenMP to parallelize:

for (int k = 0; k < d; ++k) {
  phi[i][j] = 0.25 * (((double)(phi[i + 1][j] + phi[i][j + 1] +
                                phi[i - 1][j] + phi[i][j - 1])) +
                      q(i, j));
  ++i;
  --j;

You ask to run the iterations of this for loop on different threads, but the calculations depend on values from the previous iteration! (The current iteration's phi[i+1][j] is the next iteration's phi[i][j+1].)

So, what you've built is an algorithm that is impossible to parallelize in this straightforward manner. It's sequential, at least in the way you stated it.
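
To make that concrete, here is a small standalone sketch (separate from the question's program, with a made-up diagonal start) that walks one diagonal exactly the way the k loop does and prints which element each iteration writes and which it reads. It shows both halves of the problem: consecutive iterations touch the same phi element, and i and j themselves are state carried from one iteration to the next.

#include <stdio.h>

int main(void) {
  int i = 1, j = 5, d = 5;  /* hypothetical diagonal: starts at (1,5), 5 points long */
  for (int k = 0; k < d; ++k) {
    printf("k=%d writes phi[%d][%d], reads phi[%d][%d] phi[%d][%d] "
           "phi[%d][%d] phi[%d][%d]\n",
           k, i, j,
           i + 1, j,    /* same element as iteration k+1's phi[i][j+1] */
           i, j + 1,
           i - 1, j,
           i, j - 1);
    ++i;                 /* i and j are mutated here, so iteration k+1 */
    --j;                 /* needs the values iteration k leaves behind */
  }
  return 0;
}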

Answer by Craig Estey

Comments on the efficiency of the serial code are also welcome.

You do not have 2D arrays of double with dimensions NxN. You have a 1D array of N pointers, each pointing to a 1D array of N doubles.

Real 2D arrays might be faster.

Also, pow(x, 2) is slower than x * x, and pow(x, 0.5) can simply be sqrt(x).

Also, I force-inlined some simple functions.

Edit: See the UPDATE below for an ~2x speedup.


Here is the refactored code. I added some timing/benchmarking:

#include <math.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define inline_always   static inline __attribute__((always_inline))

inline_always double
pow2(double x,int y)
{

#if 0
    return pow(x,y);
#else
    return x * x;
#endif
}

inline_always double
pow5(double x,double y)
{

#if 0
    return pow(x,y);
#else
    return sqrt(x);
#endif
}

int xf = -1;
int xl = 1;
int yf = -1;
int yl = 1;

int N;

double
tscgetf(void)
{
    struct timespec ts;
    double sec;

    clock_gettime(CLOCK_MONOTONIC,&ts);

    sec = ts.tv_nsec;
    sec /= 1e9;
    sec += ts.tv_sec;

    return sec;
}

double delta = 0.01;

double
q(int i, int j)
{
    return (4 - 2 * (pow2(xf + j * delta, 2) + pow2(yl - i * delta, 2))) * pow2(delta, 2);
}

inline_always double
actual(double x, double y, int n)
{
    return (pow2(x, 2) - 1) * (pow2(y, 2) - 1);
}

double
error(int N, double (*mk)[N], double (*new)[N])
{
    double sum1 = 0;
    double sum2 = 0;
    double mf;

    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            mf = mk[i][j];
            sum1 += pow2(mf, 2);
            sum2 += pow2(mf - new[i][j], 2);
        }
    }

    return pow5((double) sum2 / sum1, 0.5);
}

int
main(void)
{

    int N = (xl - xf) / delta + 1;

#if 0
    double **phiactual = (double **) malloc(n * sizeof(double *));
    for (int i = 0; i < n; i++)
        phiactual[i] = (double *) malloc(n * sizeof(double));
#else
    double (*phiactual)[N] = malloc(N * N * sizeof(double));
#endif

#if 0
    double **phi = (double **) malloc(n * sizeof(double *));
    for (int i = 0; i < n; i++)
        phi[i] = (double *) malloc(n * sizeof(double));
#else
    double (*phi)[N] = malloc(N * N * sizeof(double));
#endif

    int i, j, d;

    int iter;

    double tscbeg = tscgetf();

#pragma omp parallel shared(phi, phiactual, delta)
    {
        iter = 0;
#pragma omp for collapse(2)
        for (int q = 0; q < N; ++q) {
            for (int w = 0; w < N; ++w) {
                phiactual[q][w] = actual(xf + w * delta, yl - q * delta, delta);
            }
        }

        while (error(N, phiactual, phi) > 0.01) {
            for (int g = 0; g < 2 * N - 5; ++g) {
                if (g < N - 3) {
                    i = 1;
                    j = g + 1;
                    d = g + 1;
                }
                else {
                    i = g - N + 4;
                    j = N - 2;
                    d = 2 * N - 5 - g;
                }

#pragma omp for
                for (int k = 0; k < d; ++k) {
                    phi[i][j] = 0.25 * (((double) (phi[i + 1][j] + phi[i][j + 1] + phi[i - 1][j] + phi[i][j - 1])) + q(i, j));
                    ++i;
                    --j;
                }
            }
            iter++;
        }
#pragma omp single
        {
            double tscend = tscgetf();
            printf("%i\n", iter);
            printf("ELAPSED: %.9f\n", tscend - tscbeg);
        }
    }

#if 0
    for (int i = 0; i < n; i++)
        free(phi[i]);
#endif
    free(phi);

#if 0
    for (int i = 0; i < n; i++)
        free(phiactual[i]);
#endif
    free(phiactual);

    return 0;
}

In the code above, I've used cpp conditionals to denote old vs. new code:

#if 0
// old code
#else
// new code
#endif

#if 1
// new code
#endif

Note: this can be cleaned up by running the file through unifdef -k


On my system, the elapsed time for the code (in seconds):

  • 7.141084219 original
  • 5.819279067 refactored

UPDATE:

By adding some debug tracing, I noticed that the sequence of values from q is repeated for each iteration of the while (error(...)) loop.

So, we can cache those values in an array.

Here is the updated code (compile with -DQCACHE=1):

// fix2 -- debug, mp/np row pointers, Q caching
#include <math.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#if DEBUG
#define dbgprt(_fmt...) \
    fprintf(stderr,_fmt)
#else
#define dbgprt(_fmt...) \
    do { } while (0)
#endif

#define inline_always   static inline __attribute__((always_inline))

inline_always double
pow2(double x,int y)
{

#if 0
    return pow(x,y);
#else
    return x * x;
#endif
}

inline_always double
pow5(double x,double y)
{

#if 0
    return pow(x,y);
#else
    return sqrt(x);
#endif
}

int xf = -1;
int xl = 1;
int yf = -1;
int yl = 1;

double
tscgetf(void)
{
    struct timespec ts;
    double sec;

    clock_gettime(CLOCK_MONOTONIC,&ts);

    sec = ts.tv_nsec;
    sec /= 1e9;
    sec += ts.tv_sec;

    return sec;
}

double delta = 0.01;

int qidx;                               // current q index
int qmax;                               // maximum q index
double *qcache;                         // cache pointer

double
qcalc(int i, int j)
{
    double qval =
        (4 - 2 * (pow2(xf + j * delta, 2) + pow2(yl - i * delta, 2))) *
        pow2(delta, 2);

    dbgprt("qcalc: QCALC i=%d j=%d qval=%.15e qidx=%d\n",i,j,qval,qidx);

    return qval;
}

// qprep -- initialize Q cache
void
qprep(int N)
{
#if QCACHE
    int i;
    int j;
    int d;

    dbgprt("qprep: ENTER N=%d\n",N);

    qidx = 0;

    for (int g = 0; g < 2 * N - 5; ++g) {
        if (g < N - 3) {
            i = 1;
            j = g + 1;
            d = g + 1;
        }
        else {
            i = g - N + 4;
            j = N - 2;
            d = 2 * N - 5 - g;
        }

        dbgprt("qprep: LOOP g=%d i=%d j=%d d=%d\n",g,i,j,d);

        for (int k = 0; k < d; ++k) {
            double qval = qcalc(i,j);

            // grow cache array
            if (qidx >= qmax) {
                qmax += 100;
                qcache = realloc(qcache,sizeof(*qcache) * qmax);
                if (qcache == NULL) {
                    perror("realloc");
                    exit(1);
                }
            }

            qcache[qidx] = qval;

            ++qidx;
            ++i;
            --j;
        }
    }

    dbgprt("qprep: EXIT qidx=%d qmax=%d\n",qidx,qmax);
#endif
}

inline_always double
actual(double x, double y, int n)
{
    return (pow2(x, 2) - 1) * (pow2(y, 2) - 1);
}

double
error(int N, double (*mk)[N], double (*new)[N])
{
    double sum1 = 0;
    double sum2 = 0;
    double mf;

    for (int i = 0; i < N; ++i) {
#if 1
        const double *mp = mk[i];
        const double *np = new[i];
#endif
        for (int j = 0; j < N; ++j) {
            mf = mp[j];
            sum1 += pow2(mf, 2);
            sum2 += pow2(mf - np[j], 2);
        }
    }

    double err = pow5((double) sum2 / sum1, 0.5);

    dbgprt("error: err=%.15e sum2=%.15e sum1=%.15e\n",err,sum2,sum1);

    return err;
}

int
main(void)
{

#ifdef NDEF
    enum { N = NDEF };
#else
    int N = (xl - xf) / delta + 1;
#endif
    printf("N %d\n",N);

#if 0
    double **phiactual = (double **) malloc(n * sizeof(double *));
    for (int i = 0; i < n; i++)
        phiactual[i] = (double *) malloc(n * sizeof(double));
#else
    double (*phiactual)[N] = malloc(N * N * sizeof(double));
#endif

#if 0
    double **phi = (double **) malloc(n * sizeof(double *));
    for (int i = 0; i < n; i++)
        phi[i] = (double *) malloc(n * sizeof(double));
#else
    double (*phi)[N] = malloc(N * N * sizeof(double));
#endif

    int i, j, d;

    int iter;

    double tscbeg = tscgetf();
    double tscend;

#pragma omp parallel shared(phi, phiactual, delta)
    {
        iter = 0;
#pragma omp for collapse(2)
        for (int q = 0; q < N; ++q) {
            for (int w = 0; w < N; ++w) {
                phiactual[q][w] = actual(xf + w * delta, yl - q * delta, delta);
            }
        }

#if QCACHE
        qprep(N);
#endif

        while (error(N, phiactual, phi) > 0.01) {
            dbgprt("main: ENTER iter=%d\n",iter);
            qidx = 0;

            for (int g = 0; g < 2 * N - 5; ++g) {
                if (g < N - 3) {
                    i = 1;
                    j = g + 1;
                    d = g + 1;
                }
                else {
                    i = g - N + 4;
                    j = N - 2;
                    d = 2 * N - 5 - g;
                }

                dbgprt("main: LOOP g=%d i=%d j=%d d=%d\n",g,i,j,d);

#pragma omp for
                for (int k = 0; k < d; ++k) {
#if QCACHE
                    double qval = qcache[qidx];
                    dbgprt("main: QCACHE i=%d j=%d qval=%.15e qidx=%d\n",
                        i,j,qval,qidx);
#else
                    double qval = qcalc(i,j);
#endif
                    ++qidx;

                    phi[i][j] = 0.25 * (((double)
                        (phi[i + 1][j] + phi[i][j + 1] + phi[i - 1][j] +
                        phi[i][j - 1])) + qval);

                    ++i;
                    --j;
                }
            }
            dbgprt("main: EXIT iter=%d\n",iter);
            iter++;

            // debug stop after iteration
#if STOP
            if (iter >= STOP)
                break;
#endif
        }
#pragma omp single
        {
            tscend = tscgetf();
            printf("%i\n", iter);
            printf("ELAPSED: %.9f\n", tscend - tscbeg);
        }
    }

#if 0
    for (int i = 0; i < n; i++)
        free(phi[i]);
#endif
    free(phi);

#if 0
    for (int i = 0; i < n; i++)
        free(phiactual[i]);
#endif
    free(phiactual);

    return 0;
}

Here are some more benchmarks:

  1. orig is the original unmodified code
  2. fix1 is the code posted above
  3. fix2 is the improved code (compiled without -DQCACHE=1)
  4. fix3 is that code with -DQCACHE=1

BEST-O0 71.180772044 1.000x orig -- original code
BEST-O0 23.221787086 3.065x fix1 -- conversion to real 2D arrays, pow2, pow5
BEST-O0 25.597060688 2.781x fix2 -- debug, mp/np row pointers, Q caching
BEST-O0 15.749082142 4.520x fix3 -- fix2 with -DQCACHE=1

BEST-O2 6.540591962 1.000x orig -- original code
BEST-O2 5.494974338 1.190x fix1 -- conversion to real 2D arrays, pow2, pow5
BEST-O2 5.080663799 1.287x fix2 -- debug, mp/np row pointers, Q caching
BEST-O2 3.229317797 2.025x fix3 -- fix2 with -DQCACHE=1

BEST-O3 5.245094514 1.000x orig -- original code
BEST-O3 5.080274417 1.032x fix1 -- conversion to real 2D arrays, pow2, pow5
BEST-O3 5.080663799 1.032x fix2 -- debug, mp/np row pointers, Q caching
BEST-O3 2.988738434 1.755x fix3 -- fix2 with -DQCACHE=1

Answer by John Bollinger

The serial code takes 1m12s,

This is barely plausible if you are compiling without enabling optimization. When I do that, however, I get run times on the order of 30s. Hardware differs, of course, but that 2.5x speed difference is pretty big, so I wonder what else you may have going on.

In any event, when I enable GCC optimization level -O3, my execution time comes down to about 2.6s. Performance comparisons on un-optimized binaries are pretty pointless.

while the parallel version just runs indefinitely.

I can't confirm "indefinitely", but I do confirm that, with or without optimization, compiling with -fopenmp increased the run time to longer than I was willing to wait. There are numerous issues with your parallelization attempt, but the one that seems most directly related to that problem is the

#pragma omp parallel shared(phi, phiactual, delta)

directive. Removing that and the (then) unneeded #pragma omp single later in the code brings the runtime down to that of the version compiled for serial execution for me.

I'm uncertain what your intention was with that directive, but I'm confident that it does not mean what you think it means. The #pragma omp parallel part requests that the whole associated block be run by multiple threads in parallel, with the number of threads chosen dynamically (details of that choice are well defined, but irrelevant here). You definitely do not want that. Perhaps the shared(phi, phiactual, delta) part was your focus, but

  1. that doesn't make sense separate from a construct that produces parallel execution, and
  2. under the circumstances, shared is the default for these anyway, and
  3. it hardly matters for phi and phiactual anyway, because the data-sharing mode applies to the pointers themselves, not the data to which they point (see the sketch after this list).
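
A minimal sketch of point 3, using a hypothetical buffer a: even firstprivate only copies the pointer, so each thread's private copy still points at the same storage, and the threads all end up writing the same memory.

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
  double *a = calloc(8, sizeof *a);     /* one shared buffer */

#pragma omp parallel firstprivate(a)    /* privatizes the pointer, not the data */
  {
    int t = omp_get_thread_num();
    if (t < 8)
      a[t] = t + 1;                     /* every thread writes the same buffer */
  }

  for (int k = 0; k < 8; ++k)           /* the writes are visible here */
    printf("%g ", a[k]);
  printf("\n");

  free(a);
  return 0;
}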

Even so, your serial code is loaded down with numerous data dependencies that interfere with parallelization. @MarcusMüller discusses some of these in his answer. Your inner loop also relies on shared variables i and j, which at minimum is likely to make the results of parallel execution differ from those of serial execution. It may also contribute to lack of speedup -- not that it helps to compute a wrong answer quickly!
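
For what it's worth, the shared-counter problem itself is fixable: each point on a diagonal can derive its own indices from the loop counter k, so nothing is carried between iterations. A minimal sketch, assuming i0, j0, and d are set up for the current diagonal exactly as i, j, and d are in the original code (this removes only the shared-index hazard; the other issues discussed here still apply):

#pragma omp for
        for (int k = 0; k < d; ++k) {
          int ii = i0 + k;              /* walk down the diagonal */
          int jj = j0 - k;              /* and left along it */
          phi[ii][jj] = 0.25 * (phi[ii + 1][jj] + phi[ii][jj + 1] +
                                phi[ii - 1][jj] + phi[ii][jj - 1] +
                                q(ii, jj));
        }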

Overall, the Gauss-Seidel method is not parallelizable. At each iteration it relies on matrix elements being computed in a specific order, with the computation of later elements depending on the previously computed values of earlier elements. There may be small parts of it that can be parallelized, such as the computation of the error term, but even the maximum possible speedup available from doing that is modest.
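
To illustrate that last point, here is a minimal sketch (not the questioner's code as posted) of parallelizing just the error norm: it is a pair of independent sums, so an OpenMP reduction handles it with no ordering concerns. It assumes the flat double (*)[n] layout from Craig Estey's answer.

#include <math.h>

double error_par(int n, double (*mk)[n], double (*new)[n]) {
  double sum1 = 0.0, sum2 = 0.0;
#pragma omp parallel for collapse(2) reduction(+ : sum1, sum2)
  for (int i = 0; i < n; ++i) {
    for (int j = 0; j < n; ++j) {
      double mf = mk[i][j];
      double df = mf - new[i][j];
      sum1 += mf * mf;                  /* ||reference||^2  */
      sum2 += df * df;                  /* ||difference||^2 */
    }
  }
  return sqrt(sum2 / sum1);             /* relative error */
}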