OpenMP overhead and Linux kernel version


I used a small test program to measure how well OpenMP parallelizes a recursive computation done in arbitrary precision with the MPFR/GMP libraries. As expected, OpenMP overhead makes the parallel version slower at low precision, but with enough bits the parallel version becomes faster.

The sequential loops look like this:

....
for ( i = 0; i < 1000; i++ ) {
    mpfr_set_d ( z1, 0.0, MPFR_RNDN );
    mpfr_set_d ( z2, 0.0, MPFR_RNDN );
    ...
    iter = 0;
    while ( iter < 10000 ) {
        mpfr_sqr ( tmp1, z1, MPFR_RNDN );
        mpfr_sqr ( tmp2, z2, MPFR_RNDN );
        mpfr_sub ( tr, tmp1, tmp2, MPFR_RNDN );
        mpfr_add ( tr, tr, cr, MPFR_RNDN );
        mpfr_mul_2si ( tmp3, z1, 1, MPFR_RNDN );
        ...
        iter++;
    }
}

and the parallel version:

....
omp_set_dynamic(0);
for ( i = 0; i < 10; i++ ) {
    mpfr_set_d ( z2, 0.0, MPFR_RNDN );
    mpfr_set_d ( z1, 0.0, MPFR_RNDN );
    ...
    iter = 0;
    while ( iter < 10000 ) {
#pragma omp parallel num_threads(4)
        {
            switch ( omp_get_thread_num() ) {
            case 0:
                mpfr_sqr ( tmp1, z1, MPFR_RNDN );
                mpfr_sqr ( tmp2, z2, MPFR_RNDN );
                mpfr_sub ( tr, tmp1, tmp2, MPFR_RNDN );
                mpfr_add ( tr, tr, cr, MPFR_RNDN );
                break;
            case 1:
                mpfr_mul_2si ( tmp3, z1, 1, MPFR_RNDN );
                mpfr_mul ( ti, tmp3, z2, MPFR_RNDN );
                mpfr_add ( ti, ti, ci, MPFR_RNDN );
                break;
            ...
                mpfr_mul_2si ( tti, tti, 1, MPFR_RNDN );
                break;
            }
        }
        mpfr_set ( z1, tr, MPFR_RNDN );
        mpfr_set ( z2, ti, MPFR_RNDN );
        mpfr_set ( d1, ttr, MPFR_RNDN );
        mpfr_set ( d2, tti, MPFR_RNDN );
        iter++;
    }
}
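Since the parallel version opens a new parallel region on every pass of the inner while loop, the fixed cost of entering and leaving a region can be estimated separately from the MPFR work. The following is only a minimal sketch under assumptions: `region_overhead` and `wall_seconds` are hypothetical helper names that are not part of the original program, timing uses the POSIX `clock_gettime` call, and the regions are nearly empty, so the figure is a rough estimate rather than a full model of the real loop.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Wall-clock time in seconds from the POSIX monotonic clock. */
static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/*
 * Hypothetical helper (not from the original program): opens `regions`
 * nearly empty parallel regions one after another, as the inner while
 * loop does, and returns the average wall-clock seconds per region.
 * `entries` receives the total number of thread entries, which keeps
 * the regions from being optimized away and lets the caller check the
 * team size that was actually used.
 */
double region_overhead(long regions, int threads, long *entries)
{
    long count = 0;
    double start = wall_seconds();

    for (long i = 0; i < regions; i++) {
#pragma omp parallel num_threads(threads) reduction(+:count)
        {
            count += 1;   /* every thread in the team adds one */
        }
    }

    double elapsed = wall_seconds() - start;
    *entries = count;
    return regions > 0 ? elapsed / (double)regions : 0.0;
}
```

Built with `gcc -fopenmp` and called as `region_overhead(1000000, 4, &entries)` on both machines, the return value gives a per-region cost that can be compared directly. Without `-fopenmp` the pragma is ignored and the loop runs serially, which gives a baseline for the loop itself.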

Running times in seconds, system A, sequential:

  1. 320 Bits: 11
  2. 640 Bits: 16
  3. 960 Bits: 21
  4. 2560 Bits: 60
  5. 5000 Bits: 152

Running times in seconds, system A, parallel:

  1. 320 Bits: 15
  2. 640 Bits: 16
  3. 960 Bits: 18
  4. 2560 Bits: 32
  5. 5000 Bits: 65

Running times in seconds, system B, sequential:

  1. 320 Bits: 13
  2. 640 Bits: 18
  3. 960 Bits: 27
  4. 2560 Bits: 80
  5. 5000 Bits: 202

Running times in seconds, system B, parallel:

  1. 320 Bits: 51
  2. 640 Bits: 54
  3. 960 Bits: 56
  4. 2560 Bits: 76
  5. 5000 Bits: 128

System A is Fedora 19, kernel 3.11.10-200.fc19.x86_64, on an Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz.

System B is CentOS 6.5, kernel 2.6.32-431.1.2.0.1.el6.x86_64, on an Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz.

ltrace shows about the same percentages for called functions and system calls on both systems. Both systems use the latest gmp, mpfr and gcc versions.

Why is system B so much worse (i.e. many times more OpenMP overhead) than system A? Has the Linux kernel improved that much in this regard? Are there kernel parameters I should look at? Is it down to CPU hardware differences or limitations? Are there other explanations? Do I have to install Fedora 19 on B to fix this?

Update: Thanks for the tip. It did change results for system B.

Running times in seconds, system B, parallel (before -> after):

  1. 320 Bits: 51 -> 23
  2. 640 Bits: 54 -> 26
  3. 960 Bits: 56 -> 29
  4. 2560 Bits: 76 -> 47
  5. 5000 Bits: 128 -> 99

B is still behind A, but the gap has become much smaller.
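To put a number on that gap, the timings above can be turned into speedup ratios (sequential time divided by parallel time). This is just a small sketch: the arrays are copied from the tables in this post, and the helper names `speedup` and `print_speedups` are made up for illustration.

```c
#include <stdio.h>

/* Timings in seconds, copied from the tables in this post. */
const int    bits[5]         = { 320, 640, 960, 2560, 5000 };
const double a_seq[5]        = { 11, 16, 21, 60, 152 };
const double a_par[5]        = { 15, 16, 18, 32, 65 };
const double b_seq[5]        = { 13, 18, 27, 80, 202 };
const double b_par_before[5] = { 51, 54, 56, 76, 128 };
const double b_par_after[5]  = { 23, 26, 29, 47, 99 };

/* Speedup of the parallel run over the sequential run (>1 means faster). */
double speedup(double sequential, double parallel)
{
    return sequential / parallel;
}

/* Prints one row per precision: A, B before the change, B after it. */
void print_speedups(void)
{
    printf("%6s %8s %10s %9s\n", "bits", "A", "B before", "B after");
    for (int i = 0; i < 5; i++) {
        printf("%6d %8.2f %10.2f %9.2f\n",
               bits[i],
               speedup(a_seq[i], a_par[i]),
               speedup(b_seq[i], b_par_before[i]),
               speedup(b_seq[i], b_par_after[i]));
    }
}
```

At 5000 bits this gives roughly 2.34 for A, 1.58 for B before the change and 2.04 for B after it. At 320 bits all three are below 1 (about 0.73, 0.25 and 0.57), so the parallel version is still slower than the sequential one at low precision on both systems.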
