I used a small test program to measure how well OpenMP parallelizes a recursive computation in arbitrary precision with the mpfr/gmp libraries. As expected, OpenMP overhead makes the parallel version slower at low precision, but with enough bits the parallel version becomes faster.
The sequential loop looks like this:
    ....
    for ( i = 0; i < 1000; i++ ) {
        mpfr_set_d ( z1, 0.0, MPFR_RNDN );
        mpfr_set_d ( z2, 0.0, MPFR_RNDN );
        ...
        iter = 0;
        while ( iter < 10000 ) {
            mpfr_sqr ( tmp1, z1, MPFR_RNDN );        /* tmp1 = z1^2 */
            mpfr_sqr ( tmp2, z2, MPFR_RNDN );        /* tmp2 = z2^2 */
            mpfr_sub ( tr, tmp1, tmp2, MPFR_RNDN );  /* tr = z1^2 - z2^2 */
            mpfr_add ( tr, tr, cr, MPFR_RNDN );      /* tr = tr + cr */
            mpfr_mul_2si ( tmp3, z1, 1, MPFR_RNDN ); /* tmp3 = 2 * z1 */
            ...
            iter++;
        }
    }
and the parallel version looks like this:
    ....
    omp_set_dynamic(0);
    for ( i = 0; i < 10; i++ ) {
        mpfr_set_d ( z2, 0.0, MPFR_RNDN );
        mpfr_set_d ( z1, 0.0, MPFR_RNDN );
        ...
        iter = 0;
        while ( iter < 10000 ) {
            /* a parallel region is entered and left on every iteration */
            #pragma omp parallel num_threads(4)
            {
                /* each thread computes one independent part of the step */
                switch ( omp_get_thread_num() ) {
                case 0:
                    mpfr_sqr ( tmp1, z1, MPFR_RNDN );
                    mpfr_sqr ( tmp2, z2, MPFR_RNDN );
                    mpfr_sub ( tr, tmp1, tmp2, MPFR_RNDN );
                    mpfr_add ( tr, tr, cr, MPFR_RNDN );      /* tr = z1^2 - z2^2 + cr */
                    break;
                case 1:
                    mpfr_mul_2si ( tmp3, z1, 1, MPFR_RNDN );
                    mpfr_mul ( ti, tmp3, z2, MPFR_RNDN );
                    mpfr_add ( ti, ti, ci, MPFR_RNDN );      /* ti = 2 * z1 * z2 + ci */
                    break;
                ...
                    mpfr_mul_2si ( tti, tti, 1, MPFR_RNDN );
                    break;
                }
            }
            /* back on the initial thread: copy the results for the next iteration */
            mpfr_set ( z1, tr, MPFR_RNDN );
            mpfr_set ( z2, ti, MPFR_RNDN );
            mpfr_set ( d1, ttr, MPFR_RNDN );
            mpfr_set ( d2, tti, MPFR_RNDN );
            iter++;
        }
    }
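Because the parallel region sits inside the `while` loop, its entry and exit cost is paid once per iteration. One way to put a number on that cost on each machine is to time a near-empty region in isolation. The following is only a minimal sketch under stated assumptions: `now_seconds` and `region_overhead` are hypothetical helper names, not part of the test program above, and the atomic counter inside the region adds a small cost of its own, so the result is an upper bound rather than a pure fork/join time.

```c
#include <assert.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: wall-clock seconds from OpenMP, or CPU time
   from clock() when the file is compiled without OpenMP support. */
static double now_seconds(void) {
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Hypothetical helper: average seconds spent per near-empty parallel
   region. If visits is not NULL, it receives the total number of times
   any thread executed the region body. */
double region_overhead(int regions, int threads, long *visits) {
    assert(regions > 0);
    (void)threads;               /* unused when built without OpenMP */
    long count = 0;
    double start = now_seconds();
    for (int r = 0; r < regions; r++) {
        /* same construct as in the test program, but with almost no work */
        #pragma omp parallel num_threads(threads)
        {
            #pragma omp atomic
            count++;
        }
    }
    double elapsed = now_seconds() - start;
    if (visits != NULL)
        *visits = count;
    return elapsed / regions;
}
```

Built with `gcc -fopenmp`, a call such as `region_overhead(100000, 4, NULL)` on both systems would give a per-region figure to compare; it does not by itself explain a difference between them.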
Running times in seconds, system A, sequential:
- 320 Bits: 11
- 640 Bits: 16
- 960 Bits: 21
- 2560 Bits: 60
- 5000 Bits: 152
Running times in seconds, system A, parallel:
- 320 Bits: 15
- 640 Bits: 16
- 960 Bits: 18
- 2560 Bits: 32
- 5000 Bits: 65
Running times in seconds, system B, sequential:
- 320 Bits: 13
- 640 Bits: 18
- 960 Bits: 27
- 2560 Bits: 80
- 5000 Bits: 202
Running times in seconds, system B, parallel:
- 320 Bits: 51
- 640 Bits: 54
- 960 Bits: 56
- 2560 Bits: 76
- 5000 Bits: 128
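The timings above are easier to compare as speedups, i.e. sequential time divided by parallel time. The sketch below only restates the numbers from the lists; `speedup` and `print_speedups` are hypothetical helper names introduced for illustration. At 5000 bits it gives roughly 2.3 for system A and roughly 1.6 for system B, and system B only gets above 1.0 at 2560 bits.

```c
#include <assert.h>
#include <stdio.h>

/* Speedup = sequential time / parallel time.
   A value above 1.0 means the parallel version is faster. */
double speedup(double sequential_s, double parallel_s) {
    assert(parallel_s > 0.0);
    return sequential_s / parallel_s;
}

/* Timings in seconds, copied from the lists above. */
static const int    bits[5]  = { 320, 640, 960, 2560, 5000 };
static const double a_seq[5] = {  11,  16,  21,   60,  152 };
static const double a_par[5] = {  15,  16,  18,   32,   65 };
static const double b_seq[5] = {  13,  18,  27,   80,  202 };
static const double b_par[5] = {  51,  54,  56,   76,  128 };

/* Print one row per precision with the speedup on each system. */
void print_speedups(void) {
    printf("%6s %10s %10s\n", "bits", "system A", "system B");
    for (int i = 0; i < 5; i++) {
        printf("%6d %10.2f %10.2f\n", bits[i],
               speedup(a_seq[i], a_par[i]),
               speedup(b_seq[i], b_par[i]));
    }
}
```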
System A: Fedora 19, kernel 3.11.10-200.fc19.x86_64, Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
System B: CentOS 6.5, kernel 2.6.32-431.1.2.0.1.el6.x86_64, Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
ltrace shows about the same percentages for called functions and system calls on both. Both systems use the latest gmp, mpfr and gcc versions. Why is system B so much worse (e.g. many times more OpenMP overhead) than system A? Has the Linux kernel improved that much in this regard? Are there kernel parameters or similar settings I should look at? Are CPU hardware differences or limitations responsible? Are there any other explanations? Do I have to install Fedora 19 on B to fix this?
Update: Thanks for the tip. It did change the results for system B.
Running times in seconds, system B, parallel (before -> after):
- 320 Bits: 51 -> 23
- 640 Bits: 54 -> 26
- 960 Bits: 56 -> 29
- 2560 Bits: 76 -> 47
- 5000 Bits: 128 -> 99
B is still behind A, but the gap has narrowed considerably.