I am trying to perform some computation asynchronously with an I/O-bound operation. To do that, I use a pthread in which a loop is parallelized using OpenMP. However, this results in a performance degradation compared to the case where I perform the I/O-bound operation in a pthread, or when I do not create a pthread at all.
The minimal example that demonstrates the behavior (see below) uses usleep to simulate the I/O-bound task. The output of this program is:
g++ simple.cpp -O3 -fopenmp
export OMP_PROC_BIND=false
./a.out
No pthread: 34.0884
Compute from pthread: 36.6323
Compute from main: 34.1188
I do not understand why there should be a performance penalty for using OpenMP inside the created pthread. What is even more surprising to me is the following behavior.
export OMP_PROC_BIND=true
./a.out
No pthread: 34.0635
Compute from pthread: 60.6081
Compute from main: 34.0347
Why is there a factor of two in run time?
The source code is as follows
#include <iostream>
#include <unistd.h>
#include <pthread.h>
#include <omp.h>
using namespace std;

int n = 1000000000;
int rep = 100;
double elapsed;

// OpenMP-parallelized kernel; accumulates the total compute time in `elapsed`.
void compute() {
    double t = omp_get_wtime();
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < n; i++)
        s += 1 / double(i + 1);
    elapsed += omp_get_wtime() - t;
}

// Thread entry point: run the computation from the pthread.
void* run(void* threadid) {
    compute();
    return NULL;
}

// Thread entry point: simulate the I/O-bound task.
void* wait(void* threadid) {
    usleep(150000);   // 150 ms
    return NULL;
}

int main() {
    // Case 1: no pthread, compute on the main thread only.
    elapsed = 0;
    for (int k = 0; k < rep; k++)
        compute();
    cout << "No pthread: " << elapsed << endl;

    // Case 2: compute in a freshly created pthread, I/O on the main thread.
    elapsed = 0;
    for (int k = 0; k < rep; k++) {
        pthread_t t; void* status;
        pthread_create(&t, NULL, run, NULL);
        usleep(150000);
        pthread_join(t, &status);
    }
    cout << "Compute from pthread: " << elapsed << endl;

    // Case 3: I/O in a freshly created pthread, compute on the main thread.
    elapsed = 0;
    for (int k = 0; k < rep; k++) {
        pthread_t t; void* status;
        pthread_create(&t, NULL, wait, NULL);
        compute();
        pthread_join(t, &status);
    }
    cout << "Compute from main: " << elapsed << endl;
}
First of all, a fair warning - creating parallel regions from threads not managed by OpenMP is non-standard behaviour and could result in fairly non-portable applications.
The GNU OpenMP runtime library (libgomp) uses a pool of threads to speed up the creation of thread teams. The pool is associated with the master thread of the parallel region by storing a reference to it in a special structure in that thread's TLS (thread-local storage). When a parallel region is created, libgomp consults the TLS of the master thread in order to find the thread pool. If no thread pool exists there, i.e. it is the first time that thread creates a parallel region, a new pool is created.
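To illustrate the pattern, here is a conceptual sketch of the mechanism described above - it is not libgomp's actual code, and the names thread_pool, tls_pool and get_team_pool are made up for the illustration:

#include <vector>
#include <thread>

struct thread_pool {
    std::vector<std::thread> workers;   // the persistent team threads
    // ... synchronisation state omitted ...
};

// Each OS thread caches "its" pool in thread-local storage.
static thread_local thread_pool* tls_pool = nullptr;

thread_pool* get_team_pool() {
    if (tls_pool == nullptr)           // first parallel region created by this thread
        tls_pool = new thread_pool();  // expensive: the worker threads get spawned here
    return tls_pool;                   // later regions on the same thread reuse the pool
}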
When you call compute() from the main thread, libgomp finds the reference to the thread pool in the TLS every time a new parallel region is created and reuses it, so the price of spawning the threads is paid only the first time a parallel region is created.

When you call compute() from a second thread, libgomp fails to find the special structure in the TLS and creates a new one together with a new thread pool. Once the thread terminates, a special destructor takes care of terminating the thread pool. Therefore, the thread pool gets created every time you call compute(), simply because you spawn a new pthread in each iteration of the outer loop. An overhead of 20 ms per iteration (the 2-second difference over 100 iterations) is quite a typical value for such operations.
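If the goal is simply to overlap the computation with the I/O-bound work without paying that cost repeatedly, one option is to keep the computing pthread alive across iterations, so that its thread pool is created only once. A rough sketch of that restructuring (the run_all name and the reorganised loop are mine, not from your code, and note that it drops the per-iteration join, so it is not an exact drop-in replacement):

void* run_all(void* threadid) {
    // All rep parallel regions are created from this one long-lived thread,
    // so its thread pool is built on the first compute() call and then reused.
    for (int k = 0; k < rep; k++)
        compute();
    return NULL;
}

// in main(), replacing the second timed loop:
// pthread_t t; void* status;
// pthread_create(&t, NULL, run_all, NULL);
// for (int k = 0; k < rep; k++)
//     usleep(150000);   // the I/O-bound part stays on the main thread
// pthread_join(t, &status);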
I fail to reproduce the case with OMP_PROC_BIND=true - the program runs basically the same as without it, even a tad better because of the binding. It is probably something specific to your combination of GCC version, OS, and CPU. Note that the parallel for construct spreads the work evenly between all threads of the team. If one of them gets delayed, e.g. by having to timeshare its CPU core with other processes, it delays the completion of the entire parallel region. With OMP_PROC_BIND=true, the OS scheduler is not allowed to move the OpenMP threads, so they have to timeshare their CPU cores with threads from other processes. If one such external thread uses a lot of CPU, it does more harm when the threads are bound than when they are not, since in the latter case the scheduler can move the affected thread(s) away. In other words, in that particular case it is better to have the whole program slowed down by a factor of #cores / (#cores - 1), when all OpenMP threads have to share all but one of the cores, than to have it delayed by 100% because one bound thread has to share its CPU core 50/50 with the external thread. Of course, one then pays the price of the expensive cross-core moves, but it can still lead to better utilisation if the external influence is too disruptive.
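One way to check whether binding (or an external CPU hog) is what hurts the pthread case is to print, from inside a parallel region, which core each team thread is running on. sched_getcpu() is a glibc/Linux-specific call, so treat this as a diagnostic sketch rather than portable code:

#include <cstdio>
#include <sched.h>   // sched_getcpu(), Linux-specific
#include <omp.h>

void report_placement() {
    #pragma omp parallel
    {
        // Each team thread reports the core it is currently running on.
        printf("thread %d on core %d\n", omp_get_thread_num(), sched_getcpu());
    }
}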