I just found out that while this C code gives an ordered list of integers (as expected):
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
int main() {
    #pragma omp parallel for ordered schedule(dynamic)
    for (int i=0; i<10; i++) {
        #pragma omp ordered
        {
            printf("%i (tid=%i)\n",i,omp_get_thread_num()); fflush(stdout);
        }
    }
}
the following gives undesired behaviour with both gcc and icc:
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
int main() {
    #pragma omp parallel for ordered schedule(dynamic)
    for (int i=0; i<10; i++) {
        #pragma omp ordered
        {
            printf("%i (tid=%i)\n",i,omp_get_thread_num()); fflush(stdout);
        }
        usleep(100*omp_get_thread_num());
        printf("WORK IS DONE (tid=%i)\n",omp_get_thread_num()); fflush(stdout);
        usleep(100*omp_get_thread_num());
        #pragma omp ordered
        {
            printf(" %i (tid=%i)\n",i,omp_get_thread_num()); fflush(stdout);
        }
    }
}
What I'd love to see is:
0
1
2
3
4
5
6
7
8
9
WORK IS DONE
WORK IS DONE
WORK IS DONE
WORK IS DONE
WORK IS DONE
WORK IS DONE
WORK IS DONE
WORK IS DONE
WORK IS DONE
WORK IS DONE
0
1
2
3
4
5
6
7
8
9
But with gcc I get:
0 (tid=5)
WORK IS DONE (tid=5)
0 (tid=5)
1 (tid=2)
WORK IS DONE (tid=2)
1 (tid=2)
2 (tid=0)
WORK IS DONE (tid=0)
2 (tid=0)
3 (tid=6)
WORK IS DONE (tid=6)
3 (tid=6)
4 (tid=7)
WORK IS DONE (tid=7)
4 (tid=7)
5 (tid=3)
WORK IS DONE (tid=3)
5 (tid=3)
6 (tid=4)
WORK IS DONE (tid=4)
6 (tid=4)
7 (tid=1)
WORK IS DONE (tid=1)
7 (tid=1)
8 (tid=5)
WORK IS DONE (tid=5)
8 (tid=5)
9 (tid=2)
WORK IS DONE (tid=2)
9 (tid=2)
(so everything gets ordered, even the parallelizable work part)
And with icc:
1 (tid=0)
2 (tid=5)
3 (tid=1)
4 (tid=2)
WORK IS DONE (tid=1)
WORK IS DONE (tid=3)
3 (tid=1)
6 (tid=4)
7 (tid=7)
8 (tid=1)
WORK IS DONE (tid=0)
5 (tid=6)
WORK IS DONE (tid=2)
1 (tid=0)
9 (tid=0)
WORK IS DONE (tid=0)
WORK IS DONE (tid=5)
WORK IS DONE (tid=1)
9 (tid=0)
0 (tid=3)
8 (tid=1)
WORK IS DONE (tid=4)
WORK IS DONE (tid=6)
2 (tid=5)
WORK IS DONE (tid=7)
6 (tid=4)
5 (tid=6)
4 (tid=2)
7 (tid=7)
(so nothing gets ordered, not even the ordered regions)
Is using multiple ordered regions within one ordered loop undefined behaviour, or what is going on here? I couldn't find anything disallowing multiple ordered regions per loop in any of the OpenMP documentation I looked at.
I know that in this trivial example I could just split the loop into three, like
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
int main() {
    for (int i=0; i<10; i++) {
        printf("%i (tid=%i)\n",i,omp_get_thread_num()); fflush(stdout);
    }
    #pragma omp parallel for schedule(dynamic)
    for (int i=0; i<10; i++) {
        usleep(100*omp_get_thread_num());
        printf("WORK IS DONE (tid=%i)\n",omp_get_thread_num()); fflush(stdout);
        usleep(100*omp_get_thread_num());
    }
    for (int i=0; i<10; i++) {
        printf(" %i (tid=%i)\n",i,omp_get_thread_num()); fflush(stdout);
    }
}
So I'm not looking for a workaround. I really want to understand what is going on here, so that I can handle the real situation without running into anything devastating/unexpected.
I really hope you can help me.
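According to the OpenMP 4.0 API specification, you can't. The restrictions on the ordered construct say that, while executing one iteration of a loop, a thread must not execute more than one ordered region that binds to that loop. Your second program does exactly that, so it is non-conforming and its behaviour is unspecified, which is why gcc and icc are free to produce such different results.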
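If it helps to see a form that stays within the specification, here is a minimal sketch. It assumes the three phases can be separated: each ordered phase gets its own worksharing loop inside a single parallel region, so no iteration ever enters more than one ordered region. The helper names (`run_three_phases`, `is_iteration_order`) and the `i * i` stand-in for the real work are hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stdio.h>

#define N 10

/* Hypothetical helper: returns 1 if a[0..n-1] is exactly 0, 1, ..., n-1. */
static int is_iteration_order(const int *a, int n) {
    for (int i = 0; i < n; i++)
        if (a[i] != i) return 0;
    return 1;
}

/* Hypothetical helper: runs "ordered print, parallel work, ordered print"
   as three separate loops and records the order of the ordered regions. */
static void run_three_phases(int first[N], int work[N], int second[N]) {
    int n_first = 0, n_second = 0;   /* shared; only touched inside ordered regions */

    #pragma omp parallel
    {
        /* Phase 1: exactly one ordered region per iteration. */
        #pragma omp for ordered schedule(dynamic)
        for (int i = 0; i < N; i++) {
            #pragma omp ordered
            {
                first[n_first++] = i;
                printf("%i\n", i); fflush(stdout);
            }
        }   /* implicit barrier: phase 1 finishes before phase 2 starts */

        /* Phase 2: the parallelizable work, no ordering at all. */
        #pragma omp for schedule(dynamic)
        for (int i = 0; i < N; i++) {
            work[i] = i * i;         /* stand-in for the real work */
        }

        /* Phase 3: again exactly one ordered region per iteration. */
        #pragma omp for ordered schedule(dynamic)
        for (int i = 0; i < N; i++) {
            #pragma omp ordered
            {
                second[n_second++] = i;
                printf(" %i\n", i); fflush(stdout);
            }
        }
    }

    assert(n_first == N && n_second == N);
}
```

Calling `run_three_phases(first, work, second)` with three `int[N]` arrays should leave `first` and `second` filled in iteration order. The threads are created only once, and the implicit barrier at the end of each `omp for` keeps the phases from overlapping.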