I've been doing some tests on OpenMP and made this program that should not scale because of false sharing of the array "sum". The problem I have is that it does scale. Even "worse":
- with 1 thread: 4 seconds (icpc), 4 seconds (g++)
- with 2 threads: 2 seconds (icpc), 2 seconds (g++)
- with 4 thread: 0.5 seconds (icpc), 1 seconds (g++)
I really don't get the speedup I get from 2 threads to 4 threads with the Intel compilers. But the most important is: why is scaling so good even though it should exhibit false sharing?
#include <iostream>
#include <chrono>
#include <array>
#include <omp.h>
int main(int argc, const char *argv[])
{
const auto nb_threads = std::size_t{4};
omp_set_num_threads(nb_threads);
const auto num_steps = std::size_t{1000000000};
const auto step = double{1.0 / num_steps};
auto sum = std::array<double, nb_threads>{0.0};
std::size_t actual_nb_threads;
auto start_time = std::chrono::high_resolution_clock::now();
#pragma omp parallel
{
const auto id = std::size_t{omp_get_thread_num()};
if (id == 0) {
// This is needed because OMP might give us less threads
// than the numbers of threads requested
actual_nb_threads = omp_get_num_threads();
}
for (auto i = std::size_t{0}; i < num_steps; i += nb_threads) {
auto x = double{(i + 0.5) * step};
sum[id] += 4.0 / (1.0 + x * x);
}
}
auto pi = double{0.0};
for (auto id = std::size_t{0}; id < actual_nb_threads; id++) {
pi += step * sum[id];
}
auto end_time = std::chrono::high_resolution_clock::now();
auto time = std::chrono::duration_cast<std::chrono::nanoseconds>(end_time - start_time).count();
std::cout << "Pi: " << pi << std::endl;
std::cout << "Time: " << time / 1.0e9 << " seconds" << std::endl;
std::cout << "Total nb of threads actually used: " << actual_nb_threads << std::endl;
return 0;
}
That code definitely could exhibit false sharing, if the compiler chose to implement it that way. But that would be a silly thing for the compiler to do.
In the first loop, each thread only accesses one element of
sum
. There's no reason to makenum_steps
writes to the actual stack memory storing that element; it's much faster to just keep the value in a register, and write it back after the for-loop is over. Since the array is not volatile or atomic, there's nothing stopping the compiler from behaving in this way.And, of course, in the second loop there's no writing to the array, so no false sharing.