Increased speed despite false sharing

Question

Increased speed despite false sharing

406 views Asked by InsideLoop At 08 June 2015 at 09:07

I've been doing some tests on OpenMP and made this program that should not scale because of false sharing of the array "sum". The problem I have is that it does scale. Even "worse":

with 1 thread: 4 seconds (icpc), 4 seconds (g++)
with 2 threads: 2 seconds (icpc), 2 seconds (g++)
with 4 thread: 0.5 seconds (icpc), 1 seconds (g++)

I really don't get the speedup I get from 2 threads to 4 threads with the Intel compilers. But the most important is: why is scaling so good even though it should exhibit false sharing?

#include <iostream>
#include <chrono>

#include <array>

#include <omp.h>

int main(int argc, const char *argv[])
{
    const auto nb_threads = std::size_t{4};
    omp_set_num_threads(nb_threads);

    const auto num_steps = std::size_t{1000000000};
    const auto step = double{1.0 / num_steps};
    auto sum = std::array<double, nb_threads>{0.0};
    std::size_t actual_nb_threads;

    auto start_time = std::chrono::high_resolution_clock::now();
    #pragma omp parallel
    {
        const auto id = std::size_t{omp_get_thread_num()};
        if (id == 0) {
            // This is needed because OMP might give us less threads
            // than the numbers of threads requested
            actual_nb_threads = omp_get_num_threads();
        }
        for (auto i = std::size_t{0}; i < num_steps; i += nb_threads) {
            auto x = double{(i + 0.5) * step};
            sum[id] += 4.0 / (1.0 + x * x);
        }
    }
    auto pi = double{0.0};
    for (auto id = std::size_t{0}; id < actual_nb_threads; id++) {
        pi += step * sum[id];
    }
    auto end_time = std::chrono::high_resolution_clock::now();
    auto time = std::chrono::duration_cast<std::chrono::nanoseconds>(end_time - start_time).count();

    std::cout << "Pi: " << pi << std::endl;
    std::cout << "Time: " << time / 1.0e9 << " seconds" << std::endl;
    std::cout << "Total nb of threads actually used: " << actual_nb_threads << std::endl;

    return 0;
}

Original Q&A

There are 1 answers

**Sneftel** · Answer 1 · 2015-06-08T09:18:06+00:00

That code definitely could exhibit false sharing, if the compiler chose to implement it that way. But that would be a silly thing for the compiler to do.

In the first loop, each thread only accesses one element of sum. There's no reason to make num_steps writes to the actual stack memory storing that element; it's much faster to just keep the value in a register, and write it back after the for-loop is over. Since the array is not volatile or atomic, there's nothing stopping the compiler from behaving in this way.

And, of course, in the second loop there's no writing to the array, so no false sharing.

TechQA.

Increased speed despite false sharing

There are 1 answers

Related Questions in C++

Related Questions in MULTITHREADING

Related Questions in OPENMP

Related Questions in FALSE-SHARING

Popular Questions

Trending Questions