Erroneous single thread memory bandwidth benchmark

284 views Asked by At

In an attempt to measure the bandwidth of the main memory, I have come up with the following approach.

Code (for the Intel compiler)

#include <omp.h>

#include <iostream> // std::cout
#include <limits> // std::numeric_limits
#include <cstdlib> // std::free
#include <unistd.h> // sysconf
#include <stdlib.h> // posix_memalign
#include <random> // std::mt19937


int main()
{
    // test-parameters
    const auto size = std::size_t{150 * 1024 * 1024} / sizeof(double);
    const auto experiment_count = std::size_t{500};
    
    //+/////////////////
    // access a data-point 'on a whim'
    //+/////////////////
    
    // warm-up
    for (auto counter = std::size_t{}; counter < experiment_count / 2; ++counter)
    {
        // garbage data allocation and memory page loading
        double* data = nullptr;
        posix_memalign(reinterpret_cast<void**>(&data), sysconf(_SC_PAGESIZE), size * sizeof(double));
        if (data == nullptr)
        {
            std::cerr << "Fatal error! Unable to allocate memory." << std::endl;
            std::abort();
        }
        //#pragma omp parallel for simd safelen(8) schedule(static)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = -1.0;
        }
        
        //#pragma omp parallel for simd safelen(8) schedule(static)
        #pragma omp simd safelen(8)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = 10.0;
        }
        
        // deallocate resources
        free(data);
    }
    
    // timed run
    auto min_duration = std::numeric_limits<double>::max();
    for (auto counter = std::size_t{}; counter < experiment_count; ++counter)
    {
        // garbage data allocation and memory page loading
        double* data = nullptr;
        posix_memalign(reinterpret_cast<void**>(&data), sysconf(_SC_PAGESIZE), size * sizeof(double));
        if (data == nullptr)
        {
            std::cerr << "Fatal error! Unable to allocate memory." << std::endl;
            std::abort();
        }
        //#pragma omp parallel for simd safelen(8) schedule(static)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = -1.0;
        }
        
        const auto dur1 = omp_get_wtime() * 1E+6;
        //#pragma omp parallel for simd safelen(8) schedule(static)
        #pragma omp simd safelen(8)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = 10.0;
        }
        const auto dur2 = omp_get_wtime() * 1E+6;
        const auto run_duration = dur2 - dur1;
        if (run_duration < min_duration)
        {
            min_duration = run_duration;
        }
        
        // deallocate resources
        free(data);
    }
    
    // REPORT
    const auto traffic = size * sizeof(double) * 2; // 1x load, 1x write
    std::cout << "Using " << omp_get_max_threads() << " threads. Minimum duration: " << min_duration << " us;\n"
        << "Maximum bandwidth: " << traffic / min_duration * 1E-3 << " GB/s;" << std::endl;
    
    return 0;
}

Notes on code

  1. Assumed to be a 'naive' approach, also linux-only. Should still serve as a rough indicator of model performance
  2. using ICC with compiler flags -O3 -ffast-math -march=coffeelake
  3. size (150 MiB) is much bigger than lowest level cache of system (9 MiB on i5-8400 Coffee Lake), with 2x 16 GiB DIMM DDR4 3200 MT/s
  4. new allocations on each iteration should invalidate all cache-lines from the previous one (to eliminate cache hits)
  5. minimum latency is recorded to counter-act the effects of interrupts and OS-scheduling: threads being taken off cores for a short while etc.
  6. a warm-up run is done to counter-act the effects of dynamic frequency scaling (kernel feature, can alternatively be turned off by using the userspace governor).

Results of code

On my machine, I am getting 90 GB/s. Intel Advisor, which runs its own benchmarks, has calculated or measured this bandwidth to actually be 25 GB/s. (See my previous question: Intel Advisor's bandwidth information where a previous version of this code was getting page-faults inside the timed region.)

Assembly: here's a link to the assembly generated for the above code: https://godbolt.org/z/Ma7PY49bE

I am not able to understand how I'm getting such an unreasonably high result with my bandwidth. Any tips to help facilitate my understanding would be greatly appreciated.

1

There are 1 answers

4
Nitin Malapally On

Actually, the question seems to be, "why is the obtained bandwidth so high?", to which I have gotten quite a lot of input from @PeterCordes and @Sebastian. This information needs to be digested in its own time.

I can still offer an auxiliary 'answer' to the topic of interest. By substituting the write operation (which, as I now understand, cannot be properly modeled in a benchmark without delving into the assembly) by a cheap e.g. a bitwise operation, we can prevent the compiler from doing its job a little too well.

Updated code

#include <omp.h>

#include <iostream> // std::cout
#include <limits> // std::numeric_limits
#include <cstdlib> // std::free
#include <unistd.h> // sysconf
#include <stdlib.h> // posix_memalign


int main()
{
    // test-parameters
    const auto size = std::size_t{100 * 1024 * 1024};
    const auto experiment_count = std::size_t{250};
    
    //+/////////////////
    // access a data-point 'on a whim'
    //+/////////////////
    
    // allocate for exp. data and load the memory pages
    char* data = nullptr;
    posix_memalign(reinterpret_cast<void**>(&data), sysconf(_SC_PAGESIZE), size);
    if (data == nullptr)
    {
        std::cerr << "Fatal error! Unable to allocate memory." << std::endl;
        std::abort();
    }
    for (auto index = std::size_t{}; index < size; ++index)
    {
        data[index] = 0;
    }
    
    // timed run
    auto min_duration = std::numeric_limits<double>::max();
    for (auto counter = std::size_t{}; counter < experiment_count; ++counter)
    {
        // run
        const auto dur1 = omp_get_wtime() * 1E+6;
        #pragma omp parallel for simd safelen(8) schedule(static)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] ^= 1;
        }
        const auto dur2 = omp_get_wtime() * 1E+6;
        const auto run_duration = dur2 - dur1;
        if (run_duration < min_duration)
        {
            min_duration = run_duration;
        }
    }
    
    // deallocate resources
    free(data);
        
    // REPORT
    const auto traffic = size * 2; // 1x load, 1x write
    std::cout << "Using " << omp_get_max_threads() << " threads. Minimum duration: " << min_duration << " us;\n"
        << "Maximum bandwidth: " << traffic / min_duration * 1E-3 << " GB/s;" << std::endl;
    
    return 0;
}

The benchmark remains a 'naive' one and shall only serve as an indicator of the model's performance (as opposed to a program which can exactly calculate the memory bandwidth).

With the updated code, I get 24 GiB/s for single thread and 37 GiB/s when all 6 cores get involved. When compared to Intel Advisor's measured values of 25.5 GiB/s and 37.5 GiB/s, I think this is acceptable.

@PeterCordes I have retained the warm-up loop to once do an exactly identical run of the whole procedure so as to counter-act against effects unknown (healthy programmer's paranoia).

Edit In this case, the warm-up loop is indeed redundant because the minimum duration is being clocked.