In an attempt to measure the bandwidth of the main memory, I have come up with the following approach.
Code (for the Intel compiler)
#include <omp.h>
#include <iostream> // std::cout, std::cerr
#include <limits> // std::numeric_limits
#include <cstdlib> // std::free, std::abort
#include <unistd.h> // sysconf
#include <stdlib.h> // posix_memalign
#include <random> // std::mt19937

int main()
{
    // test-parameters
    const auto size = std::size_t{150 * 1024 * 1024} / sizeof(double);
    const auto experiment_count = std::size_t{500};

    //+/////////////////
    // access a data-point 'on a whim'
    //+/////////////////

    // warm-up
    for (auto counter = std::size_t{}; counter < experiment_count / 2; ++counter)
    {
        // garbage data allocation and memory page loading
        double* data = nullptr;
        // posix_memalign reports failure through its return value, so check that as well
        const auto status = posix_memalign(reinterpret_cast<void**>(&data),
                                           sysconf(_SC_PAGESIZE), size * sizeof(double));
        if (status != 0 || data == nullptr)
        {
            std::cerr << "Fatal error! Unable to allocate memory." << std::endl;
            std::abort();
        }
        //#pragma omp parallel for simd safelen(8) schedule(static)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = -1.0;
        }

        //#pragma omp parallel for simd safelen(8) schedule(static)
        #pragma omp simd safelen(8)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = 10.0;
        }

        // deallocate resources
        free(data);
    }

    // timed run
    auto min_duration = std::numeric_limits<double>::max();
    for (auto counter = std::size_t{}; counter < experiment_count; ++counter)
    {
        // garbage data allocation and memory page loading
        double* data = nullptr;
        // posix_memalign reports failure through its return value, so check that as well
        const auto status = posix_memalign(reinterpret_cast<void**>(&data),
                                           sysconf(_SC_PAGESIZE), size * sizeof(double));
        if (status != 0 || data == nullptr)
        {
            std::cerr << "Fatal error! Unable to allocate memory." << std::endl;
            std::abort();
        }
        //#pragma omp parallel for simd safelen(8) schedule(static)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = -1.0;
        }

        // timestamps in microseconds
        const auto dur1 = omp_get_wtime() * 1E+6;
        //#pragma omp parallel for simd safelen(8) schedule(static)
        #pragma omp simd safelen(8)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = 10.0;
        }
        const auto dur2 = omp_get_wtime() * 1E+6;

        const auto run_duration = dur2 - dur1;
        if (run_duration < min_duration)
        {
            min_duration = run_duration;
        }

        // deallocate resources
        free(data);
    }

    // REPORT
    // assumes 1x load (write-allocate) + 1x write per element
    const auto traffic = size * sizeof(double) * 2;
    // bytes per microsecond equals MB/s; the factor 1E-3 converts that to GB/s
    std::cout << "Using " << omp_get_max_threads() << " threads. Minimum duration: " << min_duration << " us;\n"
              << "Maximum bandwidth: " << traffic / min_duration * 1E-3 << " GB/s;" << std::endl;
    return 0;
}
Notes on code
- This is admittedly a 'naive' approach, and Linux-only. It should still serve as a rough indicator of model performance.
- using ICC with the compiler flags -O3 -ffast-math -march=coffeelake
- size (150 MiB) is much bigger than the last-level cache of the system (9 MiB on the i5-8400 Coffee Lake); main memory is 2x 16 GiB DDR4 DIMMs rated at 3200 MT/s
- new allocations on each iteration should invalidate all cache-lines from the previous one (to eliminate cache hits)
- the minimum duration is recorded to counteract the effects of interrupts and OS scheduling (threads being taken off cores for a short while, etc.)
- a warm-up run is done to counteract the effects of dynamic frequency scaling (a kernel feature, which can alternatively be turned off by using the userspace governor).
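The "keep the minimum over many runs" idea from the notes above can be isolated into a small helper. This is only a sketch under assumptions: `min_duration_us` is a hypothetical name that does not appear in the original code, and it uses `std::chrono::steady_clock` instead of `omp_get_wtime` so that it compiles without OpenMP.

```cpp
#include <chrono>
#include <cstddef>
#include <limits>

// Hypothetical helper (not part of the original post): runs `kernel`
// `repeats` times and returns the shortest wall-clock duration in
// microseconds. Slow outliers caused by interrupts or rescheduling are
// discarded because only the minimum is kept.
template <typename Kernel>
double min_duration_us(Kernel&& kernel, std::size_t repeats)
{
    using clock = std::chrono::steady_clock;
    auto best = std::numeric_limits<double>::max();
    for (std::size_t run = 0; run < repeats; ++run)
    {
        const auto start = clock::now();
        kernel();
        const auto stop = clock::now();
        const std::chrono::duration<double, std::micro> elapsed = stop - start;
        if (elapsed.count() < best)
        {
            best = elapsed.count();
        }
    }
    return best;
}
```

It could be called with a lambda that wraps the timed store loop, for example `min_duration_us([&] { /* timed loop */ }, 500)`. Allocation and page-touching would still have to happen outside the lambda so that they stay out of the timed region.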
Results of code
On my machine, I am getting 90 GB/s. Intel Advisor, which runs its own benchmarks, has calculated or measured this bandwidth to actually be 25 GB/s. (See my previous question: Intel Advisor's bandwidth information where a previous version of this code was getting page-faults inside the timed region.)
Assembly: here's a link to the assembly generated for the above code: https://godbolt.org/z/Ma7PY49bE
I do not understand how I am getting such an unreasonably high bandwidth result. Any tips to help my understanding would be greatly appreciated.
Actually, the question seems to be, "why is the obtained bandwidth so high?", to which I have gotten quite a lot of input from @PeterCordes and @Sebastian. This information needs to be digested in its own time.
I can still offer an auxiliary 'answer' to the topic of interest. By replacing the write operation (which, as I now understand, cannot be properly modeled in a benchmark without delving into the assembly) with a cheap operation, e.g. a bitwise one, we can prevent the compiler from doing its job a little too well.
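For illustration only, a read-only kernel built around a cheap bitwise operation might look like the sketch below. This is not the author's updated code: `xor_fold` is a hypothetical name, and the sketch assumes 64-bit IEEE-754 doubles. Every element has to be loaded to compute the result, so the loop cannot be dropped as long as the returned value is actually used (for example, printed).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::uint64_t), "sketch assumes 64-bit doubles");

// Hypothetical read-only kernel (not from the original post): XOR-folds the
// bit patterns of all elements into one 64-bit value. Each element is read
// exactly once and nothing is written back to the array.
std::uint64_t xor_fold(const double* data, std::size_t size)
{
    std::uint64_t acc = 0;
    for (std::size_t index = 0; index < size; ++index)
    {
        std::uint64_t bits;
        std::memcpy(&bits, &data[index], sizeof bits); // well-defined type pun
        acc ^= bits;
    }
    return acc;
}
```

With a kernel like this, the traffic per run would be counted as one load per element (`size * sizeof(double)`) rather than one load plus one write.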
Updated code
The benchmark remains a 'naive' one and shall only serve as an indicator of the model's performance (as opposed to a program which can exactly calculate the memory bandwidth).
With the updated code, I get 24 GiB/s for a single thread and 37 GiB/s when all 6 cores are involved. Compared to Intel Advisor's measured values of 25.5 GiB/s and 37.5 GiB/s, I think this is acceptable.
@PeterCordes I have retained the warm-up loop, which performs one exactly identical run of the whole procedure, to guard against unknown effects (healthy programmer's paranoia).
Edit: In this case, the warm-up loop is indeed redundant because only the minimum duration is recorded.