Intel Advisor's bandwidth information

While using Intel Advisor's roofline analysis view, we are presented with bandwidth information for the different data paths of the system, i.e. DRAM and the L3, L2 and L1 caches. The program claims that it measures these bandwidths on the hardware at hand, i.e. they are not theoretical estimates or information obtained from the OS.

Question

Why is the DRAM bandwidth 25 GB/s for a single thread?

[Screenshot: Intel Advisor roofline chart showing the measured memory bandwidth roofs]

Code (for Intel compiler)

In order to see how much data the machine can move in the shortest possible time using all the computational resources available, one could conceptualize a first attempt:

    // requires <cstddef>, <stdlib.h> (posix_memalign), <cstdlib> (std::abort), <iostream>,
    // <limits>, <unistd.h> (sysconf) and <omp.h> (omp_get_wtime)

    // test-parameters
    const auto size = std::size_t{50 * 1024 * 1024} / sizeof(double);
    const auto experiment_count = std::size_t{500};
    
    //+/////////////////
    // access a data-point 'on a whim'
    //+/////////////////
    
    // warm-up
    for (auto counter = std::size_t{}; counter < experiment_count / 2; ++counter)
    {
        // garbage data allocation and memory page loading
        double* data = nullptr;
        posix_memalign(reinterpret_cast<void**>(&data), sysconf(_SC_PAGESIZE), size * sizeof(double));
        if (data == nullptr)
        {
            std::cerr << "Fatal error! Unable to allocate memory." << std::endl;
            std::abort();
        }
        //#pragma omp parallel for simd safelen(8) schedule(static)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = -1.0;
        }

        // clear cache
        double* cache_clearer = nullptr;
        posix_memalign(reinterpret_cast<void**>(&cache_clearer), sysconf(_SC_PAGESIZE), size * sizeof(double));
        if (cache_clearer == nullptr)
        {
            std::cerr << "Fatal error! Unable to allocate memory." << std::endl;
            std::abort();
        }
        //#pragma omp parallel for simd safelen(8) schedule(static)
        #pragma optimize("", off)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            cache_clearer[index] = -1.0;
        }
        #pragma optimize("", on)
        
        //#pragma omp parallel for simd safelen(8) schedule(static)
        #pragma omp simd safelen(8)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = 10.0;
        }
        
        // deallocate resources
        free(data);
        free(cache_clearer);
    }
    
    // timed run
    auto min_duration = std::numeric_limits<double>::max();
    for (auto counter = std::size_t{}; counter < experiment_count; ++counter)
    {
        // garbage data allocation and memory page loading
        double* data = nullptr;
        posix_memalign(reinterpret_cast<void**>(&data), sysconf(_SC_PAGESIZE), size * sizeof(double));
        if (data == nullptr)
        {
            std::cerr << "Fatal error! Unable to allocate memory." << std::endl;
            std::abort();
        }
        //#pragma omp parallel for simd safelen(8) schedule(static)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = -1.0;
        }

        // clear cache
        double* cache_clearer = nullptr;
        posix_memalign(reinterpret_cast<void**>(&cache_clearer), sysconf(_SC_PAGESIZE), size * sizeof(double));
        if (cache_clearer == nullptr)
        {
            std::cerr << "Fatal error! Unable to allocate memory." << std::endl;
            std::abort();
        }
        //#pragma omp parallel for simd safelen(8) schedule(static)
        #pragma optimize("", off)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            cache_clearer[index] = -1.0;
        }
        #pragma optimize("", on)
        
        const auto dur1 = omp_get_wtime() * 1E+6;
        //#pragma omp parallel for simd safelen(8) schedule(static)
        #pragma omp simd safelen(8)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = 10.0;
        }
        const auto dur2 = omp_get_wtime() * 1E+6;
        const auto run_duration = dur2 - dur1;
        if (run_duration < min_duration)
        {
            min_duration = run_duration;
        }
        
        // deallocate resources
        free(data);
        free(cache_clearer);
    }
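
The snippet only records min_duration; the bandwidth figure quoted below is then derived from it. A minimal sketch of that last step (assuming the figure is simply the bytes stored in the timed loop divided by the fastest observed time) could look like:

    // derive the bandwidth from the fastest observed run
    // (min_duration is in microseconds, size counts doubles)
    const auto bytes_stored = static_cast<double>(size * sizeof(double));
    const auto seconds = min_duration * 1E-6;
    const auto bandwidth_in_gb_per_s = bytes_stored / seconds / 1E+9;
    std::cout << "max. bandwidth: " << bandwidth_in_gb_per_s << " GB/s" << std::endl;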

Notes on code:

  1. This is admittedly a 'naive' approach, and Linux-only, but it should still serve as a rough indicator of achievable performance,
  2. using compiler flags -O3 -ffast-math -march=native,
  3. size is chosen to be larger than the last-level cache of the system (here 50 MB),
  4. new allocations on each iteration should invalidate all cache-lines from the previous one (to eliminate cache hits),
  5. the minimum duration is recorded to counteract the effects of OS scheduling (threads being taken off their cores for a short while, etc.),
  6. a warm-up run is done to counteract the effects of dynamic frequency scaling (a kernel feature which can alternatively be turned off by using the userspace governor).

Results of code

On my machine, using AVX2 instructions (the highest vector instruction set available), I am achieving a maximum bandwidth of 5.6 GB/s.

EDIT

Following @Peter Cordes' comment, I adapted my code to make sure memory page placement has taken place. Now my measured BW is 90 GB/s. Any explanation as to why it is so high?
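
For context, a hypothetical sketch of what "making sure memory page placement has taken place" could look like, namely touching one element per page right after allocation and before the timed loop (the actual adaptation is not shown here):

    // touch one double per memory page so that physical pages are mapped
    // before the timed store loop; page faults then no longer pollute the timing
    const auto doubles_per_page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE)) / sizeof(double);
    for (auto index = std::size_t{}; index < size; index += doubles_per_page)
    {
        data[index] = -1.0;
    }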
