Not getting any cache-pollution benefit from PREFETCHNTA on Zen 3


I'm trying to write a non-cache-polluting memcpy (using PREFETCHNTA for reads and streaming writes) and first doing some artificial benchmarking to determine what prefetch distances work well. I've seen comments on other questions explaining that the prefetch distance is tricky: it needs to be big enough to stay ahead of the hardware prefetcher, but not so big that you end up evicting data you haven't used yet.

On my laptop (i7-9750H - Coffee Lake) I can clearly see improvements with prefetching, but I'm struggling to get any improvements on a server with a 7313P processor (Zen 3). AMD's software optimization guide says "Lines filled into the L2 cache with PREFETCHNTA are marked for quicker eviction from the L2, and when evicted from the L2 are not inserted into the L3." So I'm not sure what I'm missing.

My benchmark code is at the bottom, but I'll summarise what it does. It alternates between two operations:

  1. Pointer chasing over a 64 KiB buffer, accessing each cache line exactly once. The order of accesses is randomised to defeat the hardware prefetcher. This buffer models the useful working set of a program, and the time taken for this step indicates how much of the buffer is in cache (and at which levels). Ideally it should all stay resident in L2.
  2. A memory copy between two 2MiB buffers, with (optional) prefetch ahead (PREFETCHNTA) and streaming writes (MOVNTDQ). The goal is to prevent this copy from evicting the working set from cache.

Here are results for the Zen 3 system. The columns are: prefetch distance in bytes (0 = no prefetch, -1 = skip the memory copy entirely), working-set size in bytes, copy size in bytes, and average clock cycles per access in the pointer chase.

-1  65536   2097152 12.1564
0   65536   2097152 29.1469
64  65536   2097152 29.0625
128 65536   2097152 29.0669
192 65536   2097152 29.175
256 65536   2097152 29.0757
320 65536   2097152 29.2022
384 65536   2097152 33.0847
448 65536   2097152 37.0652
512 65536   2097152 38.932
576 65536   2097152 36.0577
640 65536   2097152 34.6365
704 65536   2097152 33.7632
768 65536   2097152 33.3275
832 65536   2097152 32.9455
896 65536   2097152 32.9707
960 65536   2097152 33.6334
1024    65536   2097152 33.4342
1088    65536   2097152 34.389
1152    65536   2097152 37.1528
1216    65536   2097152 38.5614
1280    65536   2097152 37.4452
1344    65536   2097152 36.1937
1408    65536   2097152 36.0173
1472    65536   2097152 36.8326
1536    65536   2097152 34.5126
1600    65536   2097152 32.7
1664    65536   2097152 31.5056
1728    65536   2097152 31.9708
1792    65536   2097152 30.8124
1856    65536   2097152 31.0403
1920    65536   2097152 31.7575
1984    65536   2097152 31.2979
2048    65536   2097152 30.6439
2112    65536   2097152 30.1825
2176    65536   2097152 29.1199
2240    65536   2097152 29.3742
2304    65536   2097152 29.2532
2368    65536   2097152 29.2181
2432    65536   2097152 29.1223
2496    65536   2097152 29.267
2560    65536   2097152 30.2443
2624    65536   2097152 30.6144
2688    65536   2097152 30.8426
2752    65536   2097152 29.9651
2816    65536   2097152 29.5222
2880    65536   2097152 29.1229
2944    65536   2097152 29.1275
3008    65536   2097152 29.1352
3072    65536   2097152 29.1416
3136    65536   2097152 29.2427
3200    65536   2097152 29.1352
3264    65536   2097152 29.1354
3328    65536   2097152 29.1615
3392    65536   2097152 29.1375
3456    65536   2097152 29.1363
3520    65536   2097152 29.1498
3584    65536   2097152 29.1539
3648    65536   2097152 29.2143
3712    65536   2097152 29.5351
3776    65536   2097152 34.9945
3840    65536   2097152 38.906
3904    65536   2097152 40.2393
3968    65536   2097152 39.7453
4032    65536   2097152 41.3991
4096    65536   2097152 36.469
6144    65536   2097152 33.8587
8192    65536   2097152 39.1409
10240   65536   2097152 36.5648
12288   65536   2097152 41.2544
14336   65536   2097152 39.0771
16384   65536   2097152 40.9359
18432   65536   2097152 35.6414
20480   65536   2097152 39.0697
22528   65536   2097152 30.1477
24576   65536   2097152 35.6944
26624   65536   2097152 29.572
28672   65536   2097152 32.1981
30720   65536   2097152 30.2285

So with no interfering memory copy, latency is 12 cycles, which is exactly what one would expect for L2. With the no-prefetch memory copy, some of the working set buffer has been evicted (but probably not all, since the average L3 latency is documented as 46 cycles). Using prefetch doesn't seem to help for any prefetch distance, and with larger distances it makes things worse.

In contrast, here are the results from my laptop:

0   65536   2097152 23.6615
64  65536   2097152 23.7485
128 65536   2097152 23.7565
192 65536   2097152 23.7445
256 65536   2097152 23.7279
320 65536   2097152 23.7211
384 65536   2097152 23.7471
448 65536   2097152 23.6438
512 65536   2097152 23.3186
576 65536   2097152 23.519
640 65536   2097152 23.1771
704 65536   2097152 22.6914
768 65536   2097152 21.7402
832 65536   2097152 19.4424
896 65536   2097152 15.0594
960 65536   2097152 12.2094
1024    65536   2097152 12.1211
1088    65536   2097152 12.1419
1152    65536   2097152 12.3308
1216    65536   2097152 12.2499
1280    65536   2097152 12.1953
1344    65536   2097152 12.2788
1408    65536   2097152 12.23
1472    65536   2097152 12.1945
1536    65536   2097152 14.6288
1600    65536   2097152 13.376
1664    65536   2097152 12.68
1728    65536   2097152 12.3322
1792    65536   2097152 12.2406
1856    65536   2097152 12.1883
1920    65536   2097152 12.1939
1984    65536   2097152 12.31
2048    65536   2097152 12.2258
2112    65536   2097152 12.1902
2176    65536   2097152 12.1935
2240    65536   2097152 12.2089
2304    65536   2097152 12.1706
2368    65536   2097152 12.4522
2432    65536   2097152 12.2515
2496    65536   2097152 12.2945
2560    65536   2097152 12.2528
2624    65536   2097152 12.2355
2688    65536   2097152 12.1507
2752    65536   2097152 12.2564
2816    65536   2097152 12.2669
2880    65536   2097152 12.1732
2944    65536   2097152 12.2326
3008    65536   2097152 12.2278
3072    65536   2097152 12.1361
3136    65536   2097152 12.4784
3200    65536   2097152 12.1725
3264    65536   2097152 12.3304
3328    65536   2097152 12.2806
3392    65536   2097152 12.2576
3456    65536   2097152 12.1469
3520    65536   2097152 12.5054
3584    65536   2097152 12.2468
3648    65536   2097152 12.3043
3712    65536   2097152 12.6612
3776    65536   2097152 12.2839
3840    65536   2097152 12.4222
3904    65536   2097152 12.1416
3968    65536   2097152 12.1885
4032    65536   2097152 12.235
4096    65536   2097152 12.5117
6144    65536   2097152 12.3215
8192    65536   2097152 12.2702
10240   65536   2097152 12.2117
12288   65536   2097152 12.2067
14336   65536   2097152 12.4986
16384   65536   2097152 12.4072
18432   65536   2097152 23.7586
20480   65536   2097152 23.7468
22528   65536   2097152 23.7671
24576   65536   2097152 23.7554
26624   65536   2097152 23.8179
28672   65536   2097152 23.7606
30720   65536   2097152 23.7888

That nicely illustrates that prefetch distances between about 1024 bytes and 16384 bytes avoid evicting the working set from L2.


I've taken a number of steps to improve accuracy:

  • Linux cpufreq governor set to performance
  • On the Zen 3 system: C states limited to C1, DF C-states disabled, APB disabled, SMT disabled
  • Boost disabled
  • Code pinned to a single CPU core (see the sketch just after this list)
  • The buffers are all allocated in huge pages to ensure alignment and reduce TLB misses (and the copy buffers have extra padding so that prefetches from beyond the end of the copy also prefetch from huge pages)
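
The pinning isn't shown in the listing below. A minimal sketch of doing it in-process with sched_setaffinity (the core number is just an example; running the whole program under taskset achieves the same thing):

#include <sched.h>    // sched_setaffinity is a GNU extension; g++ defines _GNU_SOURCE by default
#include <cstdio>
#include <cstdlib>

// Pin the calling thread to a single core so the working set stays in one core's
// private L1/L2 for the whole run.
static void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)   // pid 0 = calling thread
    {
        std::perror("sched_setaffinity");
        std::exit(1);
    }
}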

Has anyone had success with using PREFETCHNTA to reduce cache pollution on Zen 3? Is there something I'm missing in my code?

Code (compiled with g++ -std=c++17 -o mem_prefetch mem_prefetch.cpp -Wall -g -O3 on Ubuntu 22.04):

#include <iostream>
#include <vector>
#include <limits>
#include <cmath>
#include <cstdint>
#include <cstddef>
#include <cstring>
#include <cassert>
#include <cstdlib>    // atoi, std::exit
#include <new>        // std::bad_alloc, std::bad_array_new_length
#include <random>
#include <algorithm>
#include <emmintrin.h>
#include <immintrin.h>
#include <x86intrin.h>
#include <sys/mman.h>
#include <unistd.h>

static constexpr unsigned int cache_line_size = 64;

template<typename T>
class huge_allocator
{
public:
    using value_type = T;

    huge_allocator() = default;

    template<typename T2>
    constexpr huge_allocator(const huge_allocator<T2> &) noexcept {}

    T *allocate(std::size_t n)
    {
        if (n >= std::numeric_limits<std::size_t>::max() / sizeof(T))
            throw std::bad_array_new_length();
        void *ptr = mmap(
            nullptr, n * sizeof(T),
            PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_POPULATE,
            -1, 0);
        if (ptr == MAP_FAILED)
            throw std::bad_alloc();
        return static_cast<T *>(ptr);
    }

    void deallocate(T *p, std::size_t n) noexcept
    {
        munmap(p, n * sizeof(T));   // length is in bytes, not elements
    }
};

template<typename T1, typename T2>
bool operator==(const huge_allocator<T1> &, const huge_allocator<T2> &) { return true; }

template<typename T1, typename T2>
bool operator!=(const huge_allocator<T1> &, const huge_allocator<T2> &) { return false; }

// memcpy, with SSE2 streaming stores (assumes 16-byte aligned data and 64-byte aligned size for simplicity)
static void *memcpy_stream_sse2(void * __restrict__ dest, const void * __restrict__ src, std::size_t n, std::size_t prefetch_dist) noexcept
{
    assert(std::uintptr_t(dest) % 16 == 0);
    assert(std::uintptr_t(src) % 16 == 0);
    assert(n % 64 == 0);
    char * __restrict__ dest_c = (char *) dest;
    const char * __restrict__ src_c = (const char *) src;
    std::size_t offset;
    for (offset = 0; offset < n; offset += 64)
    {
        if (prefetch_dist > 0)
            _mm_prefetch(src_c + offset + prefetch_dist, _MM_HINT_NTA);
        __m128i value0 = _mm_loadu_si128((__m128i const *) (src_c + offset + 0));
        __m128i value1 = _mm_loadu_si128((__m128i const *) (src_c + offset + 16));
        __m128i value2 = _mm_loadu_si128((__m128i const *) (src_c + offset + 32));
        __m128i value3 = _mm_loadu_si128((__m128i const *) (src_c + offset + 48));
        _mm_stream_si128((__m128i *) (dest_c + offset + 0), value0);
        _mm_stream_si128((__m128i *) (dest_c + offset + 16), value1);
        _mm_stream_si128((__m128i *) (dest_c + offset + 32), value2);
        _mm_stream_si128((__m128i *) (dest_c + offset + 48), value3);
    }
    _mm_sfence();
    return dest;
}

// Follow the pointer chain once around its full cycle, touching each cache line
// of the table exactly once (in the randomised order set up by build_chase).
template<typename A>
static void chase(const std::vector<void *, A> &table) noexcept
{
    void *p = table[0];
    void *q = p;
    std::size_t steps = 0;
    constexpr int unroll = 8;
    const std::size_t lines = table.size() / (cache_line_size / sizeof(void *));
    assert(lines % unroll == 0);
    do {
        for (int i = 0; i < unroll; i++)
            p = *reinterpret_cast<void **>(p);
        steps += unroll;
    } while (p != q);
    [[maybe_unused]] volatile std::size_t sink = steps; // prevent compiler optimising it away
    assert(steps == lines);
}

// Build a table whose cache lines form a single random cycle: the first pointer
// slot of each line points at the first pointer slot of the next line in a
// shuffled order.
static std::vector<void *, huge_allocator<void *>> build_chase(std::size_t nbytes)
{
    assert(nbytes % cache_line_size == 0);
    std::size_t n = nbytes / cache_line_size;
    static constexpr std::size_t scale = cache_line_size / sizeof(void *);
    std::vector<void *, huge_allocator<void *>> table(scale * n);

    std::mt19937_64 engine;
    std::vector<std::size_t> perm(n);
    for (std::size_t i = 0; i < n; i++)
        perm[i] = i;
    std::shuffle(perm.begin(), perm.end(), engine);
    for (std::size_t i = 0; i + 1 < n; i++)
        table[perm[i] * scale] = &table[perm[i + 1] * scale];
    table[perm.back() * scale] = &table[perm[0] * scale];
    return table;
}

static inline void serialize()
{
    asm __volatile__("lfence" ::: "memory");
}

static inline std::int64_t time_start()
{
    unsigned int dummy;
    serialize();
    auto tsc = __rdtscp(&dummy);
    serialize();
    return tsc;
}

static inline std::int64_t time_stop()
{
    unsigned int dummy;
    serialize();
    auto tsc = __rdtscp(&dummy);
    serialize();
    return tsc;
}

int main(int argc, char **argv)
{
    int chase_size = 65536;
    int copy_size = 2 * 1024 * 1024;
    int passes = 100;
    int opt;
    while ((opt = getopt(argc, argv, "c:m:p:")) != -1)
    {
        switch (opt)
        {
        case 'c':
            chase_size = atoi(optarg);
            break;
        case 'm':
            copy_size = atoi(optarg);
            break;
        case 'p':
            passes = atoi(optarg);
            break;
        default:
            std::cerr << "Usage: " << argv[0] << " [-c chase-size] [-m copy-size] [-p passes]\n";
            std::exit(2);
        }
    }

    constexpr int padding = 65536;  // space that prefetches will target
    std::vector<std::byte, huge_allocator<std::byte>> copy_src(copy_size + padding);
    std::vector<std::byte, huge_allocator<std::byte>> copy_dest(copy_size + padding);
    auto chase_table = build_chase(chase_size);

    auto measure = [&](int prefetch_dist)
    {
        // Warmup
        chase(chase_table);
        if (prefetch_dist >= 0)
            memcpy_stream_sse2(copy_dest.data(), copy_src.data(), copy_size, prefetch_dist);
        std::uint64_t cycles = 0;
        for (int i = 0; i < passes; i++)
        {
            auto start = time_start();
            chase(chase_table);
            auto stop = time_stop();
            cycles += stop - start;
            if (prefetch_dist >= 0)
                memcpy_stream_sse2(copy_dest.data(), copy_src.data(), copy_size, prefetch_dist);
        }
        std::size_t chase_lines = chase_size / cache_line_size;
        std::cout << prefetch_dist << "\t" << chase_size << "\t" << copy_size << "\t" << double(cycles) / passes / chase_lines << '\n';
    };

    measure(-1);
    for (int prefetch_dist = 0; prefetch_dist < 4096; prefetch_dist += cache_line_size)
        measure(prefetch_dist);
    for (int prefetch_dist = 4096; prefetch_dist < 32768; prefetch_dist += 32 * cache_line_size)
        measure(prefetch_dist);
}

1 Answer

Peter Cordes

The benchmark methodology looks pretty reasonable to me. For allocating hugepages, I've normally used madvise to hint that it use transparent hugepages (with sysctl tuning set to defrag on madvise), since MAP_HUGETLB requires some specific setup and permissions to use. But for a benchmark that's a good way of forcing it. I assume you've tried perf stat to check that TLB misses were low. But TLB effects couldn't explain these results; a few misses during the pointer-chase phase wouldn't make it that slow, and the no-copy test confirms that it can be fast on Zen 3.
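
As a sketch, the THP route looks something like this (MADV_HUGEPAGE is only a hint; whether you actually get 2 MiB pages depends on the transparent_hugepage settings, and the region needs to cover a 2 MiB-aligned chunk):

// Sketch: anonymous mapping + madvise(MADV_HUGEPAGE) instead of MAP_HUGETLB.
// Needs <sys/mman.h>, which the benchmark already includes.
static void *alloc_thp(std::size_t bytes)
{
    void *ptr = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ptr == MAP_FAILED)
        return nullptr;
    madvise(ptr, bytes, MADV_HUGEPAGE);   // hint only; the mapping is still usable if it's ignored
    return ptr;
}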

So your testing seems to indicate that Zen 3 isn't doing much of anything with the non-temporal hint, or at least not for avoiding L2 pollution. It might be avoiding or minimizing L1d or L3 pollution; we don't know.

It's unlikely that the NT stores are a problem; you could try commenting out the loads, making it basically a memset of zeros. And similarly comment out the stores and just do prefetch + loads (perhaps ANDing them together into a volatile __m128i sink = _mm_and_si128(v1, v2); to stop them optimizing away). And maybe try just the prefetches without the loads, with some ALU delay in there, since prefetches might get dropped when there are no free load buffers or off-core-request buffers (perhaps x += 1.123 to limit it to one prefetch per 4 cycles, the FP ALU latency).
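
For instance, the prefetch+loads variant could be a small tweak of your copy loop; a sketch (load_only_sse2 is just an illustrative name, and it relies on the same intrinsics headers your file already includes):

// Sketch: prefetch + loads only, no stores, to see whether the loads/prefetches
// alone are enough to evict the working set.
static void load_only_sse2(const void * __restrict__ src, std::size_t n, std::size_t prefetch_dist) noexcept
{
    const char * __restrict__ src_c = (const char *) src;
    __m128i acc = _mm_set1_epi32(-1);   // all-ones: AND identity, so the loads can't be folded away
    for (std::size_t offset = 0; offset < n; offset += 64)
    {
        if (prefetch_dist > 0)
            _mm_prefetch(src_c + offset + prefetch_dist, _MM_HINT_NTA);
        __m128i value0 = _mm_loadu_si128((__m128i const *) (src_c + offset + 0));
        __m128i value1 = _mm_loadu_si128((__m128i const *) (src_c + offset + 16));
        __m128i value2 = _mm_loadu_si128((__m128i const *) (src_c + offset + 32));
        __m128i value3 = _mm_loadu_si128((__m128i const *) (src_c + offset + 48));
        acc = _mm_and_si128(acc, _mm_and_si128(_mm_and_si128(value0, value1),
                                               _mm_and_si128(value2, value3)));
    }
    [[maybe_unused]] volatile __m128i sink = acc;   // keep the loads live, as chase() does with its counter
}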

Does AMD's optimization manual say anything more about PREFETCHNTA? Intel's says a fair amount about which levels of cache it affects, and how that differs between microarchitectures: e.g. on Xeons with a non-inclusive L3 it bypasses L3 entirely, vs. being limited to one "way" of each L3 set on CPUs with an inclusive L3, and it always bypasses L2. IIRC, Intel P6/SnB-family doesn't limit which ways of L1d PREFETCHNTA can fill, since tuning the prefetch distance is already hard enough.

It's totally plausible that on AMD, PREFETCHNTA bypasses L3 but not L2. So that would be something to test with bigger pointer-chasing sets. That would be my guess based on this test result, if it does anything with the hint at all, but I haven't checked any documentation.
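
If you want to probe that with the existing program, the -c flag already covers it: e.g. ./mem_prefetch -c 4194304 gives a 4 MiB chase buffer, bigger than Zen 3's 512 KiB per-core L2 but well inside L3, so if the NT hint at least avoids L3 pollution there should be some range of prefetch distances where the chase settles around L3 latency rather than DRAM latency.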