I'm trying to write a non-cache-polluting memcpy (using PREFETCHNTA for reads and streaming writes), and first I'm doing some artificial benchmarking to determine what prefetch distances work well. I've seen comments on other questions explaining that the prefetch distance is tricky: it needs to be big enough to prefetch ahead of the hardware prefetcher, but not so big that you end up evicting data you haven't used yet.
On my laptop (i7-9750H - Coffee Lake) I can clearly see improvements with prefetching, but I'm struggling to get any improvements on a server with a 7313P processor (Zen 3). AMD's software optimization guide says "Lines filled into the L2 cache with PREFETCHNTA are marked for quicker eviction from the L2, and when evicted from the L2 are not inserted into the L3." So I'm not sure what I'm missing.
My benchmark code is at the bottom, but I'll summarise what it does. It alternates between two operations:
- Pointer chasing over a 64 KiB buffer, accessing each cache line exactly once. The order of accesses is randomised to defeat the hardware prefetcher. This buffer models the useful working set of a program, and the time taken for this step indicates how much of the buffer is in cache (and at which levels). Ideally it should all stay resident in L2.
- A memory copy between two 2MiB buffers, with (optional) prefetch ahead (PREFETCHNTA) and streaming writes (MOVNTDQ). The goal is to prevent this copy from evicting the working set from cache.
Here are results for the Zen 3 system. Columns are: prefetch distance in bytes (0 = copy with no prefetch, -1 = skip the memory copy), working set size in bytes, copy size in bytes, average clock cycles per access in the pointer chase.
-1 65536 2097152 12.1564
0 65536 2097152 29.1469
64 65536 2097152 29.0625
128 65536 2097152 29.0669
192 65536 2097152 29.175
256 65536 2097152 29.0757
320 65536 2097152 29.2022
384 65536 2097152 33.0847
448 65536 2097152 37.0652
512 65536 2097152 38.932
576 65536 2097152 36.0577
640 65536 2097152 34.6365
704 65536 2097152 33.7632
768 65536 2097152 33.3275
832 65536 2097152 32.9455
896 65536 2097152 32.9707
960 65536 2097152 33.6334
1024 65536 2097152 33.4342
1088 65536 2097152 34.389
1152 65536 2097152 37.1528
1216 65536 2097152 38.5614
1280 65536 2097152 37.4452
1344 65536 2097152 36.1937
1408 65536 2097152 36.0173
1472 65536 2097152 36.8326
1536 65536 2097152 34.5126
1600 65536 2097152 32.7
1664 65536 2097152 31.5056
1728 65536 2097152 31.9708
1792 65536 2097152 30.8124
1856 65536 2097152 31.0403
1920 65536 2097152 31.7575
1984 65536 2097152 31.2979
2048 65536 2097152 30.6439
2112 65536 2097152 30.1825
2176 65536 2097152 29.1199
2240 65536 2097152 29.3742
2304 65536 2097152 29.2532
2368 65536 2097152 29.2181
2432 65536 2097152 29.1223
2496 65536 2097152 29.267
2560 65536 2097152 30.2443
2624 65536 2097152 30.6144
2688 65536 2097152 30.8426
2752 65536 2097152 29.9651
2816 65536 2097152 29.5222
2880 65536 2097152 29.1229
2944 65536 2097152 29.1275
3008 65536 2097152 29.1352
3072 65536 2097152 29.1416
3136 65536 2097152 29.2427
3200 65536 2097152 29.1352
3264 65536 2097152 29.1354
3328 65536 2097152 29.1615
3392 65536 2097152 29.1375
3456 65536 2097152 29.1363
3520 65536 2097152 29.1498
3584 65536 2097152 29.1539
3648 65536 2097152 29.2143
3712 65536 2097152 29.5351
3776 65536 2097152 34.9945
3840 65536 2097152 38.906
3904 65536 2097152 40.2393
3968 65536 2097152 39.7453
4032 65536 2097152 41.3991
4096 65536 2097152 36.469
6144 65536 2097152 33.8587
8192 65536 2097152 39.1409
10240 65536 2097152 36.5648
12288 65536 2097152 41.2544
14336 65536 2097152 39.0771
16384 65536 2097152 40.9359
18432 65536 2097152 35.6414
20480 65536 2097152 39.0697
22528 65536 2097152 30.1477
24576 65536 2097152 35.6944
26624 65536 2097152 29.572
28672 65536 2097152 32.1981
30720 65536 2097152 30.2285
So with no interfering memory copy, latency is 12 cycles, which is exactly what one would expect for L2. With the no-prefetch memory copy, some of the working-set buffer has been evicted (but probably not all of it, since the average of ~29 cycles is still below the documented L3 latency of 46 cycles). Using prefetch doesn't seem to help at any prefetch distance, and with larger distances it makes things worse.
In contrast, here are the results from my laptop:
0 65536 2097152 23.6615
64 65536 2097152 23.7485
128 65536 2097152 23.7565
192 65536 2097152 23.7445
256 65536 2097152 23.7279
320 65536 2097152 23.7211
384 65536 2097152 23.7471
448 65536 2097152 23.6438
512 65536 2097152 23.3186
576 65536 2097152 23.519
640 65536 2097152 23.1771
704 65536 2097152 22.6914
768 65536 2097152 21.7402
832 65536 2097152 19.4424
896 65536 2097152 15.0594
960 65536 2097152 12.2094
1024 65536 2097152 12.1211
1088 65536 2097152 12.1419
1152 65536 2097152 12.3308
1216 65536 2097152 12.2499
1280 65536 2097152 12.1953
1344 65536 2097152 12.2788
1408 65536 2097152 12.23
1472 65536 2097152 12.1945
1536 65536 2097152 14.6288
1600 65536 2097152 13.376
1664 65536 2097152 12.68
1728 65536 2097152 12.3322
1792 65536 2097152 12.2406
1856 65536 2097152 12.1883
1920 65536 2097152 12.1939
1984 65536 2097152 12.31
2048 65536 2097152 12.2258
2112 65536 2097152 12.1902
2176 65536 2097152 12.1935
2240 65536 2097152 12.2089
2304 65536 2097152 12.1706
2368 65536 2097152 12.4522
2432 65536 2097152 12.2515
2496 65536 2097152 12.2945
2560 65536 2097152 12.2528
2624 65536 2097152 12.2355
2688 65536 2097152 12.1507
2752 65536 2097152 12.2564
2816 65536 2097152 12.2669
2880 65536 2097152 12.1732
2944 65536 2097152 12.2326
3008 65536 2097152 12.2278
3072 65536 2097152 12.1361
3136 65536 2097152 12.4784
3200 65536 2097152 12.1725
3264 65536 2097152 12.3304
3328 65536 2097152 12.2806
3392 65536 2097152 12.2576
3456 65536 2097152 12.1469
3520 65536 2097152 12.5054
3584 65536 2097152 12.2468
3648 65536 2097152 12.3043
3712 65536 2097152 12.6612
3776 65536 2097152 12.2839
3840 65536 2097152 12.4222
3904 65536 2097152 12.1416
3968 65536 2097152 12.1885
4032 65536 2097152 12.235
4096 65536 2097152 12.5117
6144 65536 2097152 12.3215
8192 65536 2097152 12.2702
10240 65536 2097152 12.2117
12288 65536 2097152 12.2067
14336 65536 2097152 12.4986
16384 65536 2097152 12.4072
18432 65536 2097152 23.7586
20480 65536 2097152 23.7468
22528 65536 2097152 23.7671
24576 65536 2097152 23.7554
26624 65536 2097152 23.8179
28672 65536 2097152 23.7606
30720 65536 2097152 23.7888
That nicely illustrates that prefetch distances between about 1024 bytes and 16384 bytes avoid evicting the working set from L2.
I've taken a number of steps to improve accuracy:
- Linux cpufreq governor set to performance
- On the Zen 3 system: C states limited to C1, DF C-states disabled, APB disabled, SMT disabled
- Boost disabled
- Code pinned to a single CPU core (see the sketch after this list)
- The buffers are all allocated in huge pages to ensure alignment and reduce TLB misses (and the copy buffers have extra padding so that prefetches from beyond the end of the copy also prefetch from huge pages)
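(The pinning can be done with taskset, or from inside the program; here is a minimal sketch of the in-program approach. It is not part of the benchmark listing below, and the core number is just a placeholder.)

#include <sched.h>   // sched_setaffinity, cpu_set_t (glibc; g++ defines _GNU_SOURCE)
#include <cstdio>

// Pin the calling thread to one core so every measurement runs on the same
// physical core.
static void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        std::perror("sched_setaffinity");
}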
Has anyone had success with using PREFETCHNTA to reduce cache pollution on Zen 3? Is there something I'm missing in my code?
Code (compiled with g++ -std=c++17 -o mem_prefetch mem_prefetch.cpp -Wall -g -O3 on Ubuntu 22.04):
#include <iostream>
#include <vector>
#include <limits>
#include <cmath>
#include <cstdint>
#include <cstddef>
#include <cstring>
#include <cstdlib>
#include <new>
#include <cassert>
#include <random>
#include <algorithm>
#include <emmintrin.h>
#include <immintrin.h>
#include <x86intrin.h>
#include <sys/mman.h>
#include <unistd.h>
static constexpr unsigned int cache_line_size = 64;
template<typename T>
class huge_allocator
{
public:
using value_type = T;
huge_allocator() = default;
template<typename T2>
constexpr huge_allocator(const huge_allocator<T2> &) noexcept {}
T *allocate(std::size_t n)
{
if (n >= std::numeric_limits<std::size_t>::max() / sizeof(T))
throw std::bad_array_new_length();
void *ptr = mmap(
nullptr, n * sizeof(T),
PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_POPULATE,
-1, 0);
if (ptr == MAP_FAILED)
throw std::bad_alloc();
return static_cast<T *>(ptr);
}
void deallocate(T *p, std::size_t n) noexcept
{
munmap(p, n * sizeof(T));
}
};
template<typename T1, typename T2>
bool operator==(const huge_allocator<T1> &, const huge_allocator<T2> &) { return true; }
template<typename T1, typename T2>
bool operator!=(const huge_allocator<T1> &, const huge_allocator<T2> &) { return false; }
// memcpy, with SSE2 streaming stores (assumes 16-byte aligned data and 64-byte aligned size for simplicity)
static void *memcpy_stream_sse2(void * __restrict__ dest, const void * __restrict__ src, std::size_t n, std::size_t prefetch_dist) noexcept
{
assert(std::uintptr_t(dest) % 16 == 0);
assert(std::uintptr_t(src) % 16 == 0);
assert(n % 64 == 0);
char * __restrict__ dest_c = (char *) dest;
const char * __restrict__ src_c = (const char *) src;
std::size_t offset;
for (offset = 0; offset < n; offset += 64)
{
if (prefetch_dist > 0)
_mm_prefetch(src_c + offset + prefetch_dist, _MM_HINT_NTA);
__m128i value0 = _mm_loadu_si128((__m128i const *) (src_c + offset + 0));
__m128i value1 = _mm_loadu_si128((__m128i const *) (src_c + offset + 16));
__m128i value2 = _mm_loadu_si128((__m128i const *) (src_c + offset + 32));
__m128i value3 = _mm_loadu_si128((__m128i const *) (src_c + offset + 48));
_mm_stream_si128((__m128i *) (dest_c + offset + 0), value0);
_mm_stream_si128((__m128i *) (dest_c + offset + 16), value1);
_mm_stream_si128((__m128i *) (dest_c + offset + 32), value2);
_mm_stream_si128((__m128i *) (dest_c + offset + 48), value3);
}
_mm_sfence();
return dest;
}
template<typename A>
static void chase(const std::vector<void *, A> &table) noexcept
{
void *p = table[0];
void *q = p;
std::size_t steps = 0;
constexpr int unroll = 8;
const std::size_t lines = table.size() / (cache_line_size / sizeof(void *));
assert(lines % unroll == 0);
do {
for (int i = 0; i < unroll; i++)
p = *reinterpret_cast<void **>(p);
steps += unroll;
} while (p != q);
[[maybe_unused]] volatile std::size_t sink = steps; // prevent compiler optimising it away
assert(steps == lines);
}
static std::vector<void *, huge_allocator<void *>> build_chase(std::size_t nbytes)
{
assert(nbytes % cache_line_size == 0);
std::size_t n = nbytes / cache_line_size;
static constexpr std::size_t scale = cache_line_size / sizeof(void *);
std::vector<void *, huge_allocator<void *>> table(scale * n);
std::mt19937_64 engine;
std::vector<std::size_t> perm(n);
for (std::size_t i = 0; i < n; i++)
perm[i] = i;
std::shuffle(perm.begin(), perm.end(), engine);
for (std::size_t i = 0; i + 1 < n; i++)
table[perm[i] * scale] = &table[perm[i + 1] * scale];
table[perm.back() * scale] = &table[perm[0] * scale];
return table;
}
static inline void serialize()
{
asm __volatile__("lfence" ::: "memory");
}
static inline std::int64_t time_start()
{
unsigned int dummy;
serialize();
auto tsc = __rdtscp(&dummy);
serialize();
return tsc;
}
static inline std::int64_t time_stop()
{
unsigned int dummy;
serialize();
auto tsc = __rdtscp(&dummy);
serialize();
return tsc;
}
int main(int argc, char **argv)
{
int chase_size = 65536;
int copy_size = 2 * 1024 * 1024;
int passes = 100;
int opt;
while ((opt = getopt(argc, argv, "c:m:p:")) != -1)
{
switch (opt)
{
case 'c':
chase_size = atoi(optarg);
break;
case 'm':
copy_size = atoi(optarg);
break;
case 'p':
passes = atoi(optarg);
break;
default:
std::cerr << "Usage: " << argv[0] << " [-c chase-size] [-m copy-size] [-p passes]\n";
std::exit(2);
}
}
constexpr int padding = 65536; // space that prefetches will target
std::vector<std::byte, huge_allocator<std::byte>> copy_src(copy_size + padding);
std::vector<std::byte, huge_allocator<std::byte>> copy_dest(copy_size + padding);
auto chase_table = build_chase(chase_size);
auto measure = [&](int prefetch_dist)
{
// Warmup
chase(chase_table);
if (prefetch_dist >= 0)
memcpy_stream_sse2(copy_dest.data(), copy_src.data(), copy_size, prefetch_dist);
std::uint64_t cycles = 0;
for (int i = 0; i < passes; i++)
{
auto start = time_start();
chase(chase_table);
auto stop = time_stop();
cycles += stop - start;
if (prefetch_dist >= 0)
memcpy_stream_sse2(copy_dest.data(), copy_src.data(), copy_size, prefetch_dist);
}
std::size_t chase_lines = chase_size / cache_line_size;
std::cout << prefetch_dist << "\t" << chase_size << "\t" << copy_size << "\t" << double(cycles) / passes / chase_lines << '\n';
};
measure(-1);
for (int prefetch_dist = 0; prefetch_dist < 4096; prefetch_dist += cache_line_size)
measure(prefetch_dist);
for (int prefetch_dist = 4096; prefetch_dist < 32768; prefetch_dist += 32 * cache_line_size)
measure(prefetch_dist);
}
The benchmark methodology looks pretty reasonable to me. For allocating hugepages, I've normally used madvise to hint that it should use transparent hugepages (with the sysctl tuning set to defrag on madvise), since MAP_HUGETLB requires some specific setup and permissions to use. But for a benchmark, MAP_HUGETLB is a good way of forcing it. I assume you've tried perf stat to check that TLB misses were low; in any case, TLB effects couldn't explain these results: a few misses during the pointer-chase phase wouldn't make it that slow, and the no-copy test confirms that it can be fast on Zen 3.

So your testing seems to indicate that Zen 3 isn't doing much of anything with the non-temporal hint, or at least not for avoiding L2 pollution. It might be avoiding or minimizing L1d or L3 pollution; we don't know.
It's unlikely that the NT stores are a problem; you could try commenting out the loads, making it basically a memset of zeros. And similarly comment out the stores and just do prefetch + loads. (Perhaps AND them together with volatile __m128i sink = _mm_and_si128(v1, v2); to stop them optimizing away — see the sketch below.) And maybe try just the prefetches without the loads, perhaps with some ALU delay in there, since prefetches might get dropped when there are no free load buffers or off-core-request buffers. (Something like x += 1.123 to limit it to one prefetch per 4 cycles of FP ALU latency.)
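Something like this for the loads-only version (untested sketch, adapted from your memcpy_stream_sse2; the function name is just for illustration):

// Loads-only experiment: same prefetch + loads as memcpy_stream_sse2, but the
// NT stores are replaced by ANDing everything into an accumulator that is
// finally written to a volatile sink, so the compiler can't drop the loads.
static void load_only_sse2(const void * __restrict__ src, std::size_t n, std::size_t prefetch_dist) noexcept
{
    const char * __restrict__ src_c = (const char *) src;
    __m128i acc = _mm_set1_epi32(-1);   // all-ones, so the AND chain keeps every load live
    for (std::size_t offset = 0; offset < n; offset += 64)
    {
        if (prefetch_dist > 0)
            _mm_prefetch(src_c + offset + prefetch_dist, _MM_HINT_NTA);
        __m128i value0 = _mm_loadu_si128((__m128i const *) (src_c + offset + 0));
        __m128i value1 = _mm_loadu_si128((__m128i const *) (src_c + offset + 16));
        __m128i value2 = _mm_loadu_si128((__m128i const *) (src_c + offset + 32));
        __m128i value3 = _mm_loadu_si128((__m128i const *) (src_c + offset + 48));
        acc = _mm_and_si128(acc, _mm_and_si128(_mm_and_si128(value0, value1),
                                               _mm_and_si128(value2, value3)));
    }
    [[maybe_unused]] volatile __m128i sink = acc;   // keep the AND chain (and thus the loads) alive
}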
Does AMD's optimization manual say anything about PREFETCHNTA? Intel's says something about which levels of cache it affects, and how, on different microarchitectures: e.g. on Xeons with non-inclusive L3 it bypasses L3 entirely, vs. limiting itself to one "way" of each set of L3 on CPUs with inclusive L3, and it always bypasses L2. IIRC, Intel P6/SnB-family doesn't limit how PREFETCHNTA can fill L1d, since tuning the prefetch distance is already hard enough.
It's totally plausible that on AMD, PREFETCHNTA bypasses L3 but not L2. So that would be something to test with bigger pointer-chasing sets. That would be my guess based on this test result, if it does anything with the hint at all, but I haven't checked any documentation.
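Your -c option makes that straightforward to try: for example -c 4194304 gives a 4 MiB working set, far bigger than Zen 3's 512 KiB per-core L2 but comfortably inside L3, so the no-copy baseline should sit at roughly L3 latency. If PREFETCHNTA really does keep the copied data out of L3 (as the AMD quote in the question suggests for lines evicted from L2), the copy shouldn't evict that working set, and you'd expect to see a benefit from prefetching at that size even though you don't at 64 KiB.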