I found a comment from crossbeam.
Starting from Intel's Sandy Bridge, spatial prefetcher is now pulling pairs of 64-byte cache lines at a time, so we have to align to 128 bytes rather than 64.
Sources:
I didn't find the line in Intel's manual saying that. But until the latest commit, folly still uses 128 bytes padding, which makes it convincing to me. So I started to write code to see if I can observe this behavior. Here is my code.
#include <thread>
int counter[1024]{};
void update(int idx) {
    for (int j = 0; j < 100000000; j++) ++counter[idx];
}
int main() {
    std::thread t1(update, 0);
    std::thread t2(update, 1);
    std::thread t3(update, 2);
    std::thread t4(update, 3);
    t1.join();
    t2.join();
    t3.join();
    t4.join();
}
My CPU is Ryzen 3700X. When the indexes are 0, 1, 2, 3, it takes ~1.2s to finish. When the indexes are 0, 16, 32, 48, it takes ~200ms to finish. When the indexes are 0, 32, 64, 96, it takes ~200ms to finish, which is exactly the same as before. I also tested them on an Intel machine and it gave me a similar result.
From this micro bench I can't see the reason to use 128 bytes padding over 64 bytes padding. Did I get anything wrong?
 
                        
Intel's optimization manual does describe the L2 spatial prefetcher in SnB-family CPUs. Yes, it tries to complete 128B-aligned pairs of 64B lines, when there's spare memory bandwidth (off-core request tracking slots) when the first line is getting pulled in.
Your microbenchmark doesn't show any significant time difference between 64 vs. 128 byte separation. Without any actual false sharing (within the same 64 byte line), after some initial chaos, it quickly reaches a state where each core has exclusive ownership of the cache line it's modifying. That means no further L1d misses, thus no requests to L2 which would trigger the L2 spatial prefetcher.
Unlike if for example two pairs of threads contending over separate
atomic<int>variables in adjacent (or not) cache lines. Or false-sharing with them. Then L2 spatial prefetching could couple the contention together, so all 4 threads are contending with each other instead of 2 independent pairs. Basically any case where cache lines actually are bouncing back and forth between cores, L2 spatial prefetching can make it worse if you aren't careful.(The L2 prefetcher doesn't keep trying indefinitely to complete pairs of lines of every valid line it's caching; that would hurt cases like this where different cores are repeatedly touching adjacent lines, more than it helps anything.)
Understanding std::hardware_destructive_interference_size and std::hardware_constructive_interference_size includes an answer with a longer benchmark; I haven't looked at it recently, but I think it's supposed to demo destructive interference at 64 bytes but not 128. The answer there unfortunately doesn't mention L2 spatial prefetching as one of the effects that can cause some destructive interference (although not as much as a 128-byte line size in an outer level of cache, especially not if it's an inclusive cache).
Perf counters reveal a difference even with your benchmark
There is more initial chaos that we can measure with performance counters for your benchmark. On my i7-6700k (quad core Skylake with Hyperthreading; 4c8t, running Linux 5.16), I improved the source so I could compile with optimization without defeating the memory access, and with a CPP macro so I could set the stride (in bytes) from the compiler command line. Note the ~500 memory-order mis-speculation pipeline nukes (
machine_clears.memory_ordering) when we use adjacent lines. Actual number is quite variable, from 200 to 850, but there's still negligible effect on the overall time.Adjacent lines, 500 +- 300 machine clears
vs with 128-byte separation, only a very few machine clears
Presumably there's some dependence on how Linux schedules threads to logical cores on this 4c8t machine. Related:
vs. actual false-sharing within one line: 10M machine clears
The store buffer (and store forwarding) get a bunch of increments done for every false-sharing machine clear so it's not as bad as one might expect. (And would be far worse with atomic RMWs, like
std::atomic<int>fetch_add, since there every single increment needs direct access to L1d cache as it executes.) Why does false sharing still affect non atomics, but much less than atomics?Improved benchmark- align the array, and volatile to allow optimization
I used
volatileso I could enable optimization. I assume you compiled with optimization disabled, soint jwas also getting stored/reloaded inside the loop.And I used
alignas(128) counter[]so we'd be sure the start of the array was in two pairs of 128-byte lines, not spread across three.