In the following code I am XORing two arrays, result and DB; result is accessed at a rotated offset (rotate1_1 in the code below). As you can see, I am already using AVX2, loop unrolling, and prefetching, and I am wondering whether I am missing anything that might be causing the slow speed. The else branch below is taken only once per call to the function. Profiling shows that about 50 percent of the time is spent on the XOR, about 40 percent on the data store, and the remainder on loads.
void perform_array_xor(uint32_t partindex, uint32_t offset, uint64_t *result, uint32_t EntrySize, uint32_t PartSize)
{
auto B = 1;
assert(EntrySize/8==B);
// Ensure that PartSize is a multiple of 8 entries (two 32-byte AVX2 vectors) for this example
if (PartSize % 8 != 0)
{
// Handle this case
return;
}
__m256i a,b,r;
unsigned int rotate1_1;
int k;
for (int i = 0; i < PartSize; i += 8)
{
rotate1_1 = (i + offset) & (PartSize - 1);  // rotated index into result; PartSize is a power of two
_mm_prefetch(result + rotate1_1, _MM_HINT_T2);
k = 0;
if(rotate1_1 + 7 < PartSize){  // fast path: this block of 8 entries does not wrap around the end of result
a = _mm256_loadu_si256((__m256i*)(result + rotate1_1));
b = _mm256_loadu_si256((__m256i*)(DB + partindex + i));
r = _mm256_xor_si256(a, b);
_mm256_storeu_si256((__m256i*)(result + rotate1_1), r);
//std::memcpy(result + rotate1_1, &r, sizeof(__m256i));
k = 4 ;
a = _mm256_loadu_si256((__m256i*)(result + rotate1_1 + k));
b = _mm256_loadu_si256((__m256i*)(DB + partindex + i + k));
r = _mm256_xor_si256(a, b);
_mm256_storeu_si256((__m256i*)(result + rotate1_1 + k), r);
//std::memcpy(result + rotate1_1 + k, &r, sizeof(__m256i));
}
else{
result[(rotate1_1 + 0) & (PartSize - 1)] ^= DB[partindex + (i + 0)];
result[(rotate1_1 + 1) & (PartSize - 1)] ^= DB[partindex + (i + 1)];
result[(rotate1_1 + 2) & (PartSize - 1)] ^= DB[partindex + (i + 2)];
result[(rotate1_1 + 3) & (PartSize - 1)] ^= DB[partindex + (i + 3)];
result[(rotate1_1 + 4) & (PartSize - 1)] ^= DB[partindex + (i + 4)];
result[(rotate1_1 + 5) & (PartSize - 1)] ^= DB[partindex + (i + 5)];
result[(rotate1_1 + 6) & (PartSize - 1)] ^= DB[partindex + (i + 6)];
result[(rotate1_1 + 7) & (PartSize - 1)] ^= DB[partindex + (i + 7)];
}
}
}
Update: Here are more implementation details about the function
- The DB array has 2^28 entries, so 2 GB of data in total
- result is an array of 2^14 entries, so 128 KB in total
- result is the same array across function calls
- On every function call, a contiguous run of 2^14 entries of DB is accessed
- Currently, all of DB is processed in 143 ms, which is around 13 GB/s
- I am optimizing for a 9th-generation i7 and the clang compiler
This means vpxor is getting the blame because it's waiting for a load that's slow to produce a result. (And presumably it compiles to a memory-source vpxor instruction, so there's also a load built into the XOR itself.)

Normally you'd write that special case as code after the loop, instead of an if inside the main loop, even if that means you have to declare i in a scope outside the loop. Or if you can compute the right array indices for that final iteration without needing the final i from the vectorized iterations, just do that.
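To illustrate that second option, here is a minimal sketch (my restructuring, not the original code) that treats the rotation as two contiguous runs, so the hot loop has no wrap-around check at all. It assumes PartSize is a power of two, offset < PartSize, DB is the same global uint64_t array as in the question, and immintrin.h is included; xor_run_avx2 is a made-up helper name.

static inline void xor_run_avx2(uint64_t *dst, const uint64_t *src, uint32_t n)
{
    uint32_t i = 0;
    for (; i + 4 <= n; i += 4)    // 4 x uint64_t = one 256-bit vector
    {
        __m256i a = _mm256_loadu_si256((const __m256i*)(dst + i));
        __m256i b = _mm256_loadu_si256((const __m256i*)(src + i));
        _mm256_storeu_si256((__m256i*)(dst + i), _mm256_xor_si256(a, b));
    }
    for (; i < n; ++i)            // scalar tail of at most 3 elements
        dst[i] ^= src[i];
}

void perform_array_xor(uint32_t partindex, uint32_t offset, uint64_t *result, uint32_t EntrySize, uint32_t PartSize)
{
    // result[(i + offset) & (PartSize - 1)] ^= DB[partindex + i] for all i
    // is equivalent to two contiguous XOR runs, with no per-iteration branch:
    xor_run_avx2(result + offset, DB + partindex, PartSize - offset);
    xor_run_avx2(result, DB + partindex + (PartSize - offset), offset);
}

Each element of result is still touched exactly once per call, so the order of the two runs doesn't matter for correctness.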
i7-9xxx is still Skylake-derived, so 256 KiB of L2 cache. One might hope that the 128 KiB result would stay hot in L2 cache across calls with different parts, but check performance counters to find out.

Probably look at l2_lines_out.non_silent to see the rate of dirty write-back from L2, as opposed to clean evictions of non-modified data that you were only reading. Maybe also l2_rqsts.rfo_hit vs. l2_rqsts.rfo_miss, but no, that's probably always low, since a demand miss for the load side of result ^= will come first. So the store comes after a demand load on the same address, meaning you probably won't see many RFO misses (since with no other thread accessing the same line, the initial load will get MESI Exclusive ownership). That means you probably won't see counts for resource_stalls.sb (store buffer full) either.

13 GiB/s is disappointingly low for an i7-9xxx with dual-channel memory. (Unless that if inside the loop was slowing things down. But probably not: with 32-byte vectors and 2 loads + 1 store per vector op, CPU throughput is so much higher than DRAM bandwidth that there's room for some inefficiency in the loop even if the compiler doesn't do loop unswitching or peeling of the final special iteration, or whatever name applies to this optimization.) If result was staying hot in L2 cache, you'd hope to come a lot closer to max DRAM bandwidth, e.g. something over 30 GiB/s on a system with DDR4-2666, for a theoretical max of 41.6 GiB/s. (Intel "client" (non-server) CPUs typically have low enough latency to DRAM that one core can nearly saturate the memory controllers: Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?)
Perhaps worth trying cache-blocking. Do each partindex for the first 32 or 64 KiB of result, so even with the data from DB streaming through L2, it will hopefully stay hot. Then loop over the rest of result. You're still only touching each byte of DB (that you access at all) once, and with a block size that's a multiple of 4 KiB, you hopefully avoid useless hardware prefetch if DB and your parts are 4K-aligned. (Although mis-speculated loads may cross into those pages.)
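Here is a sketch of that blocking under assumptions of mine, not taken from the question: the caller loops over parts with partindex = p * PartSize and the same offset for every part; num_parts and xor_all_parts_blocked are made-up names, and xor_run_avx2 is the helper from the earlier sketch. A 64 KiB window of result is walked across all parts before moving on to the next window.

constexpr uint32_t BLOCK = (64 * 1024) / sizeof(uint64_t);    // 8192 entries = 64 KiB

void xor_all_parts_blocked(uint64_t *result, uint32_t offset, uint32_t num_parts, uint32_t PartSize)
{
    for (uint32_t i0 = 0; i0 < PartSize; i0 += BLOCK)          // window into each part
    {
        for (uint32_t p = 0; p < num_parts; ++p)
        {
            const uint64_t *src = DB + (uint64_t)p * PartSize + i0;
            uint32_t dst0  = (i0 + offset) & (PartSize - 1);   // rotated start of this window
            uint32_t first = PartSize - dst0 < BLOCK ? PartSize - dst0 : BLOCK;
            // split the window into two contiguous runs so nothing wraps mid-run
            xor_run_avx2(result + dst0, src, first);
            xor_run_avx2(result, src + first, BLOCK - first);
        }
    }
}

This assumes PartSize is a multiple of BLOCK; with the 128 KiB result from the question that's just two windows, so only 64 KiB of result has to stay resident while DB streams through.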
You could also do 2 parts in parallel, so each vector from result is XORed twice before storing again. But that means touching more data between each access to that part of result, making it more likely to be evicted before you get back to it. If you can do multiple streams while still getting L2 hits for result, it would take out some of the store/reload work. But if that's hitting in L2, it shouldn't be hurting memory parallelism for the off-core accesses much. Perhaps interesting to try, though, maybe even with 4 streams. But Skylake's L2 cache is only 4-way associative, and if all the accesses are offset from each other by a multiple of 128 KiB, they'll all alias the same set in L2, creating conflict misses. L1d is still 8-way, though, so 4 input streams are probably OK on that score.
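A sketch of the two-streams idea, again assuming both parts use the same offset and that wrap-around is handled by splitting into contiguous runs as above; xor_two_streams_avx2 is a made-up name.

static inline void xor_two_streams_avx2(uint64_t *dst, const uint64_t *srcA, const uint64_t *srcB, uint32_t n)
{
    uint32_t i = 0;
    for (; i + 4 <= n; i += 4)
    {
        // one load + one store of dst per two DB vectors folded in
        __m256i r = _mm256_loadu_si256((const __m256i*)(dst + i));
        __m256i a = _mm256_loadu_si256((const __m256i*)(srcA + i));
        __m256i b = _mm256_loadu_si256((const __m256i*)(srcB + i));
        r = _mm256_xor_si256(r, _mm256_xor_si256(a, b));
        _mm256_storeu_si256((__m256i*)(dst + i), r);
    }
    for (; i < n; ++i)
        dst[i] ^= srcA[i] ^ srcB[i];
}

Relative to two separate passes, this halves the loads and stores of result per byte of DB, at the cost of keeping two DB streams live at once.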
If you're tuning just for one particular CPU, you could also use NT prefetch to bypass L2, prefetching only into L1 (and one way of L3 since it's inclusive in Intel client CPUs). (Prefetch distance is pretty sensitive to the CPU and conditions, especially with NT prefetch. Too early and data is evicted again before you load it, which is extra bad with NT prefetch since it's not even in L2. Too late and you still get a demand load, which may pull it into L2 defeating the attempt to use NT prefetches.)
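A sketch of what that could look like for a contiguous run, with the caveat above that the prefetch distance is a made-up starting point and needs tuning on the target machine:

static inline void xor_run_avx2_ntpf(uint64_t *dst, const uint64_t *src, uint32_t n)
{
    constexpr uint32_t PF_DIST = 512;    // entries (4 KiB) ahead of the loads; tune per CPU and conditions
    uint32_t i = 0;
    for (; i + 8 <= n; i += 8)
    {
        // one NTA prefetch per 64-byte line of the streaming DB side;
        // prefetching past the end of src is harmless because prefetch never faults
        _mm_prefetch((const char*)(src + i + PF_DIST), _MM_HINT_NTA);
        __m256i a0 = _mm256_loadu_si256((const __m256i*)(dst + i));
        __m256i b0 = _mm256_loadu_si256((const __m256i*)(src + i));
        _mm256_storeu_si256((__m256i*)(dst + i), _mm256_xor_si256(a0, b0));
        __m256i a1 = _mm256_loadu_si256((const __m256i*)(dst + i + 4));
        __m256i b1 = _mm256_loadu_si256((const __m256i*)(src + i + 4));
        _mm256_storeu_si256((__m256i*)(dst + i + 4), _mm256_xor_si256(a1, b1));
    }
    for (; i < n; ++i)
        dst[i] ^= src[i];
}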
Other than NT prefetch, prefetch instructions are not typically helpful for sequential access on modern CPUs (especially Intel); hardware prefetch and out-of-order exec are enough. (How much of ‘What Every Programmer Should Know About Memory’ is still valid?) Although gcc -mtune=znver2 or similar for AMD CPUs does IIRC sometimes generate prefetch instructions when auto-vectorizing. And if your loop doesn't have much else to do, running prefetch instructions normally isn't going to hurt, except on Ivy Bridge, which apparently has some sort of SW prefetch throughput bug.