In the following code I am XORing two arrays, result and DB; result is accessed at a rotated offset (rotate1_1 in the code below). As you can see, I am already using AVX2, loop unrolling, and prefetching, and I am wondering whether I am missing anything that might be causing the slow speed. The else branch below is taken only once per call to the function. Profiling shows that about 50 percent of the time is spent on the XOR, about 40 percent on the data store, and the remainder on loads.
void perform_array_xor(uint32_t partindex, uint32_t offset, uint64_t *result, uint32_t EntrySize, uint32_t PartSize)
{
auto B = 1;
assert(EntrySize/8==B);
// Ensure that PartSize is a multiple of 8 entries (two 32-byte AVX2 vectors) for this example
if (PartSize % 8 != 0)
{
// Handle this case
return;
}
__m256i a,b,r;
unsigned int rotate1_1;
int k;
for (int i = 0; i < PartSize; i += 8)
{
rotate1_1 = (i + offset) & (PartSize - 1);  // rotated index into result; PartSize is a power of two
_mm_prefetch(result + rotate1_1, _MM_HINT_T2);
k = 0;
if(rotate1_1 + 7 < PartSize){  // fast path: this block of 8 entries does not wrap around the end of result
a = _mm256_loadu_si256((__m256i*)(result + rotate1_1));
b = _mm256_loadu_si256((__m256i*)(DB + partindex + i));
r = _mm256_xor_si256(a, b);
_mm256_storeu_si256((__m256i*)(result + rotate1_1), r);
//std::memcpy(result + rotate1_1, &r, sizeof(__m256i));
k = 4 ;
a = _mm256_loadu_si256((__m256i*)(result + rotate1_1 + k));
b = _mm256_loadu_si256((__m256i*)(DB + partindex + i + k));
r = _mm256_xor_si256(a, b);
_mm256_storeu_si256((__m256i*)(result + rotate1_1 + k), r);
//std::memcpy(result + rotate1_1 + k, &r, sizeof(__m256i));
}
else{
result[(rotate1_1 + 0) & (PartSize - 1)] ^= DB[partindex + (i + 0)];
result[(rotate1_1 + 1) & (PartSize - 1)] ^= DB[partindex + (i + 1)];
result[(rotate1_1 + 2) & (PartSize - 1)] ^= DB[partindex + (i + 2)];
result[(rotate1_1 + 3) & (PartSize - 1)] ^= DB[partindex + (i + 3)];
result[(rotate1_1 + 4) & (PartSize - 1)] ^= DB[partindex + (i + 4)];
result[(rotate1_1 + 5) & (PartSize - 1)] ^= DB[partindex + (i + 5)];
result[(rotate1_1 + 6) & (PartSize - 1)] ^= DB[partindex + (i + 6)];
result[(rotate1_1 + 7) & (PartSize - 1)] ^= DB[partindex + (i + 7)];
}
}
}
Update: Here are more implementation details about the function
- The DB array has 2^28 entries, so 2 GB of data in total
- result is an array of 2^14 entries, so 128 KB in total
- result is the same array across function calls
- On every function call, a contiguous run of 2^14 entries of DB is accessed
- Currently, all of DB is processed in 143 ms, which is around 13 GB/s
- I am optimizing for a 9th-generation i7 and the clang compiler
This means vpxor is getting the blame because it's waiting for a load that's slow to produce a result. (And presumably it compiles to a memory-source vpxor instruction, so there's also a load built into the XOR itself.)

Normally you'd write that special case as code after the loop, instead of an if inside the main loop, even if that means you have to declare i in a scope outside the loop. Or if you can compute the right array indices for that final iteration without needing the final i from the vectorized iterations, just do that.
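To illustrate that second option, here is a minimal sketch (my restructuring, not the original code) that treats the rotation as two contiguous runs, so the hot loop has no wrap-around check at all. It assumes PartSize is a power of two, offset < PartSize, DB is the same global uint64_t array as in the question, and immintrin.h is included; xor_run_avx2 is a made-up helper name.

static inline void xor_run_avx2(uint64_t *dst, const uint64_t *src, uint32_t n)
{
    uint32_t i = 0;
    for (; i + 4 <= n; i += 4)    // 4 x uint64_t = one 256-bit vector
    {
        __m256i a = _mm256_loadu_si256((const __m256i*)(dst + i));
        __m256i b = _mm256_loadu_si256((const __m256i*)(src + i));
        _mm256_storeu_si256((__m256i*)(dst + i), _mm256_xor_si256(a, b));
    }
    for (; i < n; ++i)            // scalar tail of at most 3 elements
        dst[i] ^= src[i];
}

void perform_array_xor(uint32_t partindex, uint32_t offset, uint64_t *result, uint32_t EntrySize, uint32_t PartSize)
{
    // result[(i + offset) & (PartSize - 1)] ^= DB[partindex + i] for all i
    // is equivalent to two contiguous XOR runs, with no per-iteration branch:
    xor_run_avx2(result + offset, DB + partindex, PartSize - offset);
    xor_run_avx2(result, DB + partindex + (PartSize - offset), offset);
}

Each element of result is still touched exactly once per call, so the order of the two runs doesn't matter for correctness.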
i7-9xxx is still Skylake-derived, so 256 KiB of L2 cache. One might hope that the 128 KiB result would stay hot in L2 cache across calls with different parts, but check performance counters to find out.

Probably look at l2_lines_out.non_silent to see the rate of dirty write-back from L2, as opposed to clean evictions of non-modified data that you were only reading. Maybe also l2_rqsts.rfo_hit vs. l2_rqsts.rfo_miss, but no, that's probably always low, since a demand miss for the load side of result ^= will come first. So the store comes after a demand load on the same address, meaning you probably won't see many RFO misses (since with no other thread accessing the same line, the initial load will get MESI Exclusive ownership). That means you probably won't see counts for resource_stalls.sb (store buffer full) either.

13 GiB/s is disappointingly low for an i7-9xxx with dual-channel memory. (Unless that if inside the loop was slowing things down. But probably not: with 32-byte vectors and 2 loads + 1 store per vector op, CPU throughput is so much higher than DRAM bandwidth that there's room for some inefficiency in the loop even if the compiler doesn't do loop unswitching or peeling of the final special iteration, or whatever name applies to this optimization.) If result was staying hot in L2 cache, you'd hope to come a lot closer to max DRAM bandwidth, e.g. something over 30 GiB/s on a system with DDR4-2666, for a theoretical max of 41.6 GiB/s. (Intel "client" (non-server) CPUs typically have low enough latency to DRAM that one core can nearly saturate the memory controllers: Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?)
Perhaps worth trying cache-blocking. Do each partindex for the first 32 or 64 KiB of result, so even with the data from DB streaming through L2, it will hopefully stay hot. Then loop over the rest of result. You're still only touching each byte of DB (that you access at all) once, and with a block size that's a multiple of 4 KiB, you hopefully avoid useless hardware prefetch if DB and your parts are 4K-aligned. (Although mis-speculated loads may cross into those pages.)
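Here is a sketch of that blocking under assumptions of mine, not taken from the question: the caller loops over parts with partindex = p * PartSize and the same offset for every part; num_parts and xor_all_parts_blocked are made-up names, and xor_run_avx2 is the helper from the earlier sketch. A 64 KiB window of result is walked across all parts before moving on to the next window.

constexpr uint32_t BLOCK = (64 * 1024) / sizeof(uint64_t);    // 8192 entries = 64 KiB

void xor_all_parts_blocked(uint64_t *result, uint32_t offset, uint32_t num_parts, uint32_t PartSize)
{
    for (uint32_t i0 = 0; i0 < PartSize; i0 += BLOCK)          // window into each part
    {
        for (uint32_t p = 0; p < num_parts; ++p)
        {
            const uint64_t *src = DB + (uint64_t)p * PartSize + i0;
            uint32_t dst0  = (i0 + offset) & (PartSize - 1);   // rotated start of this window
            uint32_t first = PartSize - dst0 < BLOCK ? PartSize - dst0 : BLOCK;
            // split the window into two contiguous runs so nothing wraps mid-run
            xor_run_avx2(result + dst0, src, first);
            xor_run_avx2(result, src + first, BLOCK - first);
        }
    }
}

This assumes PartSize is a multiple of BLOCK; with the 128 KiB result from the question that's just two windows, so only 64 KiB of result has to stay resident while DB streams through.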
You could also do 2 parts in parallel, so each vector from result is XORed twice before storing again. But that means touching more data between each access to that part of result, making it more likely to be evicted before you get back to it. If you can do multiple streams while still getting L2 hits for result, it would take out some of the store/reload work. But if that's hitting in L2, it shouldn't be hurting memory parallelism for the off-core accesses much. Perhaps interesting to try, though, maybe even with 4 streams. But Skylake's L2 cache is only 4-way associative, and if all the accesses are offset from each other by a multiple of 128 KiB, they'll all alias the same set in L2, creating conflict misses. L1d is still 8-way, though, so 4 input streams are probably OK on that score.
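A sketch of the two-streams idea, again assuming both parts use the same offset and that wrap-around is handled by splitting into contiguous runs as above; xor_two_streams_avx2 is a made-up name.

static inline void xor_two_streams_avx2(uint64_t *dst, const uint64_t *srcA, const uint64_t *srcB, uint32_t n)
{
    uint32_t i = 0;
    for (; i + 4 <= n; i += 4)
    {
        // one load + one store of dst per two DB vectors folded in
        __m256i r = _mm256_loadu_si256((const __m256i*)(dst + i));
        __m256i a = _mm256_loadu_si256((const __m256i*)(srcA + i));
        __m256i b = _mm256_loadu_si256((const __m256i*)(srcB + i));
        r = _mm256_xor_si256(r, _mm256_xor_si256(a, b));
        _mm256_storeu_si256((__m256i*)(dst + i), r);
    }
    for (; i < n; ++i)
        dst[i] ^= srcA[i] ^ srcB[i];
}

Relative to two separate passes, this halves the loads and stores of result per byte of DB, at the cost of keeping two DB streams live at once.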
If you're tuning just for one particular CPU, you could also use NT prefetch to bypass L2, prefetching only into L1 (and one way of L3 since it's inclusive in Intel client CPUs). (Prefetch distance is pretty sensitive to the CPU and conditions, especially with NT prefetch. Too early and data is evicted again before you load it, which is extra bad with NT prefetch since it's not even in L2. Too late and you still get a demand load, which may pull it into L2 defeating the attempt to use NT prefetches.)
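A sketch of what that could look like for a contiguous run, with the caveat above that the prefetch distance is a made-up starting point and needs tuning on the target machine:

static inline void xor_run_avx2_ntpf(uint64_t *dst, const uint64_t *src, uint32_t n)
{
    constexpr uint32_t PF_DIST = 512;    // entries (4 KiB) ahead of the loads; tune per CPU and conditions
    uint32_t i = 0;
    for (; i + 8 <= n; i += 8)
    {
        // one NTA prefetch per 64-byte line of the streaming DB side;
        // prefetching past the end of src is harmless because prefetch never faults
        _mm_prefetch((const char*)(src + i + PF_DIST), _MM_HINT_NTA);
        __m256i a0 = _mm256_loadu_si256((const __m256i*)(dst + i));
        __m256i b0 = _mm256_loadu_si256((const __m256i*)(src + i));
        _mm256_storeu_si256((__m256i*)(dst + i), _mm256_xor_si256(a0, b0));
        __m256i a1 = _mm256_loadu_si256((const __m256i*)(dst + i + 4));
        __m256i b1 = _mm256_loadu_si256((const __m256i*)(src + i + 4));
        _mm256_storeu_si256((__m256i*)(dst + i + 4), _mm256_xor_si256(a1, b1));
    }
    for (; i < n; ++i)
        dst[i] ^= src[i];
}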
Other than NT prefetch, prefetch instructions are not typically helpful for sequential access on modern CPUs (especially Intel); hardware prefetch and out-of-order exec are enough. (How much of ‘What Every Programmer Should Know About Memory’ is still valid?) Although gcc -mtune=znver2 or similar for AMD CPUs does IIRC sometimes generate prefetch instructions when auto-vectorizing. And if your loop doesn't have much else to do, running prefetch instructions normally isn't going to hurt, except on Ivy Bridge, which apparently has some sort of SW prefetch throughput bug.