Why isn't accessing an array of int8_t faster than an array of int32_t, given cache effects?


I have read that when accessing memory with a stride, as in

for (int i = 0; i < aSize; i++) a[i] *= 3;

for (int i = 0; i < aSize; i += 16) a[i] *= 3;

both loops should perform similarly, because the cost of the memory accesses dominates the cost of the multiplications.

I'm playing around with Google Benchmark, and while testing similar cache behaviour I'm getting results that I don't understand.

template <class IntegerType>
void BM_FillArray(benchmark::State& state) {
    for (auto _ : state)
    {
        IntegerType a[15360 * 1024 * 2]; // Reserve array that doesn't fit in L3
        for (size_t i = 0; i < sizeof(a) / sizeof(IntegerType); ++i)
            benchmark::DoNotOptimize(a[i] = 0); // I have compiler optimizations disabled anyway
    }
}
BENCHMARK_TEMPLATE(BM_FillArray, int32_t);
BENCHMARK_TEMPLATE(BM_FillArray, int8_t);

Here are the results with optimizations disabled:

Run on (12 X 3592 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 15360 KiB (x1)
---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
BM_FillArray<int32_t>     196577075 ns    156250000 ns            4
BM_FillArray<int8_t>      205476725 ns    160156250 ns            4

I would expect accessing the array of bytes to be faster than the array of ints as more elements fit in a cache line, but this is not the case.

Here are the results with optimizations enabled:

BM_FillArray<int32_t>   47279657 ns     47991071 ns           14
BM_FillArray<int8_t>    49374830 ns     50000000 ns           10

Can anyone please clarify this? Thanks :)

UPDATE 1:

I have read the classic article "What Every Programmer Should Know About Memory" and everything is clearer now. However, I've tried the following benchmark:

template <int32_t CacheLineSize>
void BM_ReadArraySeqCacheLine(benchmark::State& state) {

    struct CacheLine
    {
        int8_t a[CacheLineSize];
    };
    vector<CacheLine> cl;
    int32_t workingSetSize = state.range(0);
    int32_t arraySize = workingSetSize / sizeof(CacheLine);
    cl.resize(arraySize);

    const int32_t iterations = 1536 * 1024;

    srand(time(NULL)); // seed once, outside the timed loop

    for (auto _ : state)
    {
        int8_t res = 0;
        int32_t i = 0;
        while (i++ < iterations)
        {
            //size_t idx = i % arraySize;
            // Dividing by RAND_MAX + 1.0 keeps idx strictly below arraySize;
            // dividing by RAND_MAX could yield idx == arraySize, out of bounds.
            int idx = (rand() / (RAND_MAX + 1.0)) * arraySize;
            benchmark::DoNotOptimize(res += cl[idx].a[0]);
        }
    }
}
BENCHMARK_TEMPLATE(BM_ReadArraySeqCacheLine, 1)
    ->Arg(32 * 1024)    // L1 Data 32 KiB(x6)
    ->Arg(256 * 1024)   // L2 Unified 256 KiB(x6)
    ->Arg(15360 * 1024);// L3 Unified 15360 KiB(x1)
BENCHMARK_TEMPLATE(BM_ReadArraySeqCacheLine, 64)
    ->Arg(32 * 1024)    // L1 Data 32 KiB(x6)
    ->Arg(256 * 1024)   // L2 Unified 256 KiB(x6)
    ->Arg(15360 * 1024);// L3 Unified 15360 KiB(x1)
BENCHMARK_TEMPLATE(BM_ReadArraySeqCacheLine, 128)
    ->Arg(32 * 1024)    // L1 Data 32 KiB(x6)
    ->Arg(256 * 1024)   // L2 Unified 256 KiB(x6)
    ->Arg(15360 * 1024);// L3 Unified 15360 KiB(x1)

I would expect random accesses to perform much worse when the working set doesn't fit in the caches. However, these are the results:

BM_ReadArraySeqCacheLine<1>/32768        39936129 ns     38690476 ns           21
BM_ReadArraySeqCacheLine<1>/262144       40822781 ns     39062500 ns           16
BM_ReadArraySeqCacheLine<1>/15728640     58144300 ns     57812500 ns           10
BM_ReadArraySeqCacheLine<64>/32768       32786576 ns     33088235 ns           17
BM_ReadArraySeqCacheLine<64>/262144      32066729 ns     31994048 ns           21
BM_ReadArraySeqCacheLine<64>/15728640    50734420 ns     50000000 ns           10
BM_ReadArraySeqCacheLine<128>/32768      29122832 ns     28782895 ns           19
BM_ReadArraySeqCacheLine<128>/262144     31991964 ns     31875000 ns           25
BM_ReadArraySeqCacheLine<128>/15728640   68437327 ns     68181818 ns           11

What am I missing?

UPDATE 2:

I'm now using what you suggested (linear_congruential_engine) to generate the random numbers, and only static arrays, but the results are even more confusing to me.

Here is the updated code:

template <int32_t WorkingSetSize, int32_t ElementSize>
void BM_ReadArrayRndCacheLine(benchmark::State& state) {

    struct Element
    {
        int8_t data[ElementSize];
    };
    constexpr int32_t ArraySize = WorkingSetSize / sizeof(Element); // sizeof(Element), not sizeof(ElementSize)
    static Element a[ArraySize]; // static, so the larger working sets don't overflow the stack

    constexpr int32_t iterations = 1536 * 1024;
    linear_congruential_engine<size_t, ArraySize/10, ArraySize/10, ArraySize> lcg; // I've tried with many params...
    
    for (auto _ : state)
    {
        int8_t res = 0;
        int32_t i = 0;
        while (i++ < iterations)
        {
            size_t idx =  lcg();
            benchmark::DoNotOptimize(res += a[idx].data[0]);
        }
    }
}

// L1 Data 32 KiB(x6)
// L2 Unified 256 KiB(x6)
// L3 Unified 15360 KiB(x1)
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 32 * 1024, 1);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 32 * 1024, 64);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 32 * 1024, 128);

BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 256 * 1024, 1);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 256 * 1024, 64);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 256 * 1024, 128);

BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 15360 * 1024, 1);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 15360 * 1024, 64);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 15360 * 1024, 128);

BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 15360 * 1024 * 4, 1);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 15360 * 1024 * 4, 64);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 15360 * 1024 * 4, 128);

Here are the results (optimizations enabled):

// First template parameter is working set size.
// Second template parameter is array element size.
BM_ReadArrayRndCacheLine<32 * 1024, 1>             2833786 ns      2823795 ns          249
BM_ReadArrayRndCacheLine<32 * 1024, 64>            2960200 ns      2979343 ns          236
BM_ReadArrayRndCacheLine<32 * 1024, 128>           2896079 ns      2910539 ns          204

BM_ReadArrayRndCacheLine<256 * 1024, 1>            3114670 ns      3111758 ns          236
BM_ReadArrayRndCacheLine<256 * 1024, 64>           3629689 ns      3643135 ns          193
BM_ReadArrayRndCacheLine<256 * 1024, 128>          3213500 ns      3187189 ns          201

BM_ReadArrayRndCacheLine<15360 * 1024, 1>          5782703 ns      5729167 ns           90
BM_ReadArrayRndCacheLine<15360 * 1024, 64>         5958600 ns      6009615 ns          130
BM_ReadArrayRndCacheLine<15360 * 1024, 128>        5958221 ns      5998884 ns          112

BM_ReadArrayRndCacheLine<15360 * 1024 * 4, 1>      6143701 ns      6076389 ns           90
BM_ReadArrayRndCacheLine<15360 * 1024 * 4, 64>     5800649 ns      5902778 ns           90
BM_ReadArrayRndCacheLine<15360 * 1024 * 4, 128>    5826414 ns      5729167 ns           90

How is it possible that the results for (L1d < workingSet < L2) hardly differ from those for (workingSet < L1d)? L2's throughput and latency are still very good, but with the random accesses I'm trying to prevent prefetching and force cache misses... so why am I not noticing even a minimal increase?

Even when trying to fetch from main memory (workingSet > L3) I'm not getting a massive performance drop. You mention that the latest architectures can sustain bandwidths of up to ~8 bytes per clock, but I understand that they must copy a whole cache line, and that without prefetching on a predictable linear pattern, latency should be much more noticeable in my tests... why is that not the case?

I suspect that page faults and the TLB may have something to do with it too.

(I've downloaded VTune to try to understand all this stuff better, but it hangs on my machine and I'm waiting for support.)

I REALLY appreciate your help Peter Cordes :)

I'm just a GAME programmer trying to show my teammates whether using certain integer types in our code might (or might not) have implications for our game's performance. For example, whether we should worry about using fast types (e.g. int_fast16_t) or using the fewest possible bytes for our variables for better packing (e.g. int8_t).

Answer by Peter Cordes (accepted):

Re: the ultimate question: int_fast16_t is garbage for arrays because glibc on x86-64 unfortunately defines it as a 64-bit type (not 32-bit), so it wastes huge amounts of cache footprint. The question is "fast for what purpose", and glibc answered "fast for use as array indices / loop counters", apparently, even though it's slower to divide, or to multiply on some older CPUs (which were current when the choice was made). IMO this was a bad design decision.

Generally using small integer types for arrays is good; usually cache misses are a problem so reducing your footprint is nice even if it means using a movzx or movsx load instead of a memory source operand to use it with an int or unsigned 32-bit local. If SIMD is ever possible, having more elements per fixed-width vector means you get more work done per instruction.

Unfortunately, with some libraries int_fast16_t isn't going to help you achieve that, but short or int_least16_t will.


See my comments under the question for answers to the early part: 200 stall cycles is latency, not throughput. HW prefetch and memory-level parallelism hide that. Modern Microprocessors - A 90 Minute Guide! is excellent, and has a section on memory. See also How much of ‘What Every Programmer Should Know About Memory’ is still valid? which is still highly relevant in 2021. (Except for some stuff about prefetch threads.)


Your Update 2 with a faster PRNG

Re: why L2 isn't slower than L1: out-of-order exec is sufficient to hide L2 latency, and even your LCG is too slow to stress L2 throughput. It's hard to generate random numbers fast enough to give the available memory-level parallelism much trouble.

Your Skylake-derived CPU has an out-of-order scheduler (RS) of 97 uops, and a ROB size of 224 uops (like https://realworldtech.com/haswell-cpu/3 but larger), and 12 LFBs to track cache lines it's waiting for. As long as the CPU can keep track of enough in-flight loads (latency * bandwidth), having to go to L2 is not a big deal. Ability to hide cache misses is one way to measure out-of-order window size in the first place: https://blog.stuffedcow.net/2013/05/measuring-rob-capacity


Latency for an L2 hit is 12 cycles (https://www.7-cpu.com/cpu/Skylake.html). Skylake can do 2 loads per clock from L1d cache, but not from L2. (It can't sustain 1 cache line per clock IIRC, but 1 per 2 clocks or even somewhat better is doable).

Your LCG RNG bottlenecks your loop on its latency: 5 cycles for power-of-2 array sizes, or more like 13 cycles for non-power-of-2 sizes like your "L3" test attempts (see note 1 below). So that's about 1/10th the access rate that L1d can handle, and even if every access misses L1d but hits in L2, you're not even keeping more than one load in flight from L2. OoO exec + load buffers aren't even going to break a sweat. So L1d and L2 will be the same speed because both tests use power-of-2 array sizes.

Note 1: imul (3c) + add (1c) for x = a * x + c; then the remainder x % m is computed as x - (x/m * m) using a multiplicative inverse, probably mul (4 cycles for the high half of a size_t?) + shr (1c) + imul (3c) + sub (1c). Or with a power-of-2 size, modulo is just AND with a constant like (1UL << n) - 1, as sketched below.
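
As a minimal sketch of the two update steps this footnote compares (function names are mine, not from the benchmark):

#include <cstdint>

// Generic modulus known at compile time: the % M compiles to the
// multiplicative-inverse sequence estimated above (mul + shr + imul + sub),
// all of it on the loop's latency-critical path.
template <uint64_t A, uint64_t C, uint64_t M>
uint64_t lcg_step(uint64_t x) {
    return (A * x + C) % M;
}

// Power-of-2 modulus M = 2^N: the modulo reduces to a single AND.
template <uint64_t A, uint64_t C, unsigned N>
uint64_t lcg_step_pow2(uint64_t x) {
    return (A * x + C) & ((uint64_t(1) << N) - 1);
}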

Clearly my estimates aren't quite right, because your non-power-of-2 arrays take less than twice as long as the L1d / L2 cases, not the 13/5 ratio my estimate would predict even if L3 latency/bandwidth weren't a factor.

Running multiple independent LCGs (with different seeds) in an unrolled loop could make a difference. But a non-power-of-2 m for an LCG still means quite a few instructions, so you would bottleneck on CPU front-end throughput (and back-end execution ports, specifically the multiplier).
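
For illustration, a sketch of that unrolled idea, assuming a power-of-2 array size; the seeds and the Knuth MMIX multiplier/increment are arbitrary choices of mine:

#include <cstddef>
#include <cstdint>

// Four independent LCG streams interleaved so their ~5-cycle latency
// chains overlap instead of serializing. Indices take high bits because
// the low bits of a power-of-2-modulus LCG have very short periods.
int8_t sum_random_4way(const int8_t* arr, uint64_t mask, size_t iters) {
    uint64_t x0 = 1, x1 = 2, x2 = 3, x3 = 4;   // four different seeds
    const uint64_t a = 6364136223846793005ULL; // Knuth's MMIX constants
    const uint64_t c = 1442695040888963407ULL;
    int8_t res = 0;
    for (size_t i = 0; i < iters; i += 4) {
        x0 = a * x0 + c;  res += arr[(x0 >> 32) & mask];
        x1 = a * x1 + c;  res += arr[(x1 >> 32) & mask];
        x2 = a * x2 + c;  res += arr[(x2 >> 32) & mask];
        x3 = a * x3 + c;  res += arr[(x3 >> 32) & mask];
    }
    return res;
}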

An LCG with multiplier (a) = ArraySize/10 is probably just barely a large enough stride for the hardware prefetcher to not benefit much from locking on to it. But normally IIRC you want a large odd number or something (it's been a while since I looked at the math of LCG param choices), otherwise you risk touching only a limited number of array elements, not eventually covering them all. (You could test that by storing a 1 to every array element in a random loop, then counting how many array elements got touched, i.e. by summing the array if other elements are 0; see the sketch below.)
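
A sketch of that coverage check, reusing the question's a = c = ArraySize/10 parameters (variable names are mine):

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Drive the same LCG recurrence, mark every element it visits, then count
// how many distinct elements were reached.
void check_lcg_coverage(size_t arraySize, size_t iterations) {
    std::vector<uint8_t> touched(arraySize, 0);
    size_t a = arraySize / 10, c = arraySize / 10, x = 1;
    for (size_t i = 0; i < iterations; ++i) {
        x = (a * x + c) % arraySize;
        touched[x] = 1;
    }
    size_t covered = 0;
    for (uint8_t t : touched) covered += t;
    std::printf("touched %zu of %zu elements\n", covered, arraySize);
}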

a and c should definitely not both be factors of m, otherwise you're accessing the same 10 cache lines every time to the exclusion of everything else.
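
For a concrete toy case (numbers mine): take m = 100 and a = c = 10. Then x_{n+1} = (10 * x_n + 10) mod 100 is always a multiple of 10, so after the first step the generator can only ever visit the ten values {0, 10, 20, ..., 90}, whatever the seed.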

As I said earlier, it doesn't take much randomness to defeat HW prefetch. An LCG with c = 0, a = an odd number (maybe prime), and m = 2^32 (the implicit wraparound of unsigned arithmetic) might be good, literally just an imul. You can modulo to your array size on each LCG result separately, taking that operation off the critical path. At this point you might as well keep the standard library out of it: literally just start with unsigned rng = 1; and use rng *= 1234567; as your update step. Then use arr[rng % arraysize], as sketched below.
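
A literal sketch of that suggestion (function name is mine); the RNG update is a single imul, and the % arraysize is computed per result, off the rng-to-rng latency chain:

#include <cstddef>
#include <cstdint>

int8_t sum_random_imul(const int8_t* arr, unsigned arraysize, size_t iters) {
    unsigned rng = 1;                  // seed
    int8_t res = 0;
    for (size_t i = 0; i < iters; ++i) {
        rng *= 1234567;                // the whole RNG: one imul per step
        res += arr[rng % arraysize];   // derived index, off the critical path
    }
    return res;
}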

That's cheaper than anything you could do with xorshift+ or xorshift*.


Benchmarking cache latency:

You could generate an array of random uint16_t or uint32_t indices once (e.g. in a static initializer or constructor) and loop over that repeatedly, accessing another array at those positions. That would interleave sequential and random access, and make code that could probably do 2 loads per clock with L1d hits, especially if you use gcc -O3 -funroll-loops. (With -march=native it might auto-vectorize with AVX2 gather instructions, but only for 32-bit or wider elements, so use -fno-tree-vectorize if you want to rule out that confounding factor that only comes from taking indices from an array.)
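
A sketch of that setup (sizes and names are mine); the indices are generated once, outside the timed region, so a timed loop over them is pure loads:

#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Generated once up front, e.g. from a static initializer or constructor.
std::vector<uint32_t> make_indices(size_t count, uint32_t arraySize) {
    std::mt19937 gen(12345);           // fixed seed for repeatable runs
    std::uniform_int_distribution<uint32_t> dist(0, arraySize - 1);
    std::vector<uint32_t> v(count);
    for (auto& x : v) x = dist(gen);
    return v;
}

// Timed part: sequential reads of the index array interleaved with
// random reads of the data array.
int8_t sum_via_index_array(const std::vector<int8_t>& data,
                           const std::vector<uint32_t>& indices) {
    int8_t res = 0;
    for (uint32_t idx : indices)
        res += data[idx];
    return res;
}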

To test cache / memory latency, the usual technique is to build a linked list whose nodes are randomly distributed around an array. Walking the list, the next load can start as soon as (but not before) the previous load completes, because its address depends on the previous load's result. This is called "load-use latency". See also Is there a penalty when base+offset is in a different page than the base? for a trick Intel CPUs use to optimistically speed up workloads like that (the 4-cycle L1d latency case, instead of the usual 5 cycles). Semi-related: PyPy 17x faster than Python. Can Python be sped up? is another test that's dependent on pointer-chasing latency.
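
A sketch of that technique (layout and names are mine): each slot stores the index of the next slot along one random cycle, so each load's address depends on the previous load's result:

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Build one random cycle visiting every slot exactly once.
std::vector<size_t> make_random_cycle(size_t n) {
    std::vector<size_t> order(n);
    std::iota(order.begin(), order.end(), size_t{0});
    std::shuffle(order.begin(), order.end(), std::mt19937(42));
    std::vector<size_t> next(n);
    for (size_t i = 0; i < n; ++i)
        next[order[i]] = order[(i + 1) % n];
    return next;
}

// Serialized pointer-chase: measures load-use latency, not bandwidth.
size_t chase(const std::vector<size_t>& next, size_t steps) {
    size_t p = 0;
    for (size_t i = 0; i < steps; ++i)
        p = next[p];
    return p;   // returning p keeps the chain from being optimized away
}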