How to demonstrate the impact of instruction cache limitations

462 views Asked by At

My orginial idea was to give an elegant code example, that would demonstrate the impact of instruction cache limitations. I wrote the following piece of code, that creates a large amount of identical functions, using template metaprogramming.

volatile int checksum;
void (*funcs[MAX_FUNCS])(void);

template <unsigned t> 
__attribute__ ((noinline)) static void work(void) { ++checksum; }

template <unsigned t> 
static void create(void) { funcs[t - 1] = &work<t - 1>; create<t - 1>(); }

template <> void create<0>(void) {  }

int main()
{
    create<MAX_FUNCS>();

    for (unsigned range = 1; range <= MAX_FUNCS; range *= 2)
    {
        checksum = 0;
        for (unsigned i = 0; i < WORKLOAD; ++i)
        {
            funcs[i % range]();
        }
    }

    return 0;
}

The outer loop varies the amount of different functions to be called using a jump table. For each loop pass, the time taken to invoke WORKLOAD functions is then measured. Now what are the results? The following chart shows the average run time per function call in relation to the used range. The blue line shows the data measured on a Core i7 machine. The comparative measurement, depicted by the red line, was carried out on a Pentium 4 machine. Yet when it comes to interpreting these lines, I seem to be somehow struggling...

chart

The only jumps of the piecewise constant red curve occur exactly where the total memory consumption for all functions within range exceed the capacity of one cache level on the tested machine, which has no dedicated instruction cache. For very small ranges (below 4 in this case) however, run time still increases with the amount of functions. This may be related to branch prediction efficiency, but since every function call reduces to an unconditional jump in this case, I'm not sure if there should be any branching penalty at all.

The blue curve behaves quite differently. Run time is constant for small ranges and increases logarithmic thereafter. Yet for larger ranges, the curve seems to be approaching a constant asymptote again. How exactly can the qualitative differences of both curves be explained?

I am currently using GCC MinGW Win32 x86 v.4.8.1 with g++ -std=c++11 -ftemplate-depth=65536 and no compiler optimization.

Any help would be appreciated. I am also interested in any idea on how to improve the experiment itself. Thanks in advance!

1

There are 1 answers

0
Leeor On

First, let me say that I really like how you've approached this problem, this is a really neat solution for intentional code bloating. However, there might still be several possible issues with your test -

  1. You also measure the warmup time. you didn't show where you've placed your time checks, but if it's just around the internal loop - then the first time until you reach range/2 you'd still enjoy the warmup of the previous outer iteration. Instead, measure only warm performance - run each internal iteration for several times (add another loop in the middle), and take the timestamp only after 1-2 rounds.

  2. You claim to have measure several cache levels, but your L1 cache is only 32k, which is where your graph ends. Even assuming this counts in terms of "range", each function is ~21 bytes (at least on my gcc 4.8.1), so you'll reach at most 256KB, which is only then scratching the size of your L2.

  3. You didn't specify your CPU model (i7 has at least 4 generations in the market now, Haswell, IvyBridge, SandyBridge and Nehalem). The differences are quite large, for example an additional uop-cache since Sandybrige with complicated storage rules and conditions. Your baseline is also complicating things, if I recall correctly the P4 had a trace cache which might also cause all sorts of performance impacts. You should check an option to disable them if possible.

  4. Don't forget the TLB - even though it probably doesn't play a role here in such a tightly organized code, the number of unique 4k pages should not exceed the ITLB (128 entries), and even before that you may start having collisions if your OS did not spread the physical code pages well enough to avoid ITLB collisions.