Interpreting Absurdly-Low Measured Latency in Careful Profile (Superscalarity Effects?)


I've written some code for profiling small functions. At the high level it:

  1. Sets the thread affinity to only one core and the thread priority to maximum.
  2. Computes statistics from doing the following 100 times:

    1. Estimate the latency of a function that does nothing.
    2. Estimate the latency of the test function.
    3. Subtract the first from the second to remove the cost of doing function-call overhead, thereby roughly getting the cost of the test function's contents.

To estimate the latency of a function, it does the following (a rough C++ sketch appears after the list):

  1. Invalidates caches (this is difficult to actually do in user-mode, but I allocate and write a buffer the size of the L3 cache to memory, which should help somewhat).
  2. Yields the thread, so that the profile loop has as-few-as-possible context switches.
  3. Gets the current time from a std::chrono::high_resolution_clock (which seems to compile down to system_clock).
  4. Runs the profile loop 100,000,000 times, calling the tested function within.
  5. Gets the current time from a std::chrono::high_resolution_clock and subtracts to get latency.
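
In rough C++ terms, the measurement routine looks like the following sketch (names such as invalidate_caches and estimate_latency_ns, and the 8 MiB buffer standing in for "the size of the L3", are illustrative stand-ins rather than the exact code):

#include <chrono>
#include <cstddef>
#include <sched.h>
#include <vector>

void invalidate_caches() {
    // Write one byte per cache line of an L3-sized buffer so previously
    // cached data is likely evicted (best effort; user mode can't do better).
    static std::vector<char> buffer(8 * 1024 * 1024);
    for (std::size_t i = 0; i < buffer.size(); i += 64)
        buffer[i]++;
}

double estimate_latency_ns(float (*fn)(float)) {
    invalidate_caches();
    sched_yield();  // start right after a reschedule to minimize context switches

    auto t0 = std::chrono::high_resolution_clock::now();
    for (long i = 0; i < 100000000; ++i)
        fn(1.0f);
    auto t1 = std::chrono::high_resolution_clock::now();

    return std::chrono::duration<double, std::nano>(t1 - t0).count() / 1e8;
}

// Reported value per trial: estimate_latency_ns(test_fn) - estimate_latency_ns(empty_fn).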

Because individual instructions matter at this level, we have to write very careful code at all points to ensure that the compiler doesn't elide, inline, cache, or otherwise treat the functions specially. I have manually validated the generated assembly in various test cases, including the one I present below.
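
The kind of guard rails I mean look roughly like this (a sketch for GCC/Clang on x86-64, not my exact code):

#include <cmath>

// noinline keeps a real call/ret pair in the measured loop.
__attribute__((noinline)) float tested(float x) { return std::sqrt(x); }

// Calling through a volatile function pointer stops the optimizer from
// inlining, constant-folding, or hoisting the call out of the loop.
float (*volatile fn)(float) = tested;

// Writing each result to a volatile sink keeps the computation alive.
volatile float sink;

void profile_body() {
    for (long i = 0; i < 100000000; ++i)
        sink = fn(1.0f);
}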


I am getting extremely low (sub-nanosecond) latencies reported in some cases. I have tried everything I can think of to account for this, but cannot find an error.

I am looking for an explanation accounting for this behavior. Why are my profiled functions taking so little time?


Let's take the example of computing the square root of a float.

The function signature is float(*)(float), and the empty function is trivial:

empty_function(float):
    ret

Let's compute the square root by using the sqrtss instruction, and by the multiplication-by-reciprocal-square-root hack. I.e., the tested functions are:

sqrt_sseinstr(float):
    sqrtss  xmm0, xmm0
    ret
sqrt_rcpsseinstr(float):
    movaps  xmm1, xmm0
    rsqrtss xmm1, xmm0
    mulss   xmm0, xmm1
    ret
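
For reference, the same two routines could be written with SSE intrinsics roughly as follows (a sketch; only the assembly above is authoritative):

#include <xmmintrin.h>

// Plain hardware square root (compiles to sqrtss).
float sqrt_sseinstr(float x) {
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(x)));
}

// x * rsqrt(x) ~= sqrt(x): the reciprocal-square-root hack (rsqrtss + mulss),
// trading accuracy for speed.
float sqrt_rcpsseinstr(float x) {
    __m128 v = _mm_set_ss(x);
    return _mm_cvtss_f32(_mm_mul_ss(v, _mm_rsqrt_ss(v)));
}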

Here's the profile loop. Again, this same code is called with the empty function and with the test functions:

double profile(float):
    ...

    mov    rbp,rdi

    push   rbx
    mov    ebx, 0x5f5e100

    call   1c20 <invalidate_caches()>
    call   1110 <sched_yield()>

    call   1050 <std::chrono::high_resolution_clock::now()>

    mov    r12, rax
    xchg   ax,  ax

    15b0:
    movss  xmm0,DWORD PTR [rip+0xba4]
    call   rbp
    sub    rbx, 0x1
    jne    15b0 <double profile(float)+0x20>

    call   1050 <std::chrono::high_resolution_clock::now()>

    ...

The timing result for sqrt_sseinstr(float) on my Intel 990X is 3.60±0.13 nanoseconds. At this processor's rated 3.46 GHz, that works out to be 12.45±0.44 cycles. This seems pretty spot-on, given that the docs say the latency of sqrtss is around 13 cycles (it's not listed for this processor's Nehalem architecture, but it seems likely to also be around 13 cycles).

The timing result for sqrt_rcpsseinstr(float) is stranger: 0.01±0.07 nanoseconds (or 0.02±0.24 cycles). This is flatly implausible unless some other effect is going on.

I thought perhaps the processor is able to hide the latency of the tested function, partially or completely, because the tested function uses different execution ports (i.e., superscalar execution is hiding something)? I tried to analyze this by hand, but didn't get very far because I didn't really know what I was doing.

(Note: I cleaned up some of the assembly notation for your convenience. An unedited objdump of the whole program, which includes several other variants, is here, and I am temporarily hosting the binary here (x86-64 SSE2+, Linux).)


The question, again: why are some profiled functions producing implausibly small values? If it is a higher-order effect, can you explain it?

There are 2 answers

Answer from BeeOnRope

The problem is with the basic approach of subtracting out the "latency"1 of an empty function, as described:

  1. Estimate the latency of a function that does nothing.
  2. Estimate the latency of the test function.
  3. Subtract the first from the second to remove the cost of doing function-call overhead, thereby roughly getting the cost of the test function's contents.

The built-in assumption is that the cost of calling a function is X, and if the latency of the work done in the function is Y, then the total cost will be something like X + Y.

This isn't generally true for any two blocks of work, and it especially isn't true when one of them is "calling a function". A more sophisticated view would be that the total time will be somewhere between min(X, Y) and X + Y - but even this is often wrong, depending on the details. Still, it's enough of a refinement to explain what is going on here: the cost of the function call is not additive with the work being done in the function; they happen in parallel.

The cost of an empty function call is something like 4 to 5 cycles on modern Intel, probably bottlenecked on the front-end throughput for the two taken branches, and possibly by branch and return predictor latency.

However, when you add additional work to an empty function, it generally won't compete for the same resources, and its instructions won't depend on the "output" of the call (i.e., the work will form a separate dependency chain), except perhaps in rare cases where the stack pointer is manipulated and the stack engine doesn't remove the dependency.

So essentially the function will take the greater of the time needed for the function call mechanics, or the actual work done by the function. This approximation isn't exact, because some types of work may actually add to the overhead of the function call (e.g., if there are enough instructions for the front end to get through before getting to the ret, the total time may increase on top of the 4-5 cycle empty function time, even if the total work is less than that) - but it's a good first order approximation.

Your first function takes enough time that the actual work dominates the execution time. The second function is much faster, however, enabling it to "hide under" the existing time taken by the call/ret mechanics. (With made-up numbers: if the call/ret mechanics take about 5 cycles and the independent rsqrtss + mulss chain takes about 3, both the empty function and the fast function measure roughly max(5, 3) = 5 cycles per iteration, and the subtraction lands near zero.)

The solution is simple: duplicate the work within the function N times, so that the work always dominates. N=10 or N=50 or something like that is fine. You have to decide whether you want to test latency, in which case the output of one copy of the work should feed into the next, or throughput, in which case it shouldn't.
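
As a sketch, with work() standing in for whatever the tested function computes, the two variants look like this:

float work(float);  // stand-in for the body of the function under test

// Latency: each copy consumes the previous result, forming one dependency chain.
float bench_latency(float x) {
    for (int i = 0; i < 10; ++i)   // N = 10 copies of the work
        x = work(x);
    return x;
}

// Throughput: the copies use independent inputs, so the work itself is free
// to overlap; the cheap adds into acc just keep the results live.
float bench_throughput(float x) {
    float acc = 0.0f;
    for (int i = 0; i < 10; ++i)   // N = 10 independent copies
        acc += work(x + (float)i);
    return acc;
}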

On the other hand, if you actually want to test the cost of the function call + work, e.g., because that's how you'll be using it in real life, it is likely that the results you have gotten are already close to correct: stuff really can be "incrementally free" when it hides behind a function call.


1 I'm putting "latency" in quotes here because it isn't clear whether we should be talking about the latency of call/ret or the throughput. call and ret don't have any explicit outputs (and ret has no input), so they don't participate in a classic register-based dependency chain - but it might make sense to think of latency if you consider other hidden architectural components like the instruction pointer. In either case, latency or throughput mostly boils down to the same thing, because all call and ret instructions on a thread operate on the same state, so it doesn't make sense to speak of, say, "independent" vs "dependent" call chains.

Answer from EOF

Your benchmarking approach is fundamentally wrong, and your "careful code" is bogus.

First, emptying the cache is bogus. Not only will it quickly be repopulated with the required data, but the examples you have posted also have very little memory interaction (the only cache accesses are by call/ret and a load we'll get to).

Second, yielding before the benchmarking loop is bogus. You iterate 100,000,000 times, which even on a reasonably fast modern processor takes tens of milliseconds at the very least, far longer than the interval between typical scheduling clock interrupts on a stock operating system. If, on the other hand, you disable scheduling clock interrupts, then yielding before the benchmark doesn't do anything.

Now that the useless incidental complexity is out of the way, on to the fundamental misunderstanding of modern CPUs:

You expect loop_time_gross/loop_count to be the time spent in each loop iteration. This is wrong. Modern CPUs do not execute instructions one after the other, sequentially. Modern CPUs pipeline, predict branches, execute multiple instructions in parallel, and (on reasonably fast CPUs) execute them out of order.

So after the first handful of iterations of the benchmarking loop, all branches are perfectly predicted for the next almost 100,000,000 iterations. This enables the CPU to speculate. Effectively, the conditional branch in the benchmarking loop goes away, as does most of the cost of the indirect call. In effect, the CPU can unroll the loop:

movss  xmm0, number
movaps  xmm1, xmm0
rsqrtss xmm1, xmm0
mulss   xmm0, xmm1
movss  xmm0, number
movaps  xmm1, xmm0
rsqrtss xmm1, xmm0
mulss   xmm0, xmm1
movss  xmm0, number
movaps  xmm1, xmm0
rsqrtss xmm1, xmm0
mulss   xmm0, xmm1
...

or, for the other loop

movss  xmm0, number
sqrtss  xmm0, xmm0
movss  xmm0, number
sqrtss  xmm0, xmm0
movss  xmm0, number
sqrtss  xmm0, xmm0
...

Notably, the load of number is always from the same address (and thus quickly cached), and it overwrites the just-computed value, breaking the dependency chain between iterations.

To be fair, the

call   rbp
sub    rbx, 0x1
jne    15b0 <double profile(float)+0x20>

are still executed, but the only resources they take from the floating-point code are decode/micro-op cache and execution ports. Notably, while the integer loop code has a loop-carried dependency chain (ensuring a minimum execution time), the floating-point code does not depend on it. Furthermore, the floating-point code consists of many mutually independent, short dependency chains.

Where you expect the CPU to execute instructions sequentially, the CPU can instead execute them in parallel.

A quick look at https://agner.org/optimize/instruction_tables.pdf reveals why this parallel execution doesn't work for sqrtss on Nehalem:

instruction   latency   reciprocal throughput
SQRTSS/PS     7-18      7-18

i.e., the instruction cannot be pipelined and only runs on one execution port. In contrast, for movaps, rsqrtss, mulss:

instruction   latency   reciprocal throughput
MOVAPS/D      1         1
RSQRTSS       3         2
MULSS         4         1

The largest reciprocal throughput in the dependency chain is 2 (rsqrtss), so you can expect the code to finish executing one dependency chain every 2 cycles in the steady state. At that point, the execution time of the floating-point part of the benchmarking loop is less than or equal to the loop overhead and overlaps with it, so your naive approach of subtracting the loop overhead leads to nonsensical results.

If you wanted to do this properly, you would ensure that separate loop iterations are dependent on each other, for example by changing your benchmarking loop to

float x = INITIAL_VALUE;
for (i = 0; i < 100000000; i++)
    x = benchmarked_function(x);

Obviously you will not benchmark the same input this way, unless INITIAL_VALUE is a fixed point of benchmarked_function(). However, you can arrange for it to be a fixed point of an expanded function by computing float diff = INITIAL_VALUE - benchmarked_function(INITIAL_VALUE); and then making the loop

float x = INITIAL_VALUE;
for (i = 0; i < 100000000; i++)
    x = diff + benchmarked_function(x);

with relatively minor overhead, though you should then ensure that floating-point errors do not accumulate to significantly change the value passed to benchmarked_function().
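
Putting the pieces together, a self-contained sketch of this approach might look like the following (benchmarked_function and INITIAL_VALUE are the placeholders from above; the clock choice and the printed output are just one way to do it):

#include <chrono>
#include <cstdio>

float benchmarked_function(float x);   // the routine under test, defined elsewhere

int main() {
    const float INITIAL_VALUE = 2.0f;
    // Make INITIAL_VALUE a fixed point of (diff + benchmarked_function(x)).
    const float diff = INITIAL_VALUE - benchmarked_function(INITIAL_VALUE);

    float x = INITIAL_VALUE;
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < 100000000; i++)
        x = diff + benchmarked_function(x);   // loop-carried dependency
    auto t1 = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    // Printing x also keeps the dependent chain observable, so it cannot be discarded.
    std::printf("%.3f ns per iteration (x = %f)\n", ns / 1e8, x);
}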