I've written some code for profiling small functions. At the high level it:
- Sets the thread affinity to only one core and the thread priority to maximum.
Computes statistics from doing the following 100 times:
- Estimate the latency of a function that does nothing.
- Estimate the latency of the test function.
- Subtract the first from the second to remove the cost of doing function-call overhead, thereby roughly getting the cost of the test function's contents.
To estimate the latency of a function, it:
- Invalidates caches (this is difficult to actually do in user-mode, but I allocate and write a buffer the size of the L3 to memory, which should maybe help).
- Yields the thread, so that the profile loop has as-few-as-possible context switches.
- Gets the current time from a
std::chrono::high_resolution_clock
(which seems to compile tosystem_clock
, but). - Runs the profile loop 100,000,000 times, calling the tested function within.
- Gets the current time from a
std::chrono::high_resolution_clock
and subtracts to get latency.
Because at this level, individual instructions matter, at all points we have to write very careful code to ensure that the compiler doesn't elide, inline, cache, or treat-differently the functions. I have manually validated the generated assembly in various test cases, including the one which I present below.
I am getting extremely low (sub-nanosecond) latencies reported in some cases. I have tried everything I can think of to account for this, but cannot find an error.
I am looking for an explanation accounting for this behavior. Why are my profiled functions taking so little time?
Let's take the example of computing a square root for float
.
The function signature is float(*)(float)
, and the empty function is trivial:
empty_function(float):
ret
Let's compute the square root by using the sqrtss
instruction, and by the multiplication-by-reciprocal-square-root hack. I.e., the tested functions are:
sqrt_sseinstr(float):
sqrtss xmm0, xmm0
ret
sqrt_rcpsseinstr(float):
movaps xmm1, xmm0
rsqrtss xmm1, xmm0
mulss xmm0, xmm1
ret
Here's the profile loop. Again, this same code is called with the empty function and with the test functions:
double profile(float):
...
mov rbp,rdi
push rbx
mov ebx, 0x5f5e100
call 1c20 <invalidate_caches()>
call 1110 <sched_yield()>
call 1050 <std::chrono::high_resolution_clock::now()>
mov r12, rax
xchg ax, ax
15b0:
movss xmm0,DWORD PTR [rip+0xba4]
call rbp
sub rbx, 0x1
jne 15b0 <double profile(float)+0x20>
call 1050 <std::chrono::high_resolution_clock::now()>
...
The timing result for sqrt_sseinstr(float)
on my Intel 990X is 3.60±0.13 nanoseconds. At this processor's rated 3.46 GHz, that works out to be 12.45±0.44 cycles. This seems pretty spot-on, given that the docs say the latency of sqrtss
is around 13 cycles (it's not listed for this processor's Nehalem architecture, but it seems likely to also be around 13 cycles).
the timing result for sqrt_rcpsseinstr(float)
is stranger: 0.01±0.07 nanoseconds (or 0.02±0.24 cycles). This is flatly implausible unless another effect is going on.
I thought perhaps the processor is able to hide the latency of the tested function somewhat or perfectly because the tested function uses different instruction ports (i.e. superscalarity is hiding something)? I tried to analyze this by hand, but didn't get very far because I didn't really know what I was doing.
(Note: I cleaned up some of the assembly notation for your convenience. An unedited objdump
of the whole program, which includes several other variants, is here, and I am temporarily hosting the binary here (x86-64 SSE2+, Linux).)
The question, again: Why are some profiled functions producing implausibly small values? If it is a higher-order effect, explain?
The problem is with the basic approach of subtracting out the "latency"1 of an empty function, as described:
The built-in assumption is that the cost of calling a function is X, and if the latency of the work done in the function is Y, then the total cost will be something like
X + Y
.This isn't generally true for any two blocks of work and especially isn't true when when one of them is "calling a function". A more sophisticated view would be that the total time would be somewhere between
min(X, Y)
andX + Y
- but even this is often wrong depending on the details. Still, it's enough of a refinement to explain what is going on here: the cost of the function is not additive with the work being doing in the function: they happen in parallel.The cost of an empty function call is something like 4 to 5 cycles on modern Intel, probably bottlenecked on the front-end throughput for the two taken branches, and possibly by branch and return predictor latency.
However, when you add additional work to an empty function, it generally won't compete for the same resources, and its execution instructions won't depend on the "output" of the call (i.e., the work will form a separate dependency chain), except perhaps in rare cases where the stack pointer is manipulated and the stack engine doesn't remove the dependency.
So essentially the function will take the greater of the time needed for the function call mechanics, or the actual work done by the function. This approximation isn't exact, because some types of work may actually add to the overhead of the function call (e.g., if there are enough instructions for the front end to get through before getting to the
ret
, the total time may increase on top of the 4-5 cycle empty function time, even if the total work is less than that) - but it's a good first order approximation.Your first function takes enough time that the actual work dominates the execution time. The second function is much faster, however, enabling it to "hide under" the existing time taken by the call/ret mechanics.
The solution is simple: duplicate the work within the function N times, so that the work always dominates. N=10 or N=50 or something like that is fine. You have to decide whether you want to test latency, in which case the output of one copy of the work should feed into the next, or throughput, in which case it shouldn't.
On other hand, if you actually want to test the cost of the function call + work, e.g., because that's how you'll be using it in real life, it is likely the results you have gotten is already close to correct: stuff really can be "incrementally free" when it hides behind a function call.
1 I'm putting "latency" in quotes here because it isn't clear whether we should be talking about the latency of
call/ret
or the throughput.call
andret
don't have any explicit outputs (andret
has no input), so it doesn't participate in a classic register-based dependency chain - but it might make sense to think of latency if you consider other hidden architectural components like the instruction pointer. In either case latency of throughput mostly points down to the same thing because allcall
andret
on a thread operate on the same state, so it doesn't make sense to have say "independent" vs "dependent" call chains.