I was trying to measure the time for a snippet of code and noticed that the timings were about 50 ns shorter when I ran the program from inside my editor, QtCreator, than when I ran it from a bash shell started in a gnome-terminal. I'm using Ubuntu 20.04 as my OS.
A small program to reproduce my problem:
#include <stdio.h>
#include <time.h>
#include <unistd.h> // for sleep()

struct timespec now() {
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    return now;
}

long interval_ns(struct timespec tick, struct timespec tock) {
    return (tock.tv_sec - tick.tv_sec) * 1000000000L
         + (tock.tv_nsec - tick.tv_nsec);
}

int main() {
    // sleep(1);
    for (size_t i = 0; i < 10; i++) {
        struct timespec tick = now();
        struct timespec tock = now();
        long elapsed = interval_ns(tick, tock);
        printf("It took %ld ns\n", elapsed);
    }
    return 0;
}
Output when running from within QtCreator:
It took 84 ns
It took 20 ns
It took 20 ns
It took 21 ns
It took 21 ns
It took 21 ns
It took 22 ns
It took 21 ns
It took 20 ns
It took 21 ns
And when running from my shell inside the terminal:
$ ./foo
It took 407 ns
It took 136 ns
It took 74 ns
It took 73 ns
It took 77 ns
It took 79 ns
It took 74 ns
It took 81 ns
It took 74 ns
It took 78 ns
Things I've tried which didn't make a difference
- Letting QtCreator start the program in a terminal
- Using rdtsc and rdtscp calls instead of clock_gettime (same relative differences in runtime)
- Clearing the environment from the terminal by running it under env -i
- Starting the program using sh instead of bash
I have verified that the same binary is called in all cases. I have verified that the nice value is 0 for the program in all cases.
The Question
Why does starting the program from my shell make a difference? Any suggestions on what to try?
Update
If I add a sleep(1) call at the beginning of main, both the QtCreator and gnome-terminal/bash invocations report the longer execution times.
If I add a system("ps -H") call at the beginning of main instead (dropping the previously mentioned sleep(1)), both invocations report the short execution times (~20 ns).
The Answer
Just add more iterations to give the CPU time to ramp up to max clock speed. Your "slow" times are with the CPU at its low-power idle clock speed.
QtCreator apparently uses enough CPU time to make this happen before your program runs, or else you're compiling and then running, and the compilation process serves as a warm-up (vs. bash's fork/execve being lighter weight). See "Idiomatic way of performance evaluation?" for more about doing warm-up runs when benchmarking, and also "Why does this delay-loop start to run faster after several iterations with no sleep?".
On my i7-6700k (Skylake) running Linux, increasing the loop iteration count to 1000 is sufficient to get the final iterations running at full clock speed, even after the first couple iterations handling page faults, warming up the iTLB, uop cache, data caches, and so on.
On my system, energy_performance_preference is set to balance_performance, so the hardware P-state governor isn't as aggressive as with performance. Use

grep . /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference

to check, and use sudo to change it.
Even running it under perf stat ./a.out is enough to ramp up to max clock speed very quickly, though; it really doesn't take much. But bash's command parsing after you press return is very cheap; not much CPU work is done before it calls execve and reaches main in your new process.
The printf with line-buffered output is what takes most of the CPU time in your program, BTW. That's why it takes so few iterations to ramp up to speed. E.g. if you run perf stat --all-user -r10 ./a.out, you'll see the user-space core clock cycles per second average only about 0.4 GHz, with the rest of the time spent in the kernel in write system calls.