I have a simple program that I compile with -O0 -g -fno-omit-frame-pointer -fno-inline, and which I record with system-wide recording
timeout 3 perf record -a -g -F 99 -- ./program
perf script > out.perf && ../FlameGraph/stackcollapse-perf.pl out.perf > out.folded
In out.folded, I see various incorrect stack traces, even when I add --call-graph dwarf or --call-graph lbr to the unwind strategy. I'm on Ubuntu with an Intel 9900K processor. Here's the program:
#include <cmath>
#include <cstdlib>
#include <iostream>
#include <cstdint>
double a(double aa) {
return sqrt(aa);
}
int main() {
double aa = drand48();
for (size_t i = 0; i < UINT64_MAX; i++) {
if (a(aa) == 0) {
std::cout << std::endl; // Side-effect
}
}
return 0;
}
Here are some example incorrect stack traces where they go straight from main to sqrt and skip over a, with various unwinding strategies:
DWARF:
a_39391_[003]__3819.799248:;__libc_start_call_main;main;__sqrt 91016627
a_39391_[003]__3819.817492:;__libc_start_call_main;main;a 71222327
a_39391_[003]__3819.831741:;__libc_start_call_main;main;__ieee754_sqrt 62644598
a_39391_[003]__3819.844274:;__libc_start_call_main;main;a 58633093
The first and third lines show main calling __sqrt without going through a
and these ones with LBR unwinding:
a_40294_[000]__4306.958241:;_start;__libc_start_main@@GLIBC_2.34;__libc_start_call_main;main;a;_init 69630243
a_40294_[000]__4306.972230:;_start;__libc_start_main@@GLIBC_2.34;__libc_start_call_main;main 63241775
a_40294_[000]__4306.984884:;_start;__libc_start_main@@GLIBC_2.34;__libc_start_call_main;main;a;_init 59031331
a_40294_[000]__4306.996696:;_start;__libc_start_main@@GLIBC_2.34;__libc_start_call_main;main;a 56209420
The second-last line shows a call from main to a to _init rather than sqrt.
FP:
a_40380_[001]__4360.292014:;__libc_start_call_main;a 84184457
a_40380_[001]__4360.308855:;__libc_start_call_main;a 70243465
a_40380_[001]__4360.322906:;__libc_start_call_main;main;__sqrt 63725332
The last line shows __sqrt without a
Here's a snippet from the compiled binary showing that main does indeed call a which in turn calls sqrt:
00000000000011c0 <_Z1ad>:
...
11d2: e8 59 fe ff ff call 1030 <sqrt@plt>
...
00000000000011e0 <main>:
...
1211: e8 aa ff ff ff call 11c0 <_Z1ad>
So neither were inlined or optimized to a jmp tailcall.
Does anyone have any ideas how to get better stack traces? Is it because perf is running async while the stack is being changed while it unwinds it, and if so, is there some tool on Linux that pauses the thread each time it samples it?
This did not reproduce for me, using
perf6.5.13 on Ubuntu:That is the expected result for frame-pointer unwinder (since
sqrtdoesn't use frame pointer).When I use
--call-graph dwarf, I get the correct stack:Same when using
--call-graph lbr.