I have a simple program that I compile with -O0 -g -fno-omit-frame-pointer -fno-inline
, and which I record with system-wide recording
timeout 3 perf record -a -g -F 99 -- ./program
perf script > out.perf && ../FlameGraph/stackcollapse-perf.pl out.perf > out.folded
In out.folded
, I see various incorrect stack traces, even when I add --call-graph dwarf
or --call-graph lbr
to the unwind strategy. I'm on Ubuntu with an Intel 9900K processor. Here's the program:
#include <cmath>
#include <cstdlib>
#include <iostream>
#include <cstdint>
double a(double aa) {
return sqrt(aa);
}
int main() {
double aa = drand48();
for (size_t i = 0; i < UINT64_MAX; i++) {
if (a(aa) == 0) {
std::cout << std::endl; // Side-effect
}
}
return 0;
}
Here are some example incorrect stack traces where they go straight from main
to sqrt
and skip over a
, with various unwinding strategies:
DWARF:
a_39391_[003]__3819.799248:;__libc_start_call_main;main;__sqrt 91016627
a_39391_[003]__3819.817492:;__libc_start_call_main;main;a 71222327
a_39391_[003]__3819.831741:;__libc_start_call_main;main;__ieee754_sqrt 62644598
a_39391_[003]__3819.844274:;__libc_start_call_main;main;a 58633093
The first and third lines show main
calling __sqrt
without going through a
and these ones with LBR unwinding:
a_40294_[000]__4306.958241:;_start;__libc_start_main@@GLIBC_2.34;__libc_start_call_main;main;a;_init 69630243
a_40294_[000]__4306.972230:;_start;__libc_start_main@@GLIBC_2.34;__libc_start_call_main;main 63241775
a_40294_[000]__4306.984884:;_start;__libc_start_main@@GLIBC_2.34;__libc_start_call_main;main;a;_init 59031331
a_40294_[000]__4306.996696:;_start;__libc_start_main@@GLIBC_2.34;__libc_start_call_main;main;a 56209420
The second-last line shows a call from main
to a
to _init
rather than sqrt
.
FP:
a_40380_[001]__4360.292014:;__libc_start_call_main;a 84184457
a_40380_[001]__4360.308855:;__libc_start_call_main;a 70243465
a_40380_[001]__4360.322906:;__libc_start_call_main;main;__sqrt 63725332
The last line shows __sqrt
without a
Here's a snippet from the compiled binary showing that main
does indeed call a
which in turn calls sqrt
:
00000000000011c0 <_Z1ad>:
...
11d2: e8 59 fe ff ff call 1030 <sqrt@plt>
...
00000000000011e0 <main>:
...
1211: e8 aa ff ff ff call 11c0 <_Z1ad>
So neither were inlined or optimized to a jmp
tailcall.
Does anyone have any ideas how to get better stack traces? Is it because perf is running async while the stack is being changed while it unwinds it, and if so, is there some tool on Linux that pauses the thread each time it samples it?
This did not reproduce for me, using
perf
6.5.13 on Ubuntu:That is the expected result for frame-pointer unwinder (since
sqrt
doesn't use frame pointer).When I use
--call-graph dwarf
, I get the correct stack:Same when using
--call-graph lbr
.