I have a simple program that I compile with -O0 -g -fno-omit-frame-pointer -fno-inline, and which I record with system-wide recording

timeout 3 perf record -a -g -F 99 -- ./program
perf script > out.perf && ../FlameGraph/stackcollapse-perf.pl out.perf > out.folded

In out.folded, I see various incorrect stack traces, even when I add --call-graph dwarf or --call-graph lbr to the unwind strategy. I'm on Ubuntu with an Intel 9900K processor. Here's the program:

#include <cmath>
#include <cstdlib>
#include <iostream>
#include <cstdint>

double a(double aa) {
  return sqrt(aa);
}

int main() {
  double aa = drand48();
  for (size_t i = 0; i < UINT64_MAX; i++) {
    if (a(aa) == 0) {
      std::cout << std::endl; // Side-effect
    }
  }
  return 0;
}

Here are some example incorrect stack traces where they go straight from main to sqrt and skip over a, with various unwinding strategies:

DWARF:

a_39391_[003]__3819.799248:;__libc_start_call_main;main;__sqrt 91016627
a_39391_[003]__3819.817492:;__libc_start_call_main;main;a 71222327
a_39391_[003]__3819.831741:;__libc_start_call_main;main;__ieee754_sqrt 62644598
a_39391_[003]__3819.844274:;__libc_start_call_main;main;a 58633093

The first and third lines show main calling __sqrt without going through a

and these ones with LBR unwinding:

a_40294_[000]__4306.958241:;_start;__libc_start_main@@GLIBC_2.34;__libc_start_call_main;main;a;_init 69630243
a_40294_[000]__4306.972230:;_start;__libc_start_main@@GLIBC_2.34;__libc_start_call_main;main 63241775
a_40294_[000]__4306.984884:;_start;__libc_start_main@@GLIBC_2.34;__libc_start_call_main;main;a;_init 59031331
a_40294_[000]__4306.996696:;_start;__libc_start_main@@GLIBC_2.34;__libc_start_call_main;main;a 56209420

The second-last line shows a call from main to a to _init rather than sqrt.

FP:

a_40380_[001]__4360.292014:;__libc_start_call_main;a 84184457
a_40380_[001]__4360.308855:;__libc_start_call_main;a 70243465
a_40380_[001]__4360.322906:;__libc_start_call_main;main;__sqrt 63725332

The last line shows __sqrt without a


Here's a snippet from the compiled binary showing that main does indeed call a which in turn calls sqrt:

00000000000011c0 <_Z1ad>:
...
    11d2:   e8 59 fe ff ff          call   1030 <sqrt@plt>
...

00000000000011e0 <main>:
...
    1211:   e8 aa ff ff ff          call   11c0 <_Z1ad>

So neither were inlined or optimized to a jmp tailcall.

Does anyone have any ideas how to get better stack traces? Is it because perf is running async while the stack is being changed while it unwinds it, and if so, is there some tool on Linux that pauses the thread each time it samples it?

1

There are 1 answers

1
Employed Russian On

This did not reproduce for me, using perf 6.5.13 on Ubuntu:

timeout 3  perf record -g ./a.out
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 0.745 MB perf.data (10362 samples) ]

perf script > f.out

cat f.out

a.out   57072 42799.672371:    1085145 cycles:P:
            7fc7fc30fd40 __sqrt_finite@GLIBC_2.15+0x0 (/usr/lib/x86_64-linux-gnu/libm.so.6)
            563f3d7f21ae main+0x2e (/tmp/a.out)
            7fc7fbe456ca __libc_start_call_main+0x7a (/usr/lib/x86_64-linux-gnu/libc.so.6)

That is the expected result for frame-pointer unwinder (since sqrt doesn't use frame pointer).

When I use --call-graph dwarf, I get the correct stack:

timeout 3  perf record -g --call-graph dwarf ./a.out &&
perf script > f.out

a.out   57391 43133.029241:     288559 cycles:P:
            7f23dfa8547a __sqrt+0xa (/usr/lib/x86_64-linux-gnu/libm.so.6)
            56112adfd173 a(double)+0x1a (/tmp/a.out)
            56112adfd1ad main+0x2d (/tmp/a.out)
            7f23df6456c9 __libc_start_call_main+0x79 (/usr/lib/x86_64-linux-gnu/libc.so.6)
            7f23df645784 __libc_start_main_impl+0x84 (inlined)
            56112adfd090 _start+0x20 (/tmp/a.out)

Same when using --call-graph lbr.