I am trying to profile with libunwind (using Linux perf). With perf top monitoring the target process, I get this assembly time-cost screen:
0.19 │ mov %rcx,0x18(%rsp)
│ trace_lookup():
1.54 │ mov 0x8(%r9),%rcx
│ _ULx86_64_tdep_trace():
0.52 │ and $0x1,%edx
0.57 │ mov %r14d,0xc(%rsp)
0.40 │ mov 0x78(%rsp),%r10
1.24 │ sub %rdx,%r15
│ trace_lookup():
0.35 │ shl %cl,%r12d
│ _ULx86_64_tdep_trace():
2.18 │ mov 0x90(%rsp),%r8
│ trace_lookup():
0.46 │ imul %r15,%r13
│ _ULx86_64_tdep_trace():
0.59 │ mov %r15,0x88(%rsp)
│ trace_lookup():
0.50 │ lea -0x1(%r12),%rdx
1.22 │ shr $0x2b,%r13
0.37 │ and %r13,%rdx
0.57 │177: mov %rdx,%rbp
0.43 │ shl $0x4,%rbp
1.33 │ add %rdi,%rbp
0.49 │ mov 0x0(%rbp),%rsi
24.40 │ cmp %rsi,%r15
│ ↓ jne 420
│ _ULx86_64_tdep_trace():
2.10 │18e: movzbl 0x8(%rbp),%edx
3.68 │ test $0x8,%dl
│ ↓ jne 370
1.27 │ mov %edx,%eax
0.06 │ shl $0x5,%eax
0.73 │ sar $0x5,%al
1.70 │ cmp $0xfe,%al
│ ↓ je 380
0.01 │ ↓ jle 2f0
0.01 │ cmp $0xff,%al
│ ↓ je 3a0
0.02 │ cmp $0x1,%al
│ ↓ jne 298
0.01 │ and $0x10,%edx
│ movl $0x1,0x10(%rsp)
│ movl $0x1,0x1c8(%rbx)
0.00 │ ↓ je 393
The corresponding source code is here: trace_lookup source code. If I read it correctly, the source line corresponding to this hot-path cmp instruction is line 296. Why is this line so slow, and why does it cost most of the time?
The instruction

cmp %rsi,%r15

is marked as having huge overhead because it waits for the data being loaded from cache or memory by the

mov 0x0(%rbp),%rsi

instruction. There is probably an L1 or even an L2 cache miss on that load.
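To check the cache-miss theory, you can count or sample on cache events instead of cycles. A sketch of commands, assuming a reasonably recent perf (generic event names vary by kernel and CPU model; run perf list to see what your machine actually supports):

# count misses at several cache levels over the whole run
perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses -p <pid>

# sample on last-level-cache misses to see which instructions wait on RAM
perf top -e LLC-load-misses -p <pid>

On CPUs with memory-access sampling (Intel PEBS, for example), perf mem record and perf mem report can additionally show which memory level (L1, L2, L3 or RAM) served each sampled load.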
For the code fragment shown, around 24% of the profiling events of the current function are attributed to the cmp instruction. The default sampling event is "cycles" (a hardware event counting CPU clock cycles) or "cpu-clock" (a software event counting linear time), so around 24% of the sampling interrupts that hit this function were reported at the instruction address of this cmp.

With modern out-of-order CPUs there is a possible systematic skew in profiling: cost may be reported not for the instruction that actually ran slowly, but for the instruction that could not finish its execution (retire) quickly. This cmp+jne pair (a fused uop) changes the program's instruction flow if the %rsi register value is not equal to the %r15 register value. In ancient times such an instruction would just read two registers and compare their values, which is fast and should not take 1/4 of the function's execution time. But on a modern CPU, registers are not just 32- or 64-bit places to store a value; they carry hidden flags (or renaming state) used by the out-of-order engine. In your example, there was
mov 0x0(%rbp),%rsi
which changed the %rsi register. This instruction is a load from memory at address *%rbp. The CPU starts this load in the cache/memory subsystem, marks %rsi as "load pending from memory", and continues executing the following instructions. There is some chance the next instructions will not require the result of that load, which takes time (for example, on Haswell: 4 CPU cycles for an L1 hit, 12 for an L2 hit, 36-66 for an L3 hit, and an additional 50-100 ns for a cache miss served from RAM). But in your case the next instruction was the cmp+jne reading %rsi, and it cannot finish until the data from memory is written into %rsi (the CPU may stall in the middle of the cmp+jne or restart it many times). So the cmp shows 24% overhead because that mov missed the nearest caches. With more advanced counters you can estimate which cache level it missed and which cache/memory layer served the request most often.

With such a short asm fragment it can be hard to find the corresponding line in the source code of trace_lookup, or to find which value was not in L1/L2 cache and why. You should try to write a shorter reproducible example.
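If it helps, here is a minimal sketch of such a reproducer (the file name, sizes and loop counts are made up for illustration; it assumes glibc, where RAND_MAX is large enough for the shuffle). It chases pointers through a buffer far larger than the last-level cache, so almost every load misses, and in the annotated profile most cycles tend to land on the comparison that consumes the loaded value rather than on the mov that issued the load:

/* miss_demo.c - hypothetical reproducer: a chain of dependent loads that miss all caches */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main(void)
{
    size_t n = (size_t)1 << 25;                /* 32M entries * 8 B = 256 MiB, much larger than LLC */
    uint64_t *next = malloc(n * sizeof *next);
    if (!next)
        return 1;

    for (size_t i = 0; i < n; i++)
        next[i] = i;

    /* Sattolo's algorithm: shuffle the array into one big cycle, so every
       load depends on the previous one and the prefetcher cannot help.
       Assumes RAND_MAX >= n, which holds for glibc. */
    srand(1);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        uint64_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    /* Pointer chase: the comparison consumes the just-loaded value, so the
       sampled cost of each cache miss lands on the cmp+jne, as in your trace. */
    uint64_t idx = 0, hits = 0, sentinel = n / 2;
    for (size_t step = 0; step < 50u * 1000 * 1000; step++) {
        idx = next[idx];                       /* dependent load, usually an LLC miss */
        if (idx == sentinel)                   /* cmp on the loaded value */
            hits++;
    }
    printf("hits=%llu\n", (unsigned long long)hits);
    free(next);
    return 0;
}

Build it with optimization and debug info, then profile:

gcc -O2 -g miss_demo.c -o miss_demo
perf record ./miss_demo
perf report        # press 'a' on main to see the per-instruction annotation

The compare feeding the conditional branch should collect the bulk of the samples, even though the real cost is the memory load right before it.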