7703.572978 task-clock (msec) # 0.996 CPUs utilized
1,575 context-switches # 0.204 K/sec
18 cpu-migrations # 0.002 K/sec
65,975 page-faults # 0.009 M/sec
25,719,058,036 cycles # 3.340 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
12,323,855,909 instructions # 0.48 insns per cycle
2,337,484,352 branches # 303.429 M/sec
200,227,908 branch-misses # 8.57% of all branches
3,167,237,318 L1-dcache-loads # 411.139 M/sec
454,416,650 L1-dcache-load-misses # 14.35% of all L1-dcache hits
326,345,389 LLC-loads # 42.363 M/sec
<not supported> LLC-load-misses:HG
I profiled my C code, written with libCCC, using perf stat. It sorts a doubly linked list, which involves a lot of list traversal, so it may request data located at many different memory addresses. However, modern processors support multi-stage pipelining, branch prediction, and out-of-order execution, which should increase the average number of instructions executed in a given time interval. Yet according to the profile, only about one instruction is executed every two cycles (0.48 instructions per cycle). What could cause this?
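Your CPU is just waiting for memory, that's all. Traversing a linked list is pointer chasing: the address of each node comes from the previous node, so when a load misses the cache, out-of-order execution has little independent work to overlap with it and the pipeline stalls. Your counters agree: 14.35% of the L1-dcache loads miss. It's precisely this effect that justifies HyperThreading: a core keeps the state of two threads, so it can execute instructions from one while the other thread is waiting on memory.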
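To see the effect in isolation, here is a minimal sketch that traverses the same list twice: once with nodes linked in address order, once linked in shuffled order. Everything in it is an assumption made for illustration: the `struct node` layout (padded to roughly one cache line), the function names, and the xorshift generator. None of it is libCCC's API or your actual code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical node type for this demo, not libCCC's layout.
   The padding makes each node about 64 bytes on typical 64-bit targets. */
struct node {
    struct node *next;
    long value;
    char pad[48];
};

/* Small deterministic PRNG (xorshift64) so the shuffle is repeatable. */
static uint64_t rng_state = 88172645463325252ULL;

uint64_t xorshift64(void)
{
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 7;
    rng_state ^= rng_state << 17;
    return rng_state;
}

/* Link n preallocated nodes into one list and return its head.
   shuffled == 0: list order matches address order.
   shuffled != 0: consecutive list elements sit at unrelated addresses. */
struct node *build_list(struct node *nodes, size_t *order, size_t n, int shuffled)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        order[i] = i;
        nodes[i].value = (long)i;
    }
    if (shuffled) {
        /* Fisher-Yates shuffle of the visiting order. */
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)(xorshift64() % (i + 1));
            size_t tmp = order[i];
            order[i] = order[j];
            order[j] = tmp;
        }
    }
    for (size_t i = 0; i + 1 < n; i++)
        nodes[order[i]].next = &nodes[order[i + 1]];
    nodes[order[n - 1]].next = NULL;
    return &nodes[order[0]];
}

/* Pointer chasing: each load address depends on the previous load. */
long sum_list(const struct node *head)
{
    long sum = 0;
    for (const struct node *p = head; p != NULL; p = p->next)
        sum += p->value;
    return sum;
}

/* Build a list of n nodes, traverse it once, store the sum in *sum_out.
   Returns the CPU seconds spent in the traversal, or -1.0 if allocation fails. */
double time_traversal(size_t n, int shuffled, long *sum_out)
{
    struct node *nodes = malloc(n * sizeof *nodes);
    size_t *order = malloc(n * sizeof *order);
    if (nodes == NULL || order == NULL) {
        free(nodes);
        free(order);
        return -1.0;
    }
    struct node *head = build_list(nodes, order, n, shuffled);
    clock_t t0 = clock();
    *sum_out = sum_list(head);
    clock_t t1 = clock();
    free(order);
    free(nodes);
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Call `time_traversal(n, 0, &sum)` and `time_traversal(n, 1, &sum)` with a large `n`, for example `(size_t)1 << 22`, which needs about 256 MiB for the nodes. Both calls execute the same instructions and produce the same sum, so any difference in time comes from memory access alone. The shuffled run should typically be much slower, though the ratio depends on your cache sizes and hardware prefetcher. Running each variant under perf stat should show the same gap in instructions per cycle.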