Why is IPC lower than one on a modern processor?

    7703.572978 task-clock (msec)         #    0.996 CPUs utilized          
          1,575 context-switches          #    0.204 K/sec                  
             18 cpu-migrations            #    0.002 K/sec                  
         65,975 page-faults               #    0.009 M/sec                  
 25,719,058,036 cycles                    #    3.340 GHz                    
<not supported> stalled-cycles-frontend 
<not supported> stalled-cycles-backend  
 12,323,855,909 instructions              #    0.48  insns per cycle        
  2,337,484,352 branches                  #  303.429 M/sec                  
    200,227,908 branch-misses             #    8.57% of all branches        
  3,167,237,318 L1-dcache-loads           #  411.139 M/sec                  
    454,416,650 L1-dcache-load-misses     #   14.35% of all L1-dcache hits  
    326,345,389 LLC-loads                 #   42.363 M/sec                  
<not supported> LLC-load-misses:HG      
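
(For reference, the 0.48 figure is simply instructions divided by cycles: 12,323,855,909 / 25,719,058,036 ≈ 0.48 insns per cycle, and 454,416,650 / 3,167,237,318 ≈ 14.3% of L1-dcache loads miss.)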

I profiled my C code, which uses libCCC, with perf stat. It sorts a doubly linked list, which involves a lot of list traversal, so it ends up touching data scattered across many different memory addresses. However, modern processors have multi-stage pipelines, branch prediction and out-of-order execution, all of which should raise the average number of instructions executed per unit of time. Yet according to the numbers above, only about one instruction is completed every two cycles. What could be the reasons for this?
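
To make the access pattern concrete, the traversal boils down to pointer chasing of roughly this shape (a simplified sketch with a made-up node type, not the real libCCC structures):

    #include <stddef.h>

    /* Made-up node type standing in for the real list structure. */
    struct node {
        struct node *prev;
        struct node *next;
        int          key;
    };

    /* Every iteration dereferences a pointer that was itself loaded by the
     * previous iteration, so the loads form a serial dependency chain. When
     * the nodes are scattered over the heap, most of these loads miss the
     * caches, and out-of-order execution cannot start fetching node i+1
     * before the load of node i has come back from memory. */
    size_t count_keys_above(const struct node *head, int threshold)
    {
        size_t n = 0;
        for (const struct node *p = head; p != NULL; p = p->next)
            if (p->key > threshold)
                n++;
        return n;
    }

Branch prediction handles the loop branch easily here, but it cannot guess the address of the next node.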


There are 2 answers

MSalters

Your CPU is just waiting for memory, that's all. It's precisely this effect that justifies HyperThreading: modern CPUs can switch quickly enough that one core can work on two threads, executing instructions from one while the other thread is waiting on memory.
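
If you want to see this for yourself, here is a rough sketch (my own toy, nothing to do with your libCCC code, and error checking is omitted): it walks one randomly linked list on one thread, then two independent lists on two threads. Build it with something like gcc -O2 smt_walk.c -o smt_walk -lpthread, then pin the process to two logical CPUs that share a physical core, e.g. taskset -c 0,4 ./smt_walk (use lscpu -e to see which logical CPUs are siblings on your machine). When both threads share one core, the two-thread run usually takes much less than twice the one-thread run, because each thread executes while the other is waiting on a cache miss.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define NODES (1UL << 21)            /* ~2M nodes x 64 bytes = 128 MiB per list */

    struct node { struct node *next; long pad[7]; };   /* pad: one node per cache line */

    /* Link the nodes of one pool in shuffled order, so following ->next
     * jumps around memory and most hops cost a cache miss. */
    static struct node *build_shuffled_list(unsigned seed)
    {
        struct node *pool = malloc(NODES * sizeof *pool);
        size_t *order = malloc(NODES * sizeof *order);
        struct node *head;

        srand(seed);
        for (size_t i = 0; i < NODES; i++)
            order[i] = i;
        for (size_t i = NODES - 1; i > 0; i--) {        /* Fisher-Yates shuffle */
            size_t j = (size_t)rand() % (i + 1);
            size_t t = order[i]; order[i] = order[j]; order[j] = t;
        }
        for (size_t i = 0; i + 1 < NODES; i++)
            pool[order[i]].next = &pool[order[i + 1]];
        pool[order[NODES - 1]].next = NULL;
        head = &pool[order[0]];
        free(order);
        return head;
    }

    static void *walk(void *head)                       /* the memory-bound "work" */
    {
        volatile size_t count = 0;                      /* volatile: keep the loop honest */
        for (struct node *p = head; p; p = p->next)
            count++;
        (void)count;
        return NULL;
    }

    static double elapsed(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    int main(void)
    {
        struct node *a = build_shuffled_list(1), *b = build_shuffled_list(2);
        struct timespec t0, t1;
        pthread_t th;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        walk(a);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("one list,  one thread : %.3f s\n", elapsed(t0, t1));

        clock_gettime(CLOCK_MONOTONIC, &t0);
        pthread_create(&th, NULL, walk, b);             /* second thread walks list b */
        walk(a);                                        /* main thread walks list a  */
        pthread_join(th, NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("two lists, two threads: %.3f s\n", elapsed(t0, t1));
        return 0;
    }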

old_timer

Just because you have a pipeline in no way means it is always used efficiently. Like adding a cache, it can cost you performance instead of improving it, so there is no reason to assume that a "modern" processor automatically delivers some magical level of performance. The problem starts with your code: what your app does, how you write it, what language you use, and what compiler and compiler options you use are the first factors in performance; then the platform, the RAM, the cache, the disk, the operating system and so on all play a role.

It has been a lot of years since the processor itself was the bottleneck. If you could feed the processor as fast as it can consume, if the instruction sequences were pipeline friendly, and if there were basically no data accesses, then sure, it could scream; in reality the processor spends a lot of time waiting to be fed data or instructions. You can write very simple benchmarks, even as small as a two-instruction loop, and watch the performance vary widely due to factors such as how the processor fetches, alignment causing it to fetch more than it needs, or the code landing in a sensitive spot of a cache line and causing extra cache-line reads. That is with just a couple of instructions; think about how many instructions your app uses, then mix data accesses in with that. Benchmarking is a somewhat useless exercise except to show that the numbers do not mean much on their own and that you can manipulate the results to make something look good or bad.

In your case you are simply not writing your code in a way that lets the compiler make it perform. More likely the problem is your data accesses and how you have structured your data: maybe using byte arrays instead of wider types, using structures with different-sized members, or, worst of all, striding through your data at power-of-two multiples of the cache-line size.
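
Here is a made-up example of the power-of-two problem (nothing to do with your actual data, just an illustration). Both runs read the same number of bytes, one byte per cache line, but in the first run the lines sit exactly 4096 bytes apart, so they all compete for the same few cache sets, while in the second run the spacing is 4096 plus one cache line, so they spread over the whole cache. On a typical machine the power-of-two stride is noticeably slower; the exact ratio depends on your cache hierarchy. Build with gcc -O2 (the asm barrier is a gcc/clang extension).

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define SLOTS  512                 /* 512 cache lines = 32 KiB of data touched */
    #define PASSES 200000              /* revisit them many times */

    static double sweep(const char *base, size_t stride)
    {
        struct timespec t0, t1;
        long sum = 0;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int p = 0; p < PASSES; p++) {
            for (size_t i = 0; i < SLOTS; i++)
                sum += base[i * stride];            /* one byte per cache line */
            asm volatile("" ::: "memory");          /* barrier: don't let gcc collapse the passes */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        volatile long sink = sum;                   /* keep the loads from being optimized away */
        (void)sink;
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void)
    {
        char *buf = malloc((size_t)SLOTS * (4096 + 64));
        for (size_t i = 0; i < SLOTS; i++) {        /* initialize the bytes we will read */
            buf[i * 4096] = 1;
            buf[i * (4096 + 64)] = 1;
        }
        printf("stride 4096 (power of two)  : %.3f s\n", sweep(buf, 4096));
        printf("stride 4096 + one cache line: %.3f s\n", sweep(buf, 4096 + 64));
        free(buf);
        return 0;
    }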

Start tweaking your data or code. Rearrange the support functions in the source: if you have functions a, b, c defined in that order, change the order to a, c, b and see whether performance changes at all. Add a dummy global function near the beginning of your project (or go into the bootstrap and add or remove nops), and put one, then two, then three, then four nops or other similarly safe instructions in that dummy function. Change the size of your variables, change the size of your array elements if possible, and rearrange the order of the members defined in your structures. Any one of these, plus many many more things, may affect your performance results.
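
To make the structure-layout point concrete, here is a toy (not your types, just an illustration): the same four members declared in two different orders give structures of different sizes because of alignment padding, so a different number of elements fits in each cache line you pull in.

    #include <stdio.h>

    /* Members in "as they occurred to me" order: the compiler inserts
     * padding to keep each member naturally aligned. */
    struct item_unordered {
        char   flag;      /* 1 byte, then 7 bytes of padding before value */
        double value;     /* 8 bytes */
        short  id;        /* 2 bytes, then 2 bytes of padding before count */
        int    count;     /* 4 bytes */
    };                    /* typically 24 bytes on x86-64 */

    /* Same members, widest first: the padding disappears. */
    struct item_ordered {
        double value;     /* 8 bytes */
        int    count;     /* 4 bytes */
        short  id;        /* 2 bytes */
        char   flag;      /* 1 byte + 1 byte of tail padding */
    };                    /* typically 16 bytes on x86-64 */

    int main(void)
    {
        printf("unordered: %zu bytes, %zu per 64-byte cache line\n",
               sizeof(struct item_unordered), 64 / sizeof(struct item_unordered));
        printf("ordered  : %zu bytes, %zu per 64-byte cache line\n",
               sizeof(struct item_ordered), 64 / sizeof(struct item_ordered));
        return 0;
    }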

Bottom line: having a modern processor says nothing about the clocks-per-instruction average of any random program. If you want many instructions per clock you have to work for it. And there is no reason to assume that once you hit a sweet spot on your computer, the program will perform the same on another compatible computer; it could be dog slow on other machines of a similar class and fast on only one.