Can modern x86 CPUs do ideal out of order execution?


My mental model for out-of-order execution is to think of it as a sliding window over the instruction stream: any instruction in the window whose inputs have already been computed can be started immediately, even if there are other instructions before it in the stream, provided CPU resources are available.

However I am trying to understand the impact of ordering between instructions that don't have a dependency on one another. Say I have two sets of three instructions that each form independent dependency chains A,B,C and X,Y,Z. None of these instructions change control flow. Because the instructions in each set form a dependency chain, the instructions within a set can't be executed out of order (C,B,A is invalid), but since the chains are independent they can be interleaved with each other (A,X,B,Y,C,Z is valid). Further assume that all six instructions fit inside the sliding window and are in cache. Will all valid sorts of the six instructions have the same performance or will some be faster than others? With N chains of M instructions?
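For concreteness, here's a hypothetical sketch of what I mean (NASM syntax for x86-64 Linux; the specific instructions, registers and constants are made up, only the dependency structure matters), written in the interleaved A,X,B,Y,C,Z order:

    ; Hypothetical sketch: chain 1 is A -> B -> C through rax,
    ; chain 2 is X -> Y -> Z through rcx, in the A,X,B,Y,C,Z interleaving.
    global _start
    section .text
    _start:
        mov  rax, 1
        mov  rcx, 2
        add  rax, 3           ; A
        add  rcx, 5           ; X  (independent of A)
        imul rax, rax, 7      ; B  (needs A's result)
        imul rcx, rcx, 11     ; Y  (needs X's result)
        add  rax, 13          ; C  (needs B's result)
        add  rcx, 17          ; Z  (needs Y's result)
        xor  edi, edi         ; exit status 0; the results themselves are unused
        mov  eax, 60          ; Linux exit() system call
        syscall

The question is whether writing it as A,B,C,X,Y,Z instead would perform any differently.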

My suspicion is that they won't all be the same, because ideal scheduling seems like it would be complex to do in hardware, but maybe there are good-enough-in-practice heuristics real CPUs use?

Assuming that some of the sorts are faster than others, how can you use performance counters or other tools to detect that your code is running slower than it could due to a bad sort? Is there a way to eyeball that the sort in the assembly for a hot loop is bad, as opposed to the more usual suspects like branch mispredicts and cache misses?

Peter Cordes (accepted answer):

The concept you're asking about is "scheduling", either statically by the compiler (program order of the asm) or dynamically in the CPU (by the "scheduler", aka RS = Reservation Stations).

See How are x86 uops scheduled, exactly? - the short version is oldest-ready first for each execution port, without detecting which uops are part of a long critical-path dependency chain. Good prioritization usually happens naturally anyway after a few iterations of a loop, because the not-yet-executed uops are the ones that are part of a loop-carried dependency chain.

For something like integer a*b*c + d*e, you would want to schedule the imul for a*b before d*e, since it's part of a chain of 2 imuls (3 cycle latency, 1/cycle throughput). If the d*e imul is statically first in program order (and all 5 inputs are ready in registers), it will execute first, so the longer dependency chain can't start until the next cycle, giving a critical path length of 1 (resource conflict) + 3 + 3 (two imuls) + 1 (add) = 8 cycles instead of 7. If the critical-path latency of that whole expression is part of a longer dep chain which is an overall bottleneck (yet somehow we still had all 5 inputs ready at once, or at least d, e, a, and b), then statically scheduling the critical-path uops first would help, avoiding the resource conflict for the one execution port that has an integer multiplier.

(I picked integer multiply for a simpler example because modern x86 CPUs still only have one execution port that can handle it. SIMD shuffles on Intel CPUs are another common example of that, at least for some shuffles that can't run on Ice Lake's port 1, or with 512-bit vector width.)
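As a minimal sketch of that example (NASM syntax for x86-64 Linux; the constants and register choices are just placeholders for illustration), this is the program order that starts the longer imul chain first:

    ; Minimal sketch: a*b*c + d*e with the longer imul chain scheduled first.
    global _start
    section .text
    _start:
        mov  edi, 3           ; a
        mov  esi, 5           ; b
        mov  edx, 7           ; c
        mov  r8d, 11          ; d
        mov  r9d, 13          ; e
        ; Put the first imul of the two-imul chain ahead of d*e, so the single
        ; integer-multiply port starts the critical path first.
        imul rdi, rsi         ; a*b      (start of the longer chain)
        imul r8,  r9          ; d*e      (independent, can tolerate waiting a cycle)
        imul rdi, rdx         ; (a*b)*c
        add  rdi, r8          ; a*b*c + d*e = 248, returned as the exit status
        mov  eax, 60          ; Linux exit(rdi)
        syscall

Swapping the order of the first two imuls gives the 8-cycle version described above (assuming all five inputs really were ready at once).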


uops are assigned to ports during issue/rename/alloc (when they're moved from the in-order front-end to the out-of-order back-end, the RS and ROB = ReOrder Buffer). This is done in parallel for all uops in an issue group (e.g. 4 to 6 on recent x86), based on the number of un-executed uops for each port at the start of the cycle. Scheduling can be sub-optimal in the first couple iterations of a loop; see x86_64 haswell instruction scheduled on already used port instead of unused one - Intel uses some tie-break heuristics to distribute uops to ports when the queue lengths are close, not checking port restrictions between uops in the same cycle.


For block diagrams of CPUs, see


how can you use performance counters or other tools to detect that your code is running slower than it could due to a bad sort?

Hard to detect in general. If you've run the numbers for the expected latency bottleneck of a loop-carried dependency chain but it's actually running slower than that, and there aren't stalls from cache misses or the front-end, then uop scheduling is something to look at.

On Intel CPUs, idq_uops_not_delivered.cycles_fe_was_ok counts cycles where the front-end (IDQ = Instruction Decode Queue) had uops ready to issue, but they didn't issue because the back-end was stalled.


Unrolling to hide latency (by overlapping multiple dep chains instead of having one longer one) is hugely important for higher-latency operations like floating-point math, e.g. dot products or summing arrays.
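For example, here's a minimal sketch of summing an array of doubles with four accumulators (NASM syntax for x86-64 Linux; the array size and data are placeholders). Each accumulator only sees every 4th element, so four addsd dependency chains are in flight instead of one:

    ; Minimal sketch: sum 64 doubles with 4 accumulators to hide addsd latency.
    global _start
    section .data
    arr:    times 64 dq 1.0   ; placeholder data: 64 doubles, all 1.0
    arr_end:
    section .text
    _start:
        lea   rdi, [rel arr]
        lea   rsi, [rel arr_end]
        xorpd xmm0, xmm0      ; four independent accumulators; each addsd chain
        xorpd xmm1, xmm1      ; only sees every 4th element, so four adds can be
        xorpd xmm2, xmm2      ; in flight at once instead of one
        xorpd xmm3, xmm3
    .loop:
        addsd xmm0, [rdi]
        addsd xmm1, [rdi+8]
        addsd xmm2, [rdi+16]
        addsd xmm3, [rdi+24]
        add   rdi, 32
        cmp   rdi, rsi
        jb    .loop
        addsd xmm0, xmm1      ; combine the partial sums once at the end
        addsd xmm2, xmm3
        addsd xmm0, xmm2
        cvttsd2si rdi, xmm0   ; sum = 64.0, returned as the exit status
        mov   eax, 60         ; Linux exit(rdi)
        syscall

Assemble with nasm -felf64 and link with ld; the sum (64) comes back as the process exit status. With a single accumulator the loop is limited by addsd latency; with four, the chains' latencies overlap. (Real code would also use SIMD vectors and typically even more accumulators.)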