I understand that the term Cycles Per Instruction (CPI) closely relates to the superscalarity of the processor, a term which I have not fully understood. According to Wikipedia, "...a superscalar processor can execute more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to different execution units on the processor". In the same article, there is a hint that superscalarity is not necessarily related to instruction pipelining, a concept with which I'm fairly familiar.
Now, let's get concrete by taking the example of `_mm256_shuffle_ps`, which, according to https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#avxnewtechs=AVX,AVX2,FMA, has a CPI of 0.5 for the Alder Lake micro-architecture.
Questions:
- Can I assume that there are exactly 2 identical execution units which execute `_mm256_shuffle_ps` in all Alder Lake chips?
- How can a programmer know which separate instructions involve the same execution units?
- If there are different numbers of execution units for different instructions (such as `_mm256_shuffle_ps`), how does the statement "X is a 4-way superscalar processor" make sense, seeing as no one number could describe the distinct multiplicities of each execution unit?
Thanks in advance for the transfer of knowledge.
Superscalar is usually a term you'd apply to older CPUs, e.g. the original Pentium. Back in those days, you'd have two separate pipes, the U (primary) and V (secondary) pipe, which would allow you to potentially dispatch two instructions at the same time (i.e. it had 2 execution units). It was effectively a way of getting slightly better performance from an in-order processor core, although that came with caveats (e.g. pipeline bubbles could be an issue).
These days processors tend to use Out of Order Execution (OOOE) backed by a larger number of execution units. Alder Lake CPUs have 12 execution units, however those execution units tend to be specialised to some extent, e.g. load/store, pointer arithmetic, SIMD FPU units, etc. That's why you won't see 12 execution units capable of performing a shuffle. It can dispatch 12 micro-ops per cycle, but those ops can't all be the same instruction.
No, you can't assume that. You can assume that there are two execution units which are capable of executing `_mm256_shuffle_ps`, but that doesn't mean those two units are identical. For example, we can see there are 3 execution units that can work on 256-bit YMM registers, and we can see from the instruction timings that all 3 can perform `_mm_add_epi32`. However, only 2 can perform `_mm_shuffle_ps`, and only 1 can perform `_mm_div_ps`, so they are clearly not the same.
Unless the manufacturer explicitly states the capabilities of each execution port (sometimes you'll find that info in the technical manual for the CPU), you're pretty much limited to making educated guesses (as with e.g. the Apple M1).
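One such educated guess can be sketched from the published throughput figures alone: a CPI of 0.5 means up to two of those instructions can start per cycle, so at least two units must be able to run it. The helper below is a hypothetical illustration of that arithmetic (the name `estimate_ports` is an assumption, not part of any Intel tool), and the result is only a lower bound inferred from timings, not a documented port count.

```c
#include <assert.h>

/* Hypothetical helper: estimate the minimum number of execution units
   that can start a given instruction each cycle, based on its published
   reciprocal throughput (the "CPI" column in the intrinsics guide).
   This is an inference from timings, not a vendor-documented value. */
static int estimate_ports(double cpi)
{
    assert(cpi > 0.0);
    if (cpi >= 1.0)
        return 1;                      /* at most one instruction starts per cycle */
    return (int)(1.0 / cpi + 0.5);     /* round to nearest: 0.5 -> 2, 0.33 -> 3 */
}
```

A figure of 1.0 or more says nothing beyond "at least one unit", which is why the function never returns less than 1.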
Modern Intel processors are not superscalar, therefore describing them as such makes no sense at all.
Alder Lake is able to dispatch 12 instructions per clock, using Out-Of-Order Execution. The types of instruction the execution units can handle are typically geared up to cover a range of common cases. For example, consider this code:
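As an illustrative stand-in, here is a minimal sketch of a loop that adds two float arrays four lanes at a time with `_mm_add_ps`. The function name `add_arrays` and the requirement that `n` be a multiple of 4 are assumptions made for the example, and the comments on each line are a rough guide to the kind of work involved rather than an exact micro-op breakdown.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

/* Hypothetical example: out[i] = a[i] + b[i] for n floats.
   Assumes n is a multiple of 4 to keep the sketch short. */
static void add_arrays(float *out, const float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {      /* counter update, compare, branch */
        __m128 va = _mm_loadu_ps(a + i);     /* address calculation + load */
        __m128 vb = _mm_loadu_ps(b + i);     /* address calculation + load */
        __m128 vs = _mm_add_ps(va, vb);      /* the only SIMD arithmetic here */
        _mm_storeu_ps(out + i, vs);          /* address calculation + store */
    }
}
```

In each iteration only one operation is SIMD arithmetic; the rest is loads, a store, address updates and loop control.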
Providing 12 execution units that are all capable of executing an `_mm_add_ps` instruction doesn't really make any sense. It makes more sense to balance the number of SIMD execution units with all those other common tasks (e.g. address manipulation, looping, etc).