Relation between CPI and number of execution units when looking at SIMD intrinsics

328 views Asked by At

I understand that the term Cycle Per Instruction closely relates to the superscalarity of the processor, a term which I have not fully understood. According to Wikipedia, "...a superscalar processor can execute more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to different execution units on the processor". In the same article, there is a hint that superscalarity is not necessarily related to instruction pipelining, a concept with which I'm fairly familiar.

Now, let's get concrete by taking the example of _mm256_shuffle_ps, which, according to https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#avxnewtechs=AVX,AVX2,FMA, has a CPI of 0.5 for the Alder Lake micro-architecture.

Questions:

  1. Can I assume that there are exactly 2 identical execution units which execute _mm256_shuffle_ps in all Alder Lake chips?
  2. How can a programmer know which separate instructions involve the same executions units?
  3. If there are different numbers of execution units for different instructions (such as _mm256_shuffle_ps), how does the statement "X is a 4-way superscalar processor" make sense, seeing as no one number could describe the distinct multiplicities of each execution unit?

Thanks in advance for the transfer of knowledge.

1

There are 1 answers

5
robthebloke On

Superscalar is usually a term you'd apply to CPU's of old, e.g. the original pentium. Back in those days, you'd have two seperate pipes, the U (primary) and V (secondary) pipe, which would allow you to potentially dispatch two instructions at the same time (i.e. it had 2 execution units). It was effectively a way of getting slightly better performance from an in-order processor core (although that came with caveats - e.g. pipeline bubbles could be an issue)

These days processors tend to use Out of Order Execution (OOOE) backed by a larger number of execution units. Alder Lake CPU's have 12 execution units, however those execution units tend to be specialised to some extent - e.g. load/store, pointer arithmetic, SIMD FPU units, etc. That's why you won't see 12 execution units capable of performing a shuffle. It can dispatch 12 micro-ops per cycle, but those ops can't all be the same instruction.

Can I assume that there are exactly 2 identical execution units which execute _mm256_shuffle_ps in all Alder Lake chips?

No, you can't assume that. You can assume that there are two execution units which are capable of executing _mm256_shuffle_ps, but that doesn't mean those two units are identical. For example, we can see there are 3 execution units that can work on 256bit YMM registers, and we can see from the instruction timings that all 3 can perform _mm_add_epi32. However, only 2 can perform _mm_shuffle_ps, and only 1 can perform _mm_div_ps, so they are clearly not the same....

How can a programmer know which separate instructions involve the same executions units?

Unless the manufacturer explicitly states the capabilities of each execution port (sometimes you'll find that info in the technical manual for the CPU), you're pretty much limited to making educated guesses (e.g. the Apple M1)

If there are different numbers of execution units for different instructions (such as _mm256_shuffle_ps), how does the statement "X is a 4-way superscalar processor" make sense, seeing as no one number could describe the distinct multiplicities of each execution unit?

Modern Intel processors are not superscalar, therefore describing them as such makes no sense at all.

Alder Lake is able to dispatch 12 instructions per clock, using Out-Of-Order-Execution. The types of instruction the execution units can handle, is typically geared up to cover a range of common cases. For example, consider this code:

void func(float* r, float* a, float* b) {

   // basic integer ops: increment and less-than
   for(int i = 0; i < 128; ++i) {

      // 2 address manipulation instructions
      float* addr_a = a + i * 4;
      float* addr_b = b + i * 4;

      // 2 load instructions
      __m128 A = _mm_load_ps(addr_a);
      __m128 B = _mm_load_ps(addr_b);

      // an addition
      __m128 R = _mm_add_ps(A, B);

      // another address manipulation op
      float* addr_r = r + i * 4;

      // a store instruction
      _mm_store_ps(addr_r, R);
   }
}

Providing 12 execution units that are all capable of executing an _mm_add_ps instruction doesn't really make any sense. It makes more sense to balance the number of SIMD execution units with all those other common tasks (e.g. address manipulation, looping, etc).