Latency and number of FMA units


I'm trying to implement the convolution algorithm described in this paper. The authors state that the number of independent elements X processed by FMA instructions is bounded below by the latency of the FMA instructions and bounded above by the number and width of the vector registers, as follows:

N_vec * N_fma * L_fma < X < N_reg * N_vec

Where:

  • N_vec: Number of elements contained in a vector register
  • N_reg: Number of vector registers
  • N_fma: Number of FMA units
  • L_fma: Latency of one FMA instruction
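One way to read the lower bound: with N_fma units that each take L_fma cycles per result, about N_fma * L_fma independent accumulation chains have to be in flight to keep the units busy, and each chain covers N_vec elements. The sketch below is only an illustration of that idea, not the paper's kernel. It uses scalar code so it stays portable, where each `acc[j]` stands in for one vector accumulator register; `NUM_CHAINS` and `dot_independent_chains` are hypothetical names, and the value 8 assumes N_fma = 2 and L_fma = 4.

```c
#include <stddef.h>

/* Assumption for illustration: N_fma * L_fma = 2 * 4 = 8 independent chains. */
#define NUM_CHAINS 8

/* Dot product that keeps NUM_CHAINS independent accumulators, so that
 * consecutive multiply-adds do not wait on each other's results.
 * In a real AVX2 kernel each acc[j] would be a 256-bit register. */
float dot_independent_chains(const float *a, const float *b, size_t n)
{
    float acc[NUM_CHAINS] = {0.0f};
    size_t i = 0;

    /* Main loop: each iteration feeds every chain exactly once. */
    for (; i + NUM_CHAINS <= n; i += NUM_CHAINS) {
        for (size_t j = 0; j < NUM_CHAINS; ++j) {
            acc[j] += a[i + j] * b[i + j];
        }
    }

    /* Combine the partial sums from all chains. */
    float sum = 0.0f;
    for (size_t j = 0; j < NUM_CHAINS; ++j) {
        sum += acc[j];
    }

    /* Remainder elements that did not fill a full group of chains. */
    for (; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

With a single accumulator instead, every multiply-add would depend on the previous one, and throughput would be limited by latency rather than by the number of FMA units.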

I'm using an Intel Core i7-10510U, and I set the parameters as follows:

NAME    VALUE
N_vec   8
N_reg   16
N_fma   2
L_fma   4 (cycles)

My reasoning is as follows. I'm using 256-bit vector registers, of which 16 are available (N_reg=16), and 4-byte single-precision floating-point data, so each register holds 8 elements (N_vec=8). My question is about the latency and the number of FMA units. According to the Intel Intrinsics Guide, on my architecture (i.e., Skylake) the _mm256_fmadd_ps instruction has a throughput of 0.5 cycles per instruction and a latency of 4 cycles. Since a throughput of 0.5 means two such instructions can start every cycle, I assumed there are 2 FMA units.

With these values I obtain the following bounds for X:

(8 * 2 * 4) < X < (16 * 8) => 64 < X < 128
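For checking the arithmetic with other parameter choices, here is a minimal sketch that evaluates both sides of the inequality. The type `x_bounds` and the function `fma_x_bounds` are hypothetical names made up for this illustration; the formula is the one quoted above.

```c
#include <assert.h>

/* Both ends of the inequality lower < X < upper. */
typedef struct {
    int lower; /* N_vec * N_fma * L_fma */
    int upper; /* N_reg * N_vec */
} x_bounds;

/* Evaluate the bounds on X for a given set of machine parameters. */
x_bounds fma_x_bounds(int n_vec, int n_reg, int n_fma, int l_fma)
{
    /* All parameters are counts or cycle latencies, so they must be positive. */
    assert(n_vec > 0 && n_reg > 0 && n_fma > 0 && l_fma > 0);

    x_bounds b;
    b.lower = n_vec * n_fma * l_fma;
    b.upper = n_reg * n_vec;
    return b;
}
```

Calling `fma_x_bounds(8, 16, 2, 4)` gives `lower = 64` and `upper = 128`, matching the values above.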

However, by running some experiments, I see that the execution time is shorter when I use X=256 or X=512.

Am I getting any of the above parameters wrong, especially N_fma and L_fma?
