Latency and number of FMA units

Asked by Mirco Mannino on 11 April 2022 at 13:44

I'm trying to implement the convolution algorithm described in this paper. The authors state that the number of independent elements processed by FMA instructions is bounded below by the latency of FMA instructions and bounded above by the number and width of the vector registers, in the following way:

N_vec * N_fma * L_fma < X < N_reg * N_vec

Where:

N_vec: Number of elements contained in a vector register
N_reg: Number of vector registers
N_fma: Number of FMA units
L_fma: Latency of one FMA instruction

I'm using an Intel Core i7-10510U, and I set the parameters as follows:

NAME	VALUE
N_vec	8
N_reg	16
N_fma	2
L_fma	4

This is motivated by the following reasons: I'm using 256-bit registers (N_reg=16) and 4-byte single-precision floating-point data (N_vec=8). My question is about the latency and the number of FMA units. From the Intel Intrinsics Guide I see that on my architecture (i.e., Skylake) the _mm256_fmadd_ps instruction has a throughput of 0.5 (cycles/instruction) and a latency of 4 cycles. For this reason I assumed that there are 2 FMA units.
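The step from the listed throughput to N_fma can be written out as a small sketch. It assumes that the 0.5 figure is a reciprocal throughput (cycles per instruction) and that each FMA unit is fully pipelined, so it can start one instruction per cycle. The helper names are hypothetical and only serve this illustration.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: FMA instructions that can start each cycle, given the
// reciprocal throughput (cycles per instruction) listed in the guide.
// Assumption: each FMA unit is fully pipelined (one new FMA per cycle).
inline int fma_issue_per_cycle(double cycles_per_instruction) {
    assert(cycles_per_instruction > 0.0);
    return static_cast<int>(std::lround(1.0 / cycles_per_instruction));
}

// Independent FMA instructions that must be in flight to keep every unit
// busy: a unit can accept a new FMA each cycle, but a result only becomes
// available after `latency_cycles`.
inline int fma_in_flight(int latency_cycles, int issue_per_cycle) {
    return latency_cycles * issue_per_cycle;
}
```

Under these assumptions, a reciprocal throughput of 0.5 corresponds to 2 FMA instructions starting per cycle, and a latency of 4 cycles then calls for 8 independent FMA instructions in flight.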

By doing so I obtain the bounds for X:

(8 * 2 * 4) < X < (16 * 8) => 64 < X < 128
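The arithmetic above can be reproduced with a short sketch that also expresses the same bounds in whole vector registers. The field names mirror the symbols in the question. The struct and function names are hypothetical.

```cpp
#include <cassert>

// Parameter names mirror the symbols in the question; the types themselves
// are hypothetical and exist only for this illustration.
struct FmaParams {
    int n_vec;  // elements per vector register
    int n_reg;  // number of vector registers
    int n_fma;  // number of FMA units
    int l_fma;  // FMA latency in cycles
};

struct Bounds {
    int lower;
    int upper;
};

// Lower bound: N_vec * N_fma * L_fma elements.
// Upper bound: N_reg * N_vec elements (the whole vector register file).
constexpr Bounds element_bounds(const FmaParams& p) {
    return Bounds{p.n_vec * p.n_fma * p.l_fma, p.n_reg * p.n_vec};
}

// The same bounds divided by N_vec, i.e. counted in vector registers.
constexpr Bounds register_bounds(const FmaParams& p) {
    return Bounds{p.n_fma * p.l_fma, p.n_reg};
}
```

With the values from the table (N_vec=8, N_reg=16, N_fma=2, L_fma=4), `element_bounds` gives 64 and 128 elements, which corresponds to between 8 and 16 vector registers.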

However, by running some experiments, I see that the execution time is shorter when I use X=256 or X=512.
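For experiments of this kind, one way to vary the number of independent accumulations is to make it a compile-time parameter. The sketch below is a portable scalar dot product, not the convolution kernel from the paper. `K` and `dot_k_chains` are hypothetical names, and whether the compiler turns the inner loop into vector FMA instructions depends on the compiler and its flags.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Dot product with K independent accumulation chains. K is a hypothetical
// tuning knob for this sketch, not a symbol from the paper.
template <std::size_t K>
float dot_k_chains(const std::vector<float>& a, const std::vector<float>& b) {
    static_assert(K > 0, "at least one accumulator is required");
    assert(a.size() == b.size());

    std::array<float, K> acc{};  // K partial sums, zero-initialised
    const std::size_t n = a.size();
    std::size_t i = 0;

    // Main loop: chain k depends only on its own previous value, so the
    // K fused multiply-adds in one iteration are independent of each other.
    for (; i + K <= n; i += K) {
        for (std::size_t k = 0; k < K; ++k) {
            acc[k] = std::fma(a[i + k], b[i + k], acc[k]);
        }
    }

    // Tail: leftover elements are folded into the first chain.
    for (; i < n; ++i) {
        acc[0] = std::fma(a[i], b[i], acc[0]);
    }

    // Combine the partial sums into the final result.
    float sum = 0.0f;
    for (float v : acc) {
        sum += v;
    }
    return sum;
}
```

Instantiating it for several values, for example `dot_k_chains<8>(a, b)` and `dot_k_chains<64>(a, b)`, and timing each call on the same input gives a simple way to compare configurations.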

Am I getting any of the above parameters wrong, especially N_fma and L_fma?
