Starting with Cascade Lake, Intel CPUs support the AVX-512 VNNI instructions, which can accelerate inference of quantized neural networks on the CPU.
In particular, there is an instruction _mm512_dpbusd_epi32 (vpdpbusd) which multiplies unsigned 8-bit integers by signed 8-bit integers and accumulates the products into 32-bit integer accumulators. Pseudocode for this instruction is shown below:
#include <stdint.h>

void _mm512_dpbusd_epi32(int32_t sum[16], const uint8_t a[16][4], const int8_t b[16][4])
{
    // For each 32-bit lane: dot product of four unsigned bytes of a
    // with four signed bytes of b, accumulated into the 32-bit sum.
    for (int i = 0; i < 16; ++i)
        sum[i] +=
            (int)a[i][0] * b[i][0] + (int)a[i][1] * b[i][1] +
            (int)a[i][2] * b[i][2] + (int)a[i][3] * b[i][3];
}
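For context, here is a sketch of how this intrinsic could be used in practice; dot_u8i8 is a hypothetical helper (not part of the original question) and it assumes n is a multiple of 64:

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// Hypothetical inner loop of a quantized dot product: on a VNNI-capable CPU,
// each iteration accumulates 64 u8*i8 products with a single vpdpbusd.
int32_t dot_u8i8(const uint8_t* a, const int8_t* b, size_t n)
{
    __m512i sum = _mm512_setzero_si512();
    for (size_t i = 0; i < n; i += 64)
        sum = _mm512_dpbusd_epi32(sum,
                                  _mm512_loadu_si512(a + i),
                                  _mm512_loadu_si512(b + i));
    return _mm512_reduce_add_epi32(sum); // horizontal sum of 16 INT32 lanes
}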
Unfortunately, Intel CPUs before Cascade Lake don't have this instruction, so the question arises of how to emulate it using earlier extensions (for example AVX-512BW). So my question is: how can this emulation be made as efficient as possible?
I think this question does not have one correct answer.
On the one hand, a fast emulation of _mm512_dpbusd_epi32 using the AVX-512BW extension can be written as shown below.
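This is a minimal sketch of the standard maddubs/madd/add idiom that the description below matches; the function name is hypothetical:

#include <immintrin.h>

inline __m512i dpbusd_epi32_bw_fast(__m512i sum, __m512i a, __m512i b)
{
    // Multiply u8 by i8 and add adjacent pairs into INT16 lanes;
    // this step saturates if a pair sum exceeds the INT16 range.
    __m512i i16 = _mm512_maddubs_epi16(a, b);
    // Widen pairs of INT16 into INT32 lanes (multiply by 1, add pairs).
    __m512i i32 = _mm512_madd_epi16(i16, _mm512_set1_epi16(1));
    // Accumulate into the 32-bit sums.
    return _mm512_add_epi32(sum, i32);
}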
This implementation uses only 3 instructions (and all of them are fast), but it can give an incorrect result due to possible overflow (saturation) of the INT16 intermediate sums in the _mm512_maddubs_epi16 instruction. On the other hand, a correct emulation looks awful and takes 14 instructions (and some of them are notably slow):
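What follows is a sketch of one possible correct approach, splitting even and odd bytes and widening them to 16 bits before _mm512_madd_epi16; this is my reconstruction of the idea, not necessarily the original 14-instruction sequence:

#include <immintrin.h>

inline __m512i dpbusd_epi32_bw_exact(__m512i sum, __m512i a, __m512i b)
{
    const __m512i even = _mm512_set1_epi16(0x00FF);
    // Zero-extend the even and odd unsigned bytes of a into INT16 lanes.
    __m512i a0 = _mm512_and_si512(a, even);
    __m512i a1 = _mm512_srli_epi16(a, 8);
    // Sign-extend the even and odd signed bytes of b into INT16 lanes.
    __m512i b0 = _mm512_srai_epi16(_mm512_slli_epi16(b, 8), 8);
    __m512i b1 = _mm512_srai_epi16(b, 8);
    // Each madd adds two products of at most 255*128 in magnitude,
    // so the 32-bit accumulation cannot overflow here.
    __m512i s0 = _mm512_madd_epi16(a0, b0);
    __m512i s1 = _mm512_madd_epi16(a1, b1);
    return _mm512_add_epi32(sum, _mm512_add_epi32(s0, s1));
}

Which variant is acceptable depends on the value ranges of the inputs: if they are constrained so that every pairwise INT16 sum stays within [-32768, 32767], the fast version is also exact.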