There are new AVX-512 VNNI instructions in Intel Cascade Lake CPUs which can accelerate inference of neural networks on the CPU. I integrated them into the Simd Library to accelerate Synet (my small framework for inference of neural networks) and obtained a significant performance boost.
In fact I used only one instruction, _mm512_dpbusd_epi32 (vpdpbusd), which multiplies unsigned 8-bit integers by signed 8-bit integers and then accumulates the products into 32-bit integer accumulators.
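For reference, a minimal sketch of how the intrinsic is used (the wrapper name dot_accumulate is just for illustration):

#include <immintrin.h>

// For each 32-bit lane i of sum:
//   sum[i] += u8(a[4*i+0])*s8(b[4*i+0]) + ... + u8(a[4*i+3])*s8(b[4*i+3])
__m512i dot_accumulate(__m512i sum, __m512i a, __m512i b)
{
    return _mm512_dpbusd_epi32(sum, a, b);
}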
It would be great to perform analogous optimizations for NEON (the ARM platform).
So there is a question:
Does any analogous NEON instruction exist to emulate vpdpbusd? If there is no analogue, what is the best way to emulate the instruction?
There is a scalar implementation below (to best understand what the function must do):
// Scalar reference: for each of the 4 output lanes, accumulate the dot
// product of 4 adjacent unsigned inputs with 4 adjacent signed weights.
inline void pdpbusd(int32x4_t& sum, uint8x16_t input, int8x16_t weight)
{
    for (size_t i = 0; i < 4; ++i)
        for (size_t j = 0; j < 4; ++j)
            sum[i] += int32_t(input[i * 4 + j]) * int32_t(weight[i * 4 + j]);
}
The most straightforward implementation of that requires 3 instructions: vmovl.s8 and vmovl.u8 to extend the signed and unsigned 8-bit values to 16 bit, followed by vmlal.s16 to do a signed lengthening 16-bit multiplication, accumulated into a 32-bit register. And as vmlal.s16 only handles 4 elements, you'd need a second vmlal.s16 to multiply and accumulate the following 4 elements - so 4 instructions for 8 elements.
For AArch64 syntax, the corresponding instructions are sxtl, uxtl and smlal.
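With intrinsics, a minimal sketch of that sequence could look as follows (the helper name is just for illustration; the two accumulators stay separate because nothing is aggregated horizontally here):

#include <arm_neon.h>

// Multiply 8 unsigned inputs by 8 signed weights, widening each product
// to 32 bit while accumulating into two 4-lane accumulators (untested sketch).
inline void mla_u8_s8(int32x4_t& sum0, int32x4_t& sum1,
                      uint8x8_t input, int8x8_t weight)
{
    // vmovl.u8 / vmovl.s8 (uxtl / sxtl on AArch64): extend to 16 bit.
    // Reinterpreting the unsigned 16-bit values as signed is safe,
    // since 0..255 fits in a signed 16-bit lane.
    int16x8_t i16 = vreinterpretq_s16_u16(vmovl_u8(input));
    int16x8_t w16 = vmovl_s8(weight);
    // vmlal.s16 (smlal): lengthening multiply-accumulate, 4 elements each.
    sum0 = vmlal_s16(sum0, vget_low_s16(i16), vget_low_s16(w16));
    sum1 = vmlal_s16(sum1, vget_high_s16(i16), vget_high_s16(w16));
}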
Edit: If the output elements should be aggregated horizontally, one can't use the fused multiply-accumulate instruction vmlal. Then the solution would be vmovl.s8 and vmovl.u8, followed by vmul.i16 (for 8 input elements), vpaddl.s16 (to aggregate two elements horizontally), followed by a vpadd.i32 to get the sum of 4 elements horizontally. So 5 instructions for 8 input elements, or 10 instructions for a full 128-bit vector, followed by one final vadd.s32 to accumulate the final result into the accumulator. On AArch64, the equivalent of vpadd.i32, addp, can handle 128-bit vectors, so it's one instruction less there.
If you're using intrinsics, the implementation could look something like this: