I have an array of 8-bit integers that I want to process with SIMD instructions. Since those integers will be used alongside single-precision floating-point numbers, I actually want to load them into 32-bit lanes instead of the more "natural" 8-bit lanes.
Assuming AVX512, if I have the following array:
std::array< std::uint8_t, 16 > i{ i0, i1, i2, i3, i4, i5, i6, i7, i8, i9, i10, i11, i12, i13, i14, i15 };
I wish to end up with a __m512i register filled with the following bytes:
[ 0, 0, 0, i0,
0, 0, 0, i1,
0, 0, 0, i2,
0, 0, 0, i3,
0, 0, 0, i4,
0, 0, 0, i5,
0, 0, 0, i6,
0, 0, 0, i7,
0, 0, 0, i8,
0, 0, 0, i9,
0, 0, 0, i10,
0, 0, 0, i11,
0, 0, 0, i12,
0, 0, 0, i13,
0, 0, 0, i14,
0, 0, 0, i15 ]
What is the best way to achieve that? I currently handroll it using:
_mm512_set_epi32(
    i[0], i[1], i[2], i[3],
    i[4], i[5], i[6], i[7],
    i[8], i[9], i[10], i[11],
    i[12], i[13], i[14], i[15]);
Note: I used AVX512 as an example; ideally I would like a "generic" strategy that can be abstracted over several instruction sets using e.g. Google Highway.
It is possible to do this using Google Highway.
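A minimal sketch of the core operation, assuming Highway's LoadU / Rebind / PromoteTo API (PromoteTo performs the u8 -> u32 zero extension); the helper name LoadU8IntoU32Lanes is just an illustration, and the usual per-target boilerplate is omitted:

#include <cstdint>
#include <hwy/highway.h>

namespace hn = hwy::HWY_NAMESPACE;

// Zero-extend contiguous 8-bit integers into the 32-bit lanes of a full vector.
template <class D32>
hn::Vec<D32> LoadU8IntoU32Lanes(D32 d32, const std::uint8_t* p) {
  // Rebind yields a uint8_t descriptor with the same lane count as d32,
  // i.e. a quarter-width load, which PromoteTo then widens lane by lane.
  const hn::Rebind<std::uint8_t, D32> d8;
  return hn::PromoteTo(d32, hn::LoadU(d8, p));
}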
Highway then takes care of the plumbing: you compile this for the desired targets, either statically or via foreach_target.h and HWY_DYNAMIC_DISPATCH for runtime dispatch. Assuming we compiled for AVX512, it could be used as follows:
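A hypothetical call site, reusing the array i from the question and the LoadU8IntoU32Lanes sketch above:

const hn::ScalableTag<std::uint32_t> d32;  // 16 u32 lanes when targeting AVX512
const auto v = LoadU8IntoU32Lanes(d32, i.data());
// v now holds i0..i15, each zero-extended into its own 32-bit lane.

Because ScalableTag adapts to the target, the lane count (and therefore the number of bytes consumed per load) varies; hn::Lanes(d32) reports it if you need to loop over a longer array.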
For AVX512, this is equivalent to:
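That is (a sketch of the expected lowering rather than verified compiler output; load_u8_as_u32 is an illustrative name):

#include <cstdint>
#include <immintrin.h>

// Load the 16 bytes into a 128-bit register, then zero-extend each byte
// into a 32-bit lane of the 512-bit result (VPMOVZXBD).
__m512i load_u8_as_u32(const std::uint8_t* p) {
  const __m128i bytes = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
  return _mm512_cvtepu8_epi32(bytes);
}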