Basically, how can I write the equivalent of this with AVX2 intrinsics? We assume here that result_in_float is of type __m256, while result is of type short int* or short int[8].
    for (i = 0; i < 8; i++)
        result[i] = (short int)result_in_float[i];
I know that floats can be converted to 32-bit integers using the __m256i _mm256_cvtps_epi32(__m256 m1) intrinsic, but I have no idea how to convert these 32-bit integers further to 16-bit integers. Beyond the conversion, I also want to store those values (as 16-bit integers) to memory, and I want to do all of that using vector instructions.
Searching around the internet, I found an intrinsic named _mm256_mask_storeu_epi16, but I'm not really sure whether that would do the trick, as I couldn't find an example of its usage.
_mm256_cvtps_epi32 is a good first step. The conversion to a packed vector of shorts is a bit annoying, requiring a cross-slice shuffle (so it's good that it's not in a dependency chain here).

Since the values can be assumed to be in the right range (as per the comment), we can use _mm256_packs_epi32 instead of _mm256_shuffle_epi8 to do the conversion. Either way it's a 1-cycle instruction on port 5, but using _mm256_packs_epi32 avoids having to get a shuffle mask from somewhere.

So to put it together (not tested):
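One way those steps could fit together is sketched below. This is an illustration under assumptions, not necessarily the exact code originally posted: the helper name float8_to_short8 and its pointer-based interface are made up for the example, the cross-slice fix-up is done with _mm256_permute4x64_epi64, and an AVX2-capable CPU is assumed.

```c
#include <immintrin.h>

/* Hypothetical helper: converts the 8 floats at src to 8 shorts at dst.
   The target attribute is GCC/Clang-specific and only spares you -mavx2;
   drop it if you already compile with AVX2 enabled. */
__attribute__((target("avx2")))
void float8_to_short8(const float *src, short *dst)
{
    __m256 result_in_float = _mm256_loadu_ps(src);

    /* 8 floats -> 8 int32, rounded using the current rounding mode
       (round-to-nearest-even by default). */
    __m256i tmp = _mm256_cvtps_epi32(result_in_float);

    /* int32 -> int16 with signed saturation. The pack works within each
       128-bit half, so the 64-bit chunks come out as
       [s0..s3, zeros, s4..s7, zeros]. */
    tmp = _mm256_packs_epi32(tmp, _mm256_setzero_si256());

    /* Cross-slice fix-up: reorder the 64-bit chunks as 0,2,1,3 so that
       s0..s7 end up in the low 128 bits. */
    tmp = _mm256_permute4x64_epi64(tmp, 0xD8);

    /* Take the low half (no instruction is emitted) and store 8 shorts. */
    __m128i res = _mm256_castsi256_si128(tmp);
    _mm_storeu_si128((__m128i *)dst, res);
}
```

Two caveats. A C cast truncates toward zero, whereas _mm256_cvtps_epi32 rounds, so if exact agreement with the scalar loop matters for non-integral inputs, _mm256_cvttps_epi32 is the truncating variant. Also, _mm256_mask_storeu_epi16 belongs to AVX-512 (BW and VL extensions), so it isn't available on AVX2-only hardware; a plain 128-bit store is enough here.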
The last step (the cast) is free; it just changes the type.
If you had two vectors of floats to convert, you could reuse most of the instructions, e.g. (not tested either):
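A sketch of that two-vector variant follows, again as an illustration rather than the original snippet. The helper name float16_to_short16 and the pointer interface are assumptions made for the example. The second converted vector takes the place of the zero vector in the pack, so one pack and one permute yield 16 shorts.

```c
#include <immintrin.h>

/* Hypothetical helper: converts the 16 floats at src to 16 shorts at dst.
   The target attribute is GCC/Clang-specific; drop it if you already
   compile with AVX2 enabled. */
__attribute__((target("avx2")))
void float16_to_short16(const float *src, short *dst)
{
    /* Two independent float -> int32 conversions (a = first 8, b = last 8). */
    __m256i tmp1 = _mm256_cvtps_epi32(_mm256_loadu_ps(src));
    __m256i tmp2 = _mm256_cvtps_epi32(_mm256_loadu_ps(src + 8));

    /* Pack both to int16 with signed saturation. Per 128-bit half, the
       64-bit chunks come out as [a0..a3, b0..b3, a4..a7, b4..b7]. */
    tmp1 = _mm256_packs_epi32(tmp1, tmp2);

    /* Reorder the chunks as 0,2,1,3: [a0..a3, a4..a7, b0..b3, b4..b7]. */
    tmp1 = _mm256_permute4x64_epi64(tmp1, 0xD8);

    /* Store all 16 shorts at once. */
    _mm256_storeu_si256((__m256i *)dst, tmp1);
}
```

Compared with the single-vector case, no half of the pack result is wasted on zeros, so the cost per converted element is roughly halved.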