I'm implementing conversions between SSE types and I found that implementing int8->int64 widening conversion for pre-SSE4.1 targets is cumbersome.
The straightforward implementation would be:
    inline __m128i convert_i8_i64(__m128i a)
    {
    #ifdef __SSE4_1__
        return _mm_cvtepi8_epi64(a);
    #else
        a = _mm_unpacklo_epi8(a, a);
        a = _mm_unpacklo_epi16(a, a);
        a = _mm_unpacklo_epi32(a, a);
        return _mm_srai_epi64(a, 56); // missing intrinsic!
    #endif
    }
But since _mm_srai_epi64 doesn't exist until AVX-512, there are two options at this point:
- implementing _mm_srai_epi64, or
- implementing convert_i8_i64 in a different way.
I'm not sure which one would be the most efficient solution. Any idea?
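For reference, the first option can be sketched with SSE2 alone when the shift count is fixed at 56, as it is here. This is only a minimal sketch: the name srai_epi64_by_56_sse2 is hypothetical, and it relies on the fact that for counts of 32 or more the result depends only on the upper 32 bits of each 64-bit element.

```c
#include <emmintrin.h> /* SSE2 */

/* Hypothetical helper (not a real intrinsic): arithmetic right shift of
   both 64-bit elements by 56, built from 32-bit shifts. */
static inline __m128i srai_epi64_by_56_sse2(__m128i a)
{
    /* Upper half of each 64-bit element shifted by 56 - 32 = 24:
       this becomes the low 32 bits of the result. */
    __m128i lo = _mm_srai_epi32(a, 24);
    /* Upper half shifted by 31: all sign bits, the high 32 bits of the result. */
    __m128i hi = _mm_srai_epi32(a, 31);
    /* Move the odd 32-bit elements (1 and 3) into positions 0 and 1. */
    lo = _mm_shuffle_epi32(lo, _MM_SHUFFLE(0, 0, 3, 1));
    hi = _mm_shuffle_epi32(hi, _MM_SHUFFLE(0, 0, 3, 1));
    /* Interleave to [lo1, hi1, lo3, hi3], i.e. two 64-bit results. */
    return _mm_unpacklo_epi32(lo, hi);
}
```

Dropped in place of the missing intrinsic, this costs two shifts, two shuffles and one unpack on top of the three unpacks already in convert_i8_i64.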
The unpacking intrinsics are used here in a funny way. They "duplicate" the data, instead of adding sign-extension, as one would expect. For example, before the first unpack you have in your register the following (only the two lowest bytes are shown):

    ... a b

If you convert a and b to 16 bits, you should get this:

    ... A a B b

Here A and B are sign-extensions of a and b, that is, each of them is either 0 or -1 (all bits set). Instead of this, your code gives

    ... a a b b

And then you convert it to the proper result by shifting right.
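To make the duplicate-then-shift idea concrete, here is a minimal sketch of the int8 to int16 case, where the required arithmetic shift does exist in SSE2. The function name widen_i8_i16_dup_shift is hypothetical and used only for illustration.

```c
#include <emmintrin.h> /* SSE2 */

/* Hypothetical helper: sign-extend the low 8 bytes of a to eight
   16-bit elements by duplicating each byte and shifting right. */
static inline __m128i widen_i8_i16_dup_shift(__m128i a)
{
    /* Each 16-bit element now holds the same byte twice: "a a". */
    a = _mm_unpacklo_epi8(a, a);
    /* The arithmetic shift discards the low copy and fills the top
       with the sign of the high copy: "A a". */
    return _mm_srai_epi16(a, 8);
}
```

The same pattern works for 32-bit elements with _mm_srai_epi32(a, 24); it only breaks down at 64 bits, where SSE2 has no arithmetic shift.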
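However, you are not obliged to use the same operand twice in the "unpack" intrinsics. You could get the desired result if you "unpacked" the following two registers:

    ... a b
    ... A B

That is, the original register interleaved with its per-byte arithmetic right shift (if an _mm_srai_epi8 intrinsic actually existed).

You can apply the same idea to the last stage of your conversion. At that point each 32-bit element holds four copies of the original byte, and you want to "unpack" the following two registers:

    ... A A A a  B B B b
    ... A A A A  B B B B

To get them, right-shift the 32-bit data arithmetically: by 24 bits for the first register and by 31 bits for the second. So the last "unpack" is _mm_unpacklo_epi32 applied to these two shifted registers.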
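Putting the pieces together, the whole conversion might look like the sketch below for pre-SSE4.1 targets. It is a reading of the approach described above, not a tuned implementation. The name convert_i8_i64_sse2 is hypothetical, and the shift counts 24 and 31 are my choice for producing the sign-extended value and the all-sign-bits word.

```c
#include <emmintrin.h> /* SSE2 */

/* Hypothetical SSE2-only variant: sign-extend the two lowest bytes of a
   to two 64-bit elements. */
static inline __m128i convert_i8_i64_sse2(__m128i a)
{
    __m128i lo, hi;

    /* Two "duplicating" unpacks: each 32-bit element ends up holding
       four copies of its source byte. */
    a = _mm_unpacklo_epi8(a, a);
    a = _mm_unpacklo_epi16(a, a);

    /* Low halves: the byte sign-extended to 32 bits ("A A A a"). */
    lo = _mm_srai_epi32(a, 24);
    /* High halves: nothing but copies of the sign bit ("A A A A"). */
    hi = _mm_srai_epi32(a, 31);

    /* Interleave low and high halves into 64-bit elements. */
    return _mm_unpacklo_epi32(lo, hi);
}
```

Compared with the original attempt, the third duplicating unpack and the missing 64-bit shift are replaced by two 32-bit shifts and one unpack of different operands.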