I'm looking for an efficient way to extract the lower 64-bit integer from an __m128i on AMD Piledriver. Something like this:
static inline int64_t extractlo_64(__m128i x)
{
    int64_t result;
    // extract into result
    return result;
}
The instruction tables say that the common approach, using _mm_extract_epi64(), is inefficient on this processor. It generates a PEXTRQ instruction, which has a latency of 10 cycles (compared to 2-3 cycles on Intel processors). Is there a better way to do this?
On x86-64 you can use _mm_cvtsi128_si64, which translates to a single MOVQ r64, xmm instruction.
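Filling in the skeleton from the question could look like the sketch below. It assumes an x86-64 target with SSE2 and the <emmintrin.h> header; the cast is only there because the intrinsic's return type is spelled differently across compilers.

```c
#include <emmintrin.h>  /* SSE2 intrinsics, including _mm_cvtsi128_si64 */
#include <stdint.h>

/* Return the low 64 bits of x as a signed integer.
   _mm_cvtsi128_si64 is only available when compiling for x86-64. */
static inline int64_t extractlo_64(__m128i x)
{
    return (int64_t)_mm_cvtsi128_si64(x);
}
```

For example, extractlo_64(_mm_set_epi64x(7, 42)) should give 42, since _mm_set_epi64x takes the high half first and the low half second. The intrinsic does not exist in 32-bit builds; there, storing the low half to memory with _mm_storel_epi64 is one possible fallback.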