Fastest way to spread 4 bytes into 8 bytes (32bit -> 64bit)


Assume you have a 32-bit unsigned integer whose bytes are organized like this: a b c d. What is the fastest way to spread these bytes into a 64-bit unsigned integer in this fashion: 0 a 0 b 0 c 0 d? It is for the x86-64 architecture. I would like to know the fastest approach without using special intrinsics, although an intrinsics-based answer would also be interesting. (I say 'fastest', but compact solutions with reasonable performance are also welcome.)

Edit for people who want context: this seems like a really easy task, just shifting some bytes around, yet it requires more instructions than you'd think (check this godbolt with optimizations). So I wonder whether anyone knows a way to solve the problem with fewer instructions.

There are 2 answers

zch (best answer)
uint64_t x = ...;
// 0 0 0 0 a b c d
x |= x << 16;
// 0 0 a b ? ? c d
x = ((x << 8) & 0x00ff000000ff0000) | (x & 0x000000ff000000ff);
// 0 a 0 b 0 c 0 d
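For anyone who wants to paste the snippet above into a file, here is a minimal sketch of it wrapped in a function. The names `spread_bytes` and `spread_bytes_selfcheck` are made up for illustration and are not from the answer; the logic is the same two-step shift and mask.

```c
#include <assert.h>
#include <stdint.h>

/* Spread the four bytes a b c d of v into 0 a 0 b 0 c 0 d.
   spread_bytes is a hypothetical name chosen for this sketch. */
static inline uint64_t spread_bytes(uint32_t v)
{
    uint64_t x = v;                                  /* 0 0 0 0 a b c d */
    x |= x << 16;                                    /* 0 0 a b ? ? c d */
    x = ((x << 8) & UINT64_C(0x00ff000000ff0000))    /* 0 a 0 0 0 c 0 0 */
      | (x        & UINT64_C(0x000000ff000000ff));   /* 0 0 0 b 0 0 0 d */
    return x;                                        /* 0 a 0 b 0 c 0 d */
}

/* Small sanity check on a few inputs (also hypothetical). */
static inline void spread_bytes_selfcheck(void)
{
    assert(spread_bytes(0x11223344u) == UINT64_C(0x0011002200330044));
    assert(spread_bytes(0xffffffffu) == UINT64_C(0x00ff00ff00ff00ff));
    assert(spread_bytes(0u) == 0);
}
```

The `?` bytes left over after the first step never reach the result, because neither mask selects them.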

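And for completeness, x86 processors with the BMI2 extension can do this with a single instruction (pdep), available as an intrinsic in <immintrin.h>. It is fast on Intel since Haswell, but microcoded and slow on AMD before Zen 3: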

x = _pdep_u64(x, 0x00ff00ff00ff00ff);
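A sketch of how the pdep variant might be packaged so that it still builds when BMI2 is not enabled. It assumes GCC or Clang, which define `__BMI2__` when compiled with `-mbmi2` (or a suitable `-march`); the function names are hypothetical. Without that flag it silently falls back to the shift-and-mask version.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__BMI2__) && defined(__x86_64__)
#include <immintrin.h>   /* _pdep_u64 */
#endif

/* Portable shift-and-mask fallback (hypothetical name). */
static inline uint64_t spread_bytes_portable(uint32_t v)
{
    uint64_t x = v;                                  /* 0 0 0 0 a b c d */
    x |= x << 16;                                    /* 0 0 a b ? ? c d */
    return ((x << 8) & UINT64_C(0x00ff000000ff0000))
         | (x        & UINT64_C(0x000000ff000000ff)); /* 0 a 0 b 0 c 0 d */
}

/* Uses pdep when the compiler was told BMI2 is available,
   otherwise the portable version. Hypothetical wrapper. */
static inline uint64_t spread_bytes_pdep(uint32_t v)
{
#if defined(__BMI2__) && defined(__x86_64__)
    /* pdep deposits the low bits of v, in order, into the set bits of
       the mask: byte d lands in bits 0-7, c in 16-23, b in 32-39,
       a in 48-55. */
    return _pdep_u64(v, UINT64_C(0x00ff00ff00ff00ff));
#else
    return spread_bytes_portable(v);
#endif
}

/* Both paths must agree (hypothetical check). */
static inline void spread_bytes_pdep_selfcheck(void)
{
    assert(spread_bytes_pdep(0x11223344u) == spread_bytes_portable(0x11223344u));
    assert(spread_bytes_pdep(0x11223344u) == UINT64_C(0x0011002200330044));
}
```

The compile-time check only says what the compiler may emit; a binary built with `-mbmi2` will fault on a CPU without BMI2, so runtime dispatch would be needed for portable binaries.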
Vlad Feinstein

Something like this?

_mm256_cvtepu8_epi16(eight_bit_numbers) (AVX2): takes a 128-bit vector of sixteen unsigned 8-bit numbers and zero-extends each one to 16 bits, producing a 256-bit vector of sixteen 16-bit integers. For example:

 __m128i value1 = _mm_setr_epi8(0x11, 0x22, 0x33, 0x44, 
    0x55, 0x66, 0x77, 0x88, 0x99, 0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff, 0x00);
 __m256i value2 = _mm256_cvtepu8_epi16(value1);

Or, to zero-extend 32-bit integers to 64-bit, see _mm256_cvtepu32_epi64:

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_cvtepu32_epi64
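Applied to the single 32-bit value from the question, the 128-bit sibling of that widening intrinsic, `_mm_cvtepu8_epi16` (SSE4.1), is enough. The following is only a sketch: it assumes GCC or Clang on x86-64 with `-msse4.1` (which defines `__SSE4_1__`), the function name is hypothetical, and a plain loop is used when SSE4.1 is not enabled. Whether the round trip through a vector register beats the scalar version would have to be measured.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE4_1__) && defined(__x86_64__)
#include <smmintrin.h>   /* _mm_cvtepu8_epi16; also pulls in SSE2 */
#endif

/* Spread a b c d into 0 a 0 b 0 c 0 d via byte-to-word zero extension.
   spread_bytes_sse41 is a hypothetical name for this sketch. */
static inline uint64_t spread_bytes_sse41(uint32_t v)
{
#if defined(__SSE4_1__) && defined(__x86_64__)
    __m128i bytes = _mm_cvtsi32_si128((int)v);   /* v in the low 32 bits, rest zero */
    __m128i words = _mm_cvtepu8_epi16(bytes);    /* each of the low 8 bytes -> 16 bits */
    return (uint64_t)_mm_cvtsi128_si64(words);   /* low 64 bits hold the result */
#else
    /* Scalar fallback: move byte i from bit 8*i to bit 16*i. */
    uint64_t r = 0;
    for (int i = 0; i < 4; ++i)
        r |= (uint64_t)((v >> (8 * i)) & 0xffu) << (16 * i);
    return r;
#endif
}

/* Quick sanity check (hypothetical). */
static inline void spread_bytes_sse41_selfcheck(void)
{
    assert(spread_bytes_sse41(0x11223344u) == UINT64_C(0x0011002200330044));
}
```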