I am working on a data structure where I have an array of 16 uint64 values. They are laid out like this in memory (each entry below represents a single uint64):
A0 A1 A2 A3 B0 B1 B2 B3 C0 C1 C2 C3 D0 D1 D2 D3
The desired result is to transpose the array into this:
A0 B0 C0 D0 A1 B1 C1 D1 A2 B2 C2 D2 A3 B3 C3 D3
The rotation of the array 90 degrees is also an acceptable solution for the future loop:
D0 C0 B0 A0 D1 C1 B1 A1 D2 C2 B2 A2 D3 C3 B3 A3
I need this in order to operate on the array quickly at a later point (traversing it sequentially in another SIMD pass, 4 elements at a time).
So far, I have tried to "blend" the data by loading a 4 x 64-bit vector of A's, bitmasking and shuffling the elements, OR'ing it with the B's, and then repeating that for the C's and so on. Unfortunately, this takes 5 x 4 SIMD instructions per segment of 4 elements in the array (one load, one mask, one shuffle, one OR with the next element and finally a store). It seems I should be able to do better.
I have AVX2 available and I am compiling with clang.
I don't have hardware to test this on right now, but something like the following should do what you want.
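A minimal sketch of such a transpose, assuming the 16 values are stored contiguously in row-major order and that the AVX2 lane permute meant here is _mm256_permute2x128_si256. The function name transpose4x4_epi64 is hypothetical, the target attribute is a GCC/clang extension, and this is not necessarily the exact code originally posted. The idea is to interleave pairs of rows inside each 128-bit lane first, then pick lanes from the two intermediate results.

```c
#include <immintrin.h>
#include <stdint.h>

/* Transpose a 4x4 matrix of 64-bit integers.
   src holds A0 A1 A2 A3 B0 B1 B2 B3 C0 C1 C2 C3 D0 D1 D2 D3,
   dst receives A0 B0 C0 D0 A1 B1 C1 D1 A2 B2 C2 D2 A3 B3 C3 D3.
   Hypothetical helper; requires a CPU with AVX2. */
__attribute__((target("avx2")))
void transpose4x4_epi64(const uint64_t *src, uint64_t *dst)
{
    /* Unaligned loads, so no alignment is assumed for src. */
    __m256i a = _mm256_loadu_si256((const __m256i *)(src + 0));  /* A0 A1 A2 A3 */
    __m256i b = _mm256_loadu_si256((const __m256i *)(src + 4));  /* B0 B1 B2 B3 */
    __m256i c = _mm256_loadu_si256((const __m256i *)(src + 8));  /* C0 C1 C2 C3 */
    __m256i d = _mm256_loadu_si256((const __m256i *)(src + 12)); /* D0 D1 D2 D3 */

    /* Interleave 64-bit elements within each 128-bit lane. */
    __m256i ab_lo = _mm256_unpacklo_epi64(a, b); /* A0 B0 A2 B2 */
    __m256i ab_hi = _mm256_unpackhi_epi64(a, b); /* A1 B1 A3 B3 */
    __m256i cd_lo = _mm256_unpacklo_epi64(c, d); /* C0 D0 C2 D2 */
    __m256i cd_hi = _mm256_unpackhi_epi64(c, d); /* C1 D1 C3 D3 */

    /* Control word 0x20: low lane of the first source, low lane of the second.
       Control word 0x31: high lane of the first source, high lane of the second. */
    __m256i r0 = _mm256_permute2x128_si256(ab_lo, cd_lo, 0x20); /* A0 B0 C0 D0 */
    __m256i r1 = _mm256_permute2x128_si256(ab_hi, cd_hi, 0x20); /* A1 B1 C1 D1 */
    __m256i r2 = _mm256_permute2x128_si256(ab_lo, cd_lo, 0x31); /* A2 B2 C2 D2 */
    __m256i r3 = _mm256_permute2x128_si256(ab_hi, cd_hi, 0x31); /* A3 B3 C3 D3 */

    _mm256_storeu_si256((__m256i *)(dst + 0),  r0);
    _mm256_storeu_si256((__m256i *)(dst + 4),  r1);
    _mm256_storeu_si256((__m256i *)(dst + 8),  r2);
    _mm256_storeu_si256((__m256i *)(dst + 12), r3);
}
```

Under these assumptions the whole 4x4 block costs four loads, eight shuffles and four stores, and each output row is a contiguous 256-bit vector that a later pass can load 4 elements at a time.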
The lane-permute intrinsic selects 128-bit lanes from two sources. You can read about it in the Intel Intrinsics Guide. There is a version, _mm256_permute2f128_si256, which only needs AVX and acts in the floating point domain. I used this to check that I used the correct control words.