Optimal SIMD algorithm to rotate or transpose an array

Question

Optimal SIMD algorithm to rotate or transpose an array

2.8k views Asked by Thomas Kejser At 29 December 2024 at 10:37

I am working on a data structure where I have an array of 16 uint64. They are laid out like this in memory (each below representing a single int64):

A0 A1 A2 A3 B0 B1 B2 B3 C0 C1 C2 C3 D0 D1 D2 D3

The desired result is to transpose the array into this:

A0 B0 C0 D0 A1 B1 C1 D1 A2 B2 C2 D2 A3 B3 C3 D3

The rotation of the array 90 degrees is also an acceptable solution for the future loop:

D0 C0 B0 A0 D1 C1 B1 A1 D2 C2 B2 A2 D3 C3 B3 A3

I need this in order to operate on the arrow fast at a later point (Traverse it sequentially with another SIMD trip, 4 at a time).

So far, I have tried to "blend" the data by loading up a 4 x 64 bit vector of A's, bitmaskising and shuffling the elements and OR'ing it with B's etc and then repeating that for C's... Unfortunately, this is 5 x 4 SIMD instructions per segment of 4 elements in the array (one load, one mask, one shuffle, one or with next element and finally a store). It seems I should be able to do better.

I have AVX2 available and I a compiling with clang.

Original Q&A

There are 2 answers

user3636086 On 20 November 2014 at 10:21

An alternative is to use the gather instructions, you can load directly the transposed matrix. The five lines of code below are ok with gcc on a i7-Haswell.

  int32_t stride = 4 * sizeof(A[0]);
  __m128i r128_gather_idx = _mm_set_epi32(3 * stride, 2 * stride, 1 * stride, 0 * stride);
  __m256i row0 = _mm256_i32gather_epi64(reinterpret_cast<long long const *>(&A[ 0]), r128_gather_idx, 1);
  __m256i row1 = _mm256_i32gather_epi64(reinterpret_cast<long long const *>(&A[ 1]), r128_gather_idx, 1);
  __m256i row2 = _mm256_i32gather_epi64(reinterpret_cast<long long const *>(&A[ 2]), r128_gather_idx, 1);

**Z boson** · Accepted Answer · 2014-11-19T10:28:09+00:00

uint64_t A[16] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
__m256i row0 = _mm256_loadu_si256((__m256i*)&A[ 0]); //0 1 2 3
__m256i row1 = _mm256_loadu_si256((__m256i*)&A[ 4]); //4 5 6 7
__m256i row2 = _mm256_loadu_si256((__m256i*)&A[ 8]); //8 9 a b
__m256i row3 = _mm256_loadu_si256((__m256i*)&A[12]); //c d e f

I don't have hardware to test this on right now but something like the following should do what you want

__m256i tmp3, tmp2, tmp1, tmp0;
tmp0 = _mm256_unpacklo_epi64(row0, row1);            //0 4 2 6
tmp1 = _mm256_unpackhi_epi64(row0, row1);            //1 5 3 7
tmp2 = _mm256_unpacklo_epi64(row2, row3);            //8 c a e
tmp3 = _mm256_unpackhi_epi64(row2, row3);            //9 d b f
//now select the appropriate 128-bit lanes
row0 = _mm256_permute2x128_si256(tmp0, tmp2, 0x20);  //0 4 8 c
row1 = _mm256_permute2x128_si256(tmp1, tmp3, 0x20);  //1 5 9 d
row2 = _mm256_permute2x128_si256(tmp0, tmp2, 0x31);  //2 6 a e
row3 = _mm256_permute2x128_si256(tmp1, tmp3, 0x31);  //3 7 b f

The

__m256i _mm256_permute2x128_si256 (__m256i a, __m256i b, const int imm)

intrinsic selects 128-bit lanes from two sources. You can read about it in the Intel Intrinsic Guide. There is a version _mm256_permute2f128_si256 which only needs AVX and acts in the floating point domain. I used this to check that I used the correct control words.

TechQA.

Optimal SIMD algorithm to rotate or transpose an array

There are 2 answers

Related Questions in ASSEMBLY

Related Questions in INTEL

Related Questions in SIMD

Related Questions in TRANSPOSE

Related Questions in AVX2

Popular Questions

Popular Tags

Trending Questions