Is there a more efficient way to do an AVX(2) scatter than the following code generated by gcc?

Asked by zx-81 on 14 August 2022 at 11:27 (140 views)

What is the most efficient way to scatter the 8x32-bit floats in an AVX2 register A to memory locations indexed by another AVX2 register, IDX (8x32-bit integers)?

gcc compiles the straightforward implementation below into a sequence of shuffle/extract/movss instructions (see the attached assembler listing):

for(int i=0;i<8;i++) array[IDX[i]] = A[i];
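For comparison, one commonly suggested alternative to the shuffle/extract sequence can be sketched as follows: spill both registers to small stack arrays with one vector store each, then run the same scalar loop on the copies. This is a sketch under stated assumptions, not a measured improvement: it assumes x86-64 with AVX2 and a GCC- or Clang-style `__attribute__((target("avx2")))`. The names `scatter8_spill` and `scatter8_spill_from_arrays` are hypothetical; the second exists only so that code compiled without `-mavx2` can call it with plain arrays.

```c
#include <assert.h>
#include <immintrin.h>

/* 8 lanes of 32-bit floats and 32-bit indices, as in the question. */
static_assert(sizeof(float) == 4 && sizeof(int) == 4, "expects 32-bit lanes");

/* Spill A and IDX with one vector store each, then do eight scalar
   stores in lane order (same semantics as the original loop). */
__attribute__((target("avx2")))
static inline void scatter8_spill(float *array, __m256 a, __m256i idx)
{
    float va[8];
    int vi[8];
    _mm256_storeu_ps(va, a);
    _mm256_storeu_si256((__m256i *)vi, idx);
    for (int i = 0; i < 8; i++)
        array[vi[i]] = va[i];   /* increasing i: the last duplicate index wins */
}

/* Hypothetical helper: loads A and IDX from plain arrays, so callers
   compiled without -mavx2 never handle __m256 values directly. */
__attribute__((target("avx2")))
void scatter8_spill_from_arrays(float *array, const float *a, const int *idx)
{
    scatter8_spill(array, _mm256_loadu_ps(a),
                   _mm256_loadu_si256((const __m256i *)idx));
}
```

This trades shuffles for scalar reloads, which on many x86 cores can be served by store forwarding from the vector store; the compiler may also turn the spill back into extracts, so check the generated assembly and benchmark on the target CPU. Keeping the loop in increasing `i` order preserves the original semantics when IDX contains duplicates.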

My question is: can this be improved with hand-coded intrinsics or assembly? Note: I am aware that SIMD gather/scatter performance is normally masked (limited) by memory bandwidth, but here the assumption is that all data resides in the L1 or L2 cache.
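For context on "hand-coded intrinsics": AVX2 has gather instructions (`vgatherdps`) but no scatter. A real scatter instruction (`vscatterdps`, intrinsic `_mm256_i32scatter_ps`) requires AVX-512F plus AVX-512VL for 256-bit vectors. The sketch below hand-codes roughly the shuffle/extract/store pattern the question describes for plain AVX2, and dispatches at run time to the AVX-512VL scatter when the CPU supports it. Assumptions: x86-64 with at least AVX2, and GCC or Clang (for `__attribute__((target(...)))` and `__builtin_cpu_supports`). All function names here are hypothetical.

```c
#include <assert.h>
#include <immintrin.h>

static_assert(sizeof(float) == 4 && sizeof(int) == 4, "expects 32-bit lanes");

/* Store the four floats of v to array[vi[0..3]], in lane order. */
__attribute__((target("avx2")))
static inline void store4(float *array, __m128 v, __m128i vi)
{
    _mm_store_ss(&array[_mm_cvtsi128_si32(vi)], v);                     /* lane 0 */
    _mm_store_ss(&array[_mm_extract_epi32(vi, 1)], _mm_movehdup_ps(v));    /* lane 1 */
    _mm_store_ss(&array[_mm_extract_epi32(vi, 2)], _mm_unpackhi_ps(v, v)); /* lane 2 */
    _mm_store_ss(&array[_mm_extract_epi32(vi, 3)],
                 _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3)));          /* lane 3 */
}

/* AVX2 only: split both registers into 128-bit halves, low half first. */
__attribute__((target("avx2")))
static inline void scatter8_avx2(float *array, __m256 a, __m256i idx)
{
    store4(array, _mm256_castps256_ps128(a), _mm256_castsi256_si128(idx));
    store4(array, _mm256_extractf128_ps(a, 1), _mm256_extracti128_si256(idx, 1));
}

/* AVX-512F + AVX-512VL: one vscatterdps; scale 4 == sizeof(float). */
__attribute__((target("avx512f,avx512vl")))
static inline void scatter8_avx512vl(float *array, __m256 a, __m256i idx)
{
    _mm256_i32scatter_ps(array, idx, a, 4);
}

/* Hypothetical array-based wrappers, callable without -mavx2. */
__attribute__((target("avx2")))
void scatter8_avx2_from_arrays(float *array, const float *a, const int *idx)
{
    scatter8_avx2(array, _mm256_loadu_ps(a),
                  _mm256_loadu_si256((const __m256i *)idx));
}

__attribute__((target("avx512f,avx512vl")))
static void scatter8_avx512vl_from_arrays(float *array, const float *a,
                                          const int *idx)
{
    scatter8_avx512vl(array, _mm256_loadu_ps(a),
                      _mm256_loadu_si256((const __m256i *)idx));
}

/* Run-time dispatch: AVX-512VL scatter if available, else the AVX2 path. */
void scatter8_from_arrays(float *array, const float *a, const int *idx)
{
    if (__builtin_cpu_supports("avx512f") && __builtin_cpu_supports("avx512vl"))
        scatter8_avx512vl_from_arrays(array, a, idx);
    else
        scatter8_avx2_from_arrays(array, a, idx);
}
```

Both paths write lanes in order 0 to 7, so duplicate indices resolve as in the scalar loop (the highest lane wins); Intel documents the same LSB-to-MSB ordering for overlapping scatter writes. No speedup is claimed here: a scatter instruction is not guaranteed to be faster than eight scalar stores, so measure on the target CPU.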
