Is there a more efficient way to do an AVX(2) scatter than the following code generated by gcc?


What is the most efficient way to scatter the 8 x 32-bit floats in an AVX2 register A to the memory locations indexed by another AVX2 register IDX (8 x 32-bit integers)?

gcc compiles the straightforward implementation into a sequence of shuffle/extract/movss instructions (see the attached assembler listing):

for(int i=0;i<8;i++) array[IDX[i]] = A[i];

[Image: assembler listing generated by gcc]

My question is: can this be improved with hand-coded intrinsics or assembly? Note: I am aware that SIMD gather/scatter performance is normally masked or limited by memory bandwidth, but the assumption here is that all the data resides in the L1 or L2 cache.
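For reference, here is a minimal sketch of one hand-written variant to benchmark against the gcc output. It is not claimed to be faster. AVX2 provides gather instructions but no scatter; a hardware scatter (`_mm256_i32scatter_ps`) needs AVX-512F together with AVX-512VL. The sketch therefore spills both registers to small stack buffers and performs eight scalar stores. The names `scatter8_ref`, `scatter8_spill` and `scatter8` are hypothetical, and the `#if defined(__AVX__)` guard is an assumption so that the code also builds without vector flags.

```c
#include <stdint.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Scalar reference (hypothetical name): array[idx[i]] = a[i] for 8 lanes.
   Lanes are written in ascending order, so with duplicate indices the
   highest lane wins, exactly as in the loop from the question. */
void scatter8_ref(float *array, const int32_t idx[8], const float a[8])
{
    for (int i = 0; i < 8; i++)
        array[idx[i]] = a[i];
}

#if defined(__AVX__)
/* Spill variant (hypothetical name): store both registers to the stack
   with two unaligned 256-bit stores, then reuse the scalar loop. */
void scatter8_spill(float *array, __m256i idx, __m256 a)
{
    int32_t idx_buf[8];
    float a_buf[8];

    _mm256_storeu_si256((__m256i *)idx_buf, idx);
    _mm256_storeu_ps(a_buf, a);
    scatter8_ref(array, idx_buf, a_buf);
}
#endif

/* Convenience wrapper (hypothetical) callable from plain arrays: loads the
   inputs into registers when AVX is enabled, else uses the scalar path. */
void scatter8(float *array, const int32_t idx[8], const float a[8])
{
#if defined(__AVX__)
    scatter8_spill(array,
                   _mm256_loadu_si256((const __m256i *)idx),
                   _mm256_loadu_ps(a));
#else
    scatter8_ref(array, idx, a);
#endif
}
```

Calling `scatter8(array, idx, a)` with eight indices and eight values gives the same result as the loop above. Whether the spill beats the shuffle/extract sequence depends on the microarchitecture (store-forwarding behaviour in particular), so it needs to be measured.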
