What is the most efficient way to scatter the 8 x 32-bit floats in an AVX2 register A to memory locations indexed by another AVX2 register IDX (8 x 32-bit integers)?
gcc compiles the straightforward implementation into a sequence of shuffle/extract/movss instructions (see attached assembler listing):
for (int i = 0; i < 8; i++) array[IDX[i]] = A[i];
My question is: can this be improved with hand-coded intrinsics/assembly? Note: I am aware that SIMD gather/scatter performance is normally masked/limited by memory bandwidth, but here the assumption is that all data resides in the L1 or L2 cache.
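To make the comparison concrete, here is a minimal sketch of one hand-written variant: instead of extracting lanes with shuffles, it spills both registers to small stack buffers and then issues eight scalar stores. The function names `scatter8_ps` and `scatter8_from_arrays` and the `TARGET_AVX2` macro are hypothetical, chosen only for this illustration, and the sketch assumes a GCC-compatible compiler. Whether it actually beats the compiler-generated sequence is not established here and would need to be measured.

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC/Clang. The attribute lets the AVX intrinsics compile
   even when -mavx2 is not passed on the command line. */
#if defined(__GNUC__)
#define TARGET_AVX2 __attribute__((target("avx2")))
#else
#define TARGET_AVX2
#endif

/* Hypothetical helper: scatter the 8 floats in `a` to base[idx[i]].
   Both registers are first stored to stack buffers, then written out
   with plain scalar stores. Lanes are written in ascending order, so
   with duplicate indices the highest lane wins, as in the scalar loop. */
TARGET_AVX2
static void scatter8_ps(float *base, __m256i idx, __m256 a)
{
    int32_t idx_buf[8];
    float   val_buf[8];

    _mm256_storeu_si256((__m256i *)idx_buf, idx); /* spill the indices */
    _mm256_storeu_ps(val_buf, a);                 /* spill the values  */

    for (int i = 0; i < 8; i++)
        base[idx_buf[i]] = val_buf[i];
}

/* Hypothetical convenience wrapper: load indices and values from plain
   arrays into registers, then scatter them. */
TARGET_AVX2
static void scatter8_from_arrays(float *base,
                                 const int32_t idx[8],
                                 const float a[8])
{
    __m256i vidx = _mm256_loadu_si256((const __m256i *)idx);
    __m256  va   = _mm256_loadu_ps(a);
    scatter8_ps(base, vidx, va);
}
```

The sketch trades the shuffle/extract chain for two vector stores followed by scalar loads and stores, so it relies on store-to-load forwarding from the stack buffers; that cost is an assumption to verify on the target CPU.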
