I have an operation of the form:
```c
for (int I = 0; I < 31; I++)
{
    dst[index1[I]] = src1[I];
    dst[index2[I]] = src2[I];
}
```
All the data arrays have 128-bit elements.
I am not sure of the best way to implement this in AVX-512. I can load the sources into zmm registers, but the best I can do afterwards is use extracts and 128-bit stores. Any suggestions?
On current CPUs, AVX-512 scatter instructions are not super fast: less than one 64-bit element per clock cycle on Skylake-X, and just over 1 qword per clock on Ice Lake (see footnote 1). Manual scatter should be better than emulating a 128-bit scatter in terms of a 64-bit scatter instruction, but you could benchmark both ways if you're curious.
Especially if the indices can overlap (collide) between index1 and index2, 4 separate 128-bit stores are almost certainly better than checking for collisions between pairs of index vectors. Note that scattering 4x elements from src1 and then 4x elements from src2 would give a different final result if `idx1[1] == idx2[0]`: in original source order, that dst element would get `src1[1]`, but if you're not careful it would get `src2[0]`.
With 128-bit elements, probably just do 512-bit loads and manually scatter with `vmovdqu xmm` (via `_mm512_castsi512_si128` / `_mm_storeu_si128`) and 3x `_mm512_extracti64x2_epi64` (VEXTRACTI64x2) stores.

Or do 256-bit loads and `vmovdqu xmm` + `vextracti128` stores. But if you're using 512-bit vectors in the surrounding code, you might as well use them here; you're already paying the CPU-frequency and execution-port-shutdown costs.
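For example, a minimal sketch of the 512-bit-load version (assuming `dst`, `src1`, and `src2` are arrays of `__m128i`, `index1`/`index2` hold 32-bit element indices, and the trip count is a multiple of 4; the function name and signature are illustrative, not from the question):

```c
#include <immintrin.h>
#include <stdint.h>

// Hypothetical helper: 512-bit loads, then manual scatter with 128-bit stores.
// Requires AVX-512F + AVX-512DQ (for VEXTRACTI64x2).
void scatter128_manual(__m128i *dst,
                       const __m128i *src1, const __m128i *src2,
                       const uint32_t *index1, const uint32_t *index2,
                       int n)                        // assumes n is a multiple of 4
{
    for (int i = 0; i < n; i += 4) {
        __m512i v1 = _mm512_loadu_si512(&src1[i]);   // 4x 128-bit elements of src1
        __m512i v2 = _mm512_loadu_si512(&src2[i]);   // 4x 128-bit elements of src2

        // Store in original source order (src1[i] before src2[i] for each i)
        // so colliding indices resolve the same way as the scalar loop.
        _mm_storeu_si128(&dst[index1[i+0]], _mm512_castsi512_si128(v1));
        _mm_storeu_si128(&dst[index2[i+0]], _mm512_castsi512_si128(v2));
        _mm_storeu_si128(&dst[index1[i+1]], _mm512_extracti64x2_epi64(v1, 1));
        _mm_storeu_si128(&dst[index2[i+1]], _mm512_extracti64x2_epi64(v2, 1));
        _mm_storeu_si128(&dst[index1[i+2]], _mm512_extracti64x2_epi64(v1, 2));
        _mm_storeu_si128(&dst[index2[i+2]], _mm512_extracti64x2_epi64(v2, 2));
        _mm_storeu_si128(&dst[index1[i+3]], _mm512_extracti64x2_epi64(v1, 3));
        _mm_storeu_si128(&dst[index2[i+3]], _mm512_extracti64x2_epi64(v2, 3));
    }
}
```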
It might be good if you can get the compiler to do 64-bit loads of the index data and separate the 32-bit halves with `mov eax, edx` / `shr rdx, 32`, to save on memory load/store ports. That may be possible with a GNU C `typedef uint64_t aliasing_u64 __attribute__((may_alias,aligned(4)));`.
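For instance, a sketch of that idea (the typedef is from above; the helper function and names are illustrative):

```c
#include <stdint.h>

// GNU C: allow loading a pair of uint32_t indices as one uint64_t without
// violating strict aliasing, and without requiring 8-byte alignment.
typedef uint64_t aliasing_u64 __attribute__((may_alias, aligned(4)));

// Hypothetical helper: one 64-bit load, then split into two 32-bit indices.
// On x86-64 the split can compile to mov eax, edx / shr rdx, 32.
static inline void load_index_pair(const uint32_t *p, uint32_t *lo, uint32_t *hi)
{
    aliasing_u64 pair = *(const aliasing_u64 *)p;
    *lo = (uint32_t)pair;           // p[0]  (little-endian x86)
    *hi = (uint32_t)(pair >> 32);   // p[1]
}
```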
Footnote 1: e.g. `vpscatterdq zmm` on Skylake-X has a throughput of one per 11 cycles, best case, or one per 7 cycles on Ice Lake (https://uops.info/). So that's 4x 128-bit stores per 11 cycles. A manual 128-bit scatter can probably go at least 1x 128-bit store per 2 cycles, perhaps even 1 per clock, or maybe even faster on Ice Lake with its 2 store ports, 2 load ports, and wider front-end.

Scatter instructions are also a lot of uops for the front-end: 26 or 19 on SKX or ICL respectively.
But just for fun, emulating 128-bit scatter:
We can emulate a 128-bit-element scatter using a 64-bit-element scatter such as `_mm512_i32scatter_epi64` (VPSCATTERDQ), or `_mm512_i64scatter_epi64` (VPSCATTERQQ) if your indices need to be 64-bit; otherwise save memory bandwidth when loading indices. Generate an index vector that stores consecutive pairs of qword elements to `index1[I]*2` and `index1[I]*2 + 1`.

The scale factor in a scatter can only be 1, 2, 4, or 8, the same as x86 indexed addressing modes. If you can store your "indices" as byte offsets instead of element offsets, that could be good for efficiency here. Otherwise you'll have to start by doubling each input vector of indices (by adding it to itself).
Then interleave it with an incremented copy, probably with `vpermt2d`, as in the sketch below.
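Putting those steps together, a minimal sketch (assuming `dst` is an array of `__m128i`, `idx` holds four 32-bit element indices, and `data` holds four 128-bit elements; the function name and exact shuffle constant are illustrative):

```c
#include <immintrin.h>
#include <stdint.h>

// Hypothetical helper: scatter four 128-bit elements (one zmm) using a
// 64-bit-element scatter.  Requires AVX-512F + AVX-512VL.
static inline void scatter128_emulated(__m128i *dst, __m512i data,
                                       const uint32_t *idx)
{
    __m128i i32    = _mm_loadu_si128((const __m128i *)idx);   // 4 element indices
    __m128i two_i  = _mm_add_epi32(i32, i32);                  // idx*2   (qword offsets)
    __m128i two_i1 = _mm_add_epi32(two_i, _mm_set1_epi32(1));  // idx*2 + 1

    // vpermt2d: interleave idx*2 with idx*2+1 to get
    // [2*i0, 2*i0+1, 2*i1, 2*i1+1, 2*i2, 2*i2+1, 2*i3, 2*i3+1].
    // Only elements 0..3 of each source are selected, so the undefined upper
    // halves of the cast vectors don't matter.
    __m256i qword_idx = _mm256_permutex2var_epi32(
        _mm256_castsi128_si256(two_i),
        _mm256_setr_epi32(0, 8, 1, 9, 2, 10, 3, 11),  // even lanes from a, odd from b
        _mm256_castsi128_si256(two_i1));

    // VPSCATTERDQ: store 8 qwords (= 4x 128-bit elements) to dst + index*8
    _mm512_i32scatter_epi64(dst, qword_idx, data, 8);
}
```

Note that doing one full scatter for src1 and then another for src2 is exactly the ordering the collision caveat above warns about.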
https://godbolt.org/z/TxzjWz shows how it compiles in a loop. (clang fully unrolls it by default.)