Fastest way for indexed array stores in AVX512?


I have an operation of the form:

for (int I = 0; I < 31; I++)
{
    dst[index1[I]] = src1[I];
    dst[index2[I]] = src2[I];
}

All the data arrays have 128-bit elements. [edited]

I am not sure of the best way to implement this in AVX-512. I can load the sources into zmm registers, but the best I can do afterwards is use extracts and 128-bit stores. Any suggestions?

Answer by Peter Cordes:

On current CPUs, AVX-512 scatter instructions are not super fast: less than one 64-bit element per clock cycle on Skylake-X, and just over 1 qword per clock on Ice Lake¹. Manual scatter (separate 128-bit stores) should be better than emulating a 128-bit scatter on top of a 64-bit scatter instruction, but you could benchmark both ways if you're curious.

Especially if the indices can overlap (collide) between index1 and index2, 4 separate 128-bit stores are almost certainly better than checking for collisions between pairs of index vectors. Note that scattering 4x elements from src1 and then 4x elements from src2 would give a different final result if index1[1] == index2[0]: in original source order, that dst element would get src1[1], but if you're not careful it would get src2[0].

With 128-bit elements, probably just do 512-bit loads and manually scatter with vmovdqu xmm (via _mm512_castsi512_si128 / _mm_storeu_si128) and 3x _mm512_extracti64x2_epi64 (VEXTRACTI64x2) stores.

Or 256-bit loads and vmovdqu xmm + vextracti128 stores. But if you're using 512-bit vectors in the surrounding code, you might as well use them here; you're already paying the CPU frequency and execution-port-shutdown costs.
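
A minimal sketch of that manual scatter (my own illustration, not code from the question): one 512-bit load per source covers four 128-bit elements, and the eight 128-bit stores are issued in original program order so a collision between index1 and index2 resolves the same way as the scalar loop. The function name, the 4-at-a-time grouping, and leaving leftover iterations to the caller are all assumptions. VEXTRACTI64X2 needs AVX-512DQ; with plain AVX-512F you could use _mm512_extracti32x4_epi32 instead.

    #include <immintrin.h>

    // Scatter 4 elements each from src1 and src2 (8x 128-bit stores total),
    // interleaved in original program order so overlapping indices behave
    // like the scalar loop.  dst is an array of 128-bit elements.
    static inline void manual_scatter4(__m128i *dst, const void *src1, const void *src2,
                                       const int *index1, const int *index2)
    {
        __m512i a = _mm512_loadu_si512(src1);   // src1[0..3]
        __m512i b = _mm512_loadu_si512(src2);   // src2[0..3]

        _mm_storeu_si128(&dst[index1[0]], _mm512_castsi512_si128(a));
        _mm_storeu_si128(&dst[index2[0]], _mm512_castsi512_si128(b));
        _mm_storeu_si128(&dst[index1[1]], _mm512_extracti64x2_epi64(a, 1));
        _mm_storeu_si128(&dst[index2[1]], _mm512_extracti64x2_epi64(b, 1));
        _mm_storeu_si128(&dst[index1[2]], _mm512_extracti64x2_epi64(a, 2));
        _mm_storeu_si128(&dst[index2[2]], _mm512_extracti64x2_epi64(b, 2));
        _mm_storeu_si128(&dst[index1[3]], _mm512_extracti64x2_epi64(a, 3));
        _mm_storeu_si128(&dst[index2[3]], _mm512_extracti64x2_epi64(b, 3));
    }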

It might be good if you can get the compiler to do 64-bit loads of the index data and separate the two 32-bit halves with mov eax, edx / shr rdx, 32, to save on memory load/store ports. That may be possible with a GNU C typedef uint64_t aliasing_u64 __attribute__((may_alias,aligned(4)));.
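
A sketch of what that could look like (GNU C extensions only; the helper name and the little-endian split are my assumptions):

    #include <stdint.h>

    // GNU C may_alias typedef: allows one 64-bit load across a pair of
    // 32-bit indices without violating strict aliasing.
    typedef uint64_t aliasing_u64 __attribute__((may_alias, aligned(4)));

    // Hypothetical helper: load index[0] and index[1] with a single 64-bit load,
    // then split in registers (the high half compiles to a shift by 32 on x86-64).
    static inline void load_index_pair(const int *index, uint32_t *i0, uint32_t *i1)
    {
        uint64_t both = *(const aliasing_u64 *)index;
        *i0 = (uint32_t)both;          // index[0] on little-endian x86
        *i1 = (uint32_t)(both >> 32);  // index[1]
    }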


Footnote 1: e.g. vpscatterdq zmm on Skylake-X has a throughput of one per 11 cycles, best-case. Or on Ice Lake, one per 7 cycles. https://uops.info/

So that's 4x 128-bit stores per 11 cycles. Manual 128-bit scatter can probably go at least 1x 128-bit store per 2 cycles, perhaps even 1 per clock. Or maybe even faster on Ice Lake with its 2 store ports and 2 load ports, and wider front-end.

Scatter instructions are also a lot of uops for the front-end: 26 or 19 on SKX or ICL respectively.


But just for fun, emulating 128-bit scatter:

We can emulate a 128-bit-element scatter using a 64-bit-element scatter such as _mm512_i32scatter_epi64 (VPSCATTERDQ). Or _mm512_i64scatter_epi64 (VPSCATTERQQ) if your indices need to be 64-bit; otherwise stick with 32-bit indices to save memory bandwidth when loading them.

Generate an index vector that stores consecutive pairs of qword elements to index1[I]*2 and index1[I]*2 + 1.


The scale factor in a scatter can only be 1, 2, 4, or 8, same as x86 indexed addressing modes. If you can store your "index" as byte offsets instead of element offsets, that could be good for efficiency here. Otherwise you'll have to start by doubling each input vector of indices (by adding it to itself).

Then interleave it with an incremented copy, probably with vpermt2d:

#include <immintrin.h>

// idx holds eight 32-bit offsets in units of 128-bit elements.
// data_lo supplies the 4 elements scattered with idx[0..3],
// data_hi the 4 elements scattered with idx[4..7].
void scatter(char *dst, __m512i data_lo, __m512i data_hi, __m256i idx)
{
    idx = _mm256_add_epi32(idx, idx);                             // element offset -> qword offset
    __m256i idx1 = _mm256_sub_epi32(idx, _mm256_set1_epi32(-1));  // idx + 1: subtract -1 because all-ones is cheap to generate

    const __m256i interleave_lo = _mm256_set_epi32(11,3, 10,2, 9,1,  8,0);
    const __m256i interleave_hi = _mm256_set_epi32(15,7, 14,6, 13,5, 12,4);
    __m256i idx_lo = _mm256_permutex2var_epi32(idx, interleave_lo, idx1);  // vpermt2d: {2*i0, 2*i0+1, 2*i1, 2*i1+1, ...}
    __m256i idx_hi = _mm256_permutex2var_epi32(idx, interleave_hi, idx1);

    _mm512_i32scatter_epi64(dst, idx_lo, data_lo, 8);
    _mm512_i32scatter_epi64(dst, idx_hi, data_hi, 8);
}

https://godbolt.org/z/TxzjWz shows how it compiles in a loop. (clang fully unrolls it by default.)
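
For completeness, a hypothetical caller (not part of the original answer) could feed it like this, assuming the indices are element offsets and the element count is a multiple of 8. It handles only src1; if you then scatter src2 the same way, remember the collision-ordering caveat discussed above.

    // Each call to scatter() handles 8x 128-bit elements: data_lo goes with
    // idx[0..3], data_hi with idx[4..7].
    void scatter_all(char *dst, const __m512i *src1, const int *index1, int n)
    {
        for (int i = 0; i < n / 8; i++) {
            __m512i data_lo = _mm512_loadu_si512(&src1[2*i]);      // elements 8*i .. 8*i+3
            __m512i data_hi = _mm512_loadu_si512(&src1[2*i + 1]);  // elements 8*i+4 .. 8*i+7
            __m256i idx = _mm256_loadu_si256((const __m256i *)&index1[8*i]);
            scatter(dst, data_lo, data_hi, idx);
        }
    }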