Costs of new AVX512 instruction - Scatter store

1k views Asked by At

I'm playing around with the new AVX512 instruction sets and I try to understand how they work and how one can use them.

What I try is to interleave specific data, selected by a mask. My little benchmark loads x*32 byte of aligned data from memory into two vector registers and compresses them using a dynamic mask (fig. 1). The resulting vector registers are scattered into the memory, so that the two vector registers are interleaved (fig. 2).

Compression of the two vector register

Figure 1: Compressing the two data vector registers using the same dynamically created mask.

Scatter store to interleave

Figure 2: Scatter store to interleave the compressed data.

My code looks like the following:

void zipThem( uint32_t const * const data, __mmask16 const maskCompress, __m512i const vindex, uint32_t * const result ) {
   /* Initialize a vector register containing zeroes to get the store mask */
   __m512i zeroVec     = _mm512_setzero_epi32();
   /* Load data */
   __m512i dataVec_1   = _mm512_conflict_epi32( data );
   __m512i dataVec_2   = _mm512_conflict_epi32( data + 16 );
   /* Compress the data */
   __m512i compVec_1   = _mm512_maskz_compress_epi32( maskCompress, dataVec_1 );
   __m512i compVec_2   = _mm512_maskz_compress_epi32( maskCompress, dataVec_2 );

   /* Get the store mask by compare the compressed register with the zero-register (4 means !=) */
   __mmask16 maskStore = _mm512_cmp_epi32_mask( zeroVec, compVec_1, 4 );

   /* Interleave the selected data */
   _mm512_mask_i32scatter_epi32(
      result,
      maskStore,
      vindex,
      compVec_1,
      1
   );
   _mm512_mask_i32scatter_epi32(
      result + 1,
      maskStore,
      vindex,
      compVec_2,
      1
   );
}

I compiled everything with

-O3 -march=knl -lmemkind -mavx512f -mavx512pf

I call the method for 100'000'000 elements. To actually get an overview of the behaviour of the scatter store I repeated this measurement with different values for maskCompress. I expected some kind of dependence between the time needed for execution and the number of set bits within the maskCompress. But I observed, that the tests needed roughly the same time for execution. Here is the result of the performance test: Results of the measurement Figure 3: Results of the measurements. The x-axis represents the number of written elements, depending on maskCompressed. The y-axis shows the performance.

As one can see, the performance is getting higher when more data is actual written to memory.

I did a little bit of research and came up to this: Instruction latency of avx512. Following the given link, the latency of the used instructions are constant. But to be honest, I am a little bit confused about this behaviour.

Regarding to the answers from Christoph and Peter, I changed my approach a little bit. Thus I have no idea how I can use unpackhi / unpacklo to interleave sparse vector registers, I just combined the AVX512 compress intrinsic with a shuffle (vpermi):

int zip_store_vpermit_cnt(
  uint32_t const * const data, 
  int const compressMask, 
  uint32_t * const result,
  std::ofstream & log   
  ) {
  __m512i data1 = _mm512_undefined_epi32();
  __m512i data2 = _mm512_undefined_epi32();
  __m512i comp_vec1 = _mm512_undefined_epi32();
  __m512i comp_vec2 = _mm512_undefined_epi32();
  __mmask16 comp_mask = compressMask;
  __mmask16 shuffle_mask;
  uint32_t store_mask = 0;
  __m512i shuffle_idx_lo = _mm512_set_epi32(
    23, 7, 22, 6, 
    21, 5, 20, 4,
    19, 3, 18, 2,
    17, 1, 16, 0 );
  __m512i shuffle_idx_hi = _mm512_set_epi32(
    31, 15, 30, 14,
    29, 13, 28, 12,
    27, 11, 26, 10,
    25, 9, 24, 8 );
  std::size_t pos = 0;
  int pcount = 0;
  int fullVec = 0;
  for( std::size_t i = 0; i < ELEM_COUNT; i += 32 ) {
    /* Loading the current data */
    data1 = _mm512_maskz_compress_epi32( comp_mask, _mm512_load_epi32( &(data[i]) ) );
    data2 = _mm512_maskz_compress_epi32( comp_mask, _mm512_load_epi32( &(data[i+16]) ) );
    shuffle_mask = _mm512_cmp_epi32_mask( zero, data2, 4 );
    /* Interleaving the two vector register, depending on the compressMask */
    pcount = 2*( __builtin_popcount( comp_mask ) );
    store_mask = std::pow( 2, (pcount) ) - 1;
    fullVec = pcount / 17;
    comp_vec1 = _mm512_permutex2var_epi32( data1, shuffle_idx_lo, data2 );
    _mm512_mask_storeu_epi32( &(result[pos]), store_mask, comp_vec1 );
    pos += (fullVec) * 16 + ( ( 1 - ( fullVec ) ) * pcount ); // same as pos += ( pCount >= 16 ) ? 16 : pCount;
    _mm512_mask_storeu_epi32( &(result[pos]), (store_mask >> 16) , comp_vec2 );
    pos += ( fullVec ) * ( pcount - 16 );                     // same as pos += ( pCount >= 16 ) ? pCount - 16 : 0;

    //a simple _mm512_store_epi32 produces a segfault, because the memory isn't aligned anymore :(

  }
  return pos;
}

That way the sparse data within the two vector registers can be interleaved. Unfortunately I have to manually calculate the mask for the store. This seems to be quite expensive. One could use a LUT to avoid the calculation, but I think that is not the way it should be.

Performance of storing Figure 4: Results of the performance test of 4 different kinds of store.

I know that this is not the usual way, but I have 3 questions, related to this topic and I am hopefull that one can help me out.

  1. Why should a masked store with only one set bit needs the same time as a masked store where all bits are set?

  2. Does anyone has some experience or is there a good documentation to understand the behaviour of the AVX512 scatter store?

  3. Is there a more easy or more performant way to interleave two vector registers?

Thanks for your help!

Sincerely

0

There are 0 answers