I'm playing around with the new AVX512 instruction sets and I'm trying to understand how they work and how one can use them.
What I'm trying to do is interleave specific data, selected by a mask. My little benchmark loads x * 32 bytes of aligned data from memory into two vector registers and compresses them using a dynamically created mask (fig. 1). The resulting vector registers are scattered into memory, so that the two vector registers are interleaved (fig. 2).
Figure 1: Compressing the two data vector registers using the same dynamically created mask.
Figure 2: Scatter store to interleave the compressed data.
My code looks like the following:
#include <immintrin.h>
#include <cstdint>

void zipThem( uint32_t const * const data, __mmask16 const maskCompress, __m512i const vindex, uint32_t * const result ) {
    /* Initialize a vector register containing zeroes to derive the store mask */
    __m512i zeroVec = _mm512_setzero_epi32();
    /* Load data */
    __m512i dataVec_1 = _mm512_load_epi32( data );
    __m512i dataVec_2 = _mm512_load_epi32( data + 16 );
    /* Compress the data */
    __m512i compVec_1 = _mm512_maskz_compress_epi32( maskCompress, dataVec_1 );
    __m512i compVec_2 = _mm512_maskz_compress_epi32( maskCompress, dataVec_2 );
    /* Get the store mask by comparing the compressed register with the zero
     * register (_MM_CMPINT_NE, i.e. 4, means "not equal"; note that this
     * misfires if the data itself contains zeroes) */
    __mmask16 maskStore = _mm512_cmp_epi32_mask( zeroVec, compVec_1, _MM_CMPINT_NE );
    /* Interleave the selected data: scale 1 scatters at byte granularity, and
     * the second scatter is offset by one uint32_t (result + 1) */
    _mm512_mask_i32scatter_epi32( result, maskStore, vindex, compVec_1, 1 );
    _mm512_mask_i32scatter_epi32( result + 1, maskStore, vindex, compVec_2, 1 );
}
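The construction of the index vector isn't shown above. Since I scatter with a scale of 1 (byte granularity) and offset the second scatter by one uint32_t (result + 1), vindex has to hold byte offsets in steps of 8. A minimal sketch of how it could be built (this is an assumption about the setup, not part of the function above):

/* Hypothetical vindex: byte offsets 0, 8, ..., 120, so that lane k of the
 * first scatter lands at result[2*k] and lane k of the second scatter
 * (same indices, base result + 1) at result[2*k + 1] */
__m512i const vindex = _mm512_set_epi32(
    120, 112, 104, 96, 88, 80, 72, 64,
    56, 48, 40, 32, 24, 16, 8, 0 );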
I compiled everything with
-O3 -march=knl -lmemkind -mavx512f -mavx512pf
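For reference, the timing loop looks roughly like the following sketch (the timer and the output bookkeeping are simplified, and benchZip is a hypothetical name):

#include <chrono>

/* Sketch: times zipThem() over elemCount elements for one mask value.
 * Advancing the output cursor by 2 * popcount keeps the results dense. */
double benchZip( uint32_t const * const data, __mmask16 const maskCompress,
                 __m512i const vindex, uint32_t * const result, std::size_t const elemCount ) {
    auto const start = std::chrono::high_resolution_clock::now();
    std::size_t out = 0;
    for( std::size_t i = 0; i < elemCount; i += 32 ) {
        zipThem( data + i, maskCompress, vindex, result + out );
        out += 2 * __builtin_popcount( maskCompress );
    }
    auto const stop = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double>( stop - start ).count();
}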
I call the method for 100'000'000 elements. To actually get an overview of the behaviour of the scatter store, I repeated this measurement with different values for maskCompress. I expected some kind of dependence between the execution time and the number of set bits within maskCompress, but I observed that the tests needed roughly the same time. Here is the result of the performance test:
Figure 3: Results of the measurements. The x-axis represents the number of written elements, depending on maskCompress. The y-axis shows the performance.
As one can see, the performance gets higher when more data is actually written to memory.
I did a little bit of research and came up with this: Instruction latency of avx512. According to the given link, the latency of the instructions used is constant. But to be honest, I am a little bit confused about this behaviour.
Following the answers from Christoph and Peter, I changed my approach a little bit. Since I have no idea how to use unpackhi / unpacklo to interleave sparse vector registers, I just combined the AVX512 compress intrinsic with a shuffle (vpermi2d):
#include <immintrin.h>
#include <cstdint>
#include <cstddef>
#include <fstream>

/* ELEM_COUNT is a global constant defined elsewhere (100'000'000 in my tests) */
int zip_store_vpermit_cnt(
    uint32_t const * const data,
    int const compressMask,
    uint32_t * const result,
    std::ofstream & log
) {
    __m512i data1, data2, comp_vec1, comp_vec2;
    __mmask16 comp_mask = compressMask;
    uint32_t store_mask = 0;
    /* Index vectors for _mm512_permutex2var_epi32 (vpermi2d): indices 0..15
     * select from the first source, 16..31 from the second, so these two
     * patterns interleave the registers element-wise */
    __m512i shuffle_idx_lo = _mm512_set_epi32(
        23, 7, 22, 6, 21, 5, 20, 4,
        19, 3, 18, 2, 17, 1, 16, 0 );
    __m512i shuffle_idx_hi = _mm512_set_epi32(
        31, 15, 30, 14, 29, 13, 28, 12,
        27, 11, 26, 10, 25, 9, 24, 8 );
    std::size_t pos = 0;
    int pcount = 0;
    int fullVec = 0;
    for( std::size_t i = 0; i < ELEM_COUNT; i += 32 ) {
        /* Loading and compressing the current data */
        data1 = _mm512_maskz_compress_epi32( comp_mask, _mm512_load_epi32( &(data[i]) ) );
        data2 = _mm512_maskz_compress_epi32( comp_mask, _mm512_load_epi32( &(data[i+16]) ) );
        /* Interleaving the two vector registers, depending on the compressMask */
        pcount = 2 * __builtin_popcount( comp_mask );
        /* Low pcount bits set; the 64-bit shift avoids UB when pcount == 32 */
        store_mask = (uint32_t)( ( (uint64_t)1 << pcount ) - 1 );
        fullVec = pcount / 17;
        comp_vec1 = _mm512_permutex2var_epi32( data1, shuffle_idx_lo, data2 );
        comp_vec2 = _mm512_permutex2var_epi32( data1, shuffle_idx_hi, data2 );
        _mm512_mask_storeu_epi32( &(result[pos]), store_mask, comp_vec1 );
        pos += fullVec * 16 + ( 1 - fullVec ) * pcount; // same as pos += ( pcount >= 16 ) ? 16 : pcount;
        _mm512_mask_storeu_epi32( &(result[pos]), (__mmask16)(store_mask >> 16), comp_vec2 );
        pos += fullVec * ( pcount - 16 ); // same as pos += ( pcount >= 16 ) ? pcount - 16 : 0;
        // a simple _mm512_store_epi32 produces a segfault, because the memory isn't aligned anymore :(
    }
    return pos;
}
That way the sparse data within the two vector registers can be interleaved. Unfortunately I have to manually calculate the mask for the store. This seems to be quite expensive. One could use a LUT to avoid the calculation, but I think that is not the way it should be.
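As an aside on that mask calculation: assuming BMI2 is available (KNL supports it), the whole shift dance collapses into a single bzhi, which also handles pcount == 32 correctly, because BZHI returns the source unchanged when the index is 32 or more:

/* Sketch: mask with the low pcount bits set, branch-free via BMI2 */
uint32_t store_mask_from_popcount( int const pcount ) {
    return _bzhi_u32( 0xFFFFFFFFu, pcount );
}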
Figure 4: Results of the performance test of 4 different kinds of store.
I know that this is not the usual way, but I have 3 questions related to this topic and I am hopeful that someone can help me out.
1. Why does a masked store with only one set bit need the same time as a masked store where all bits are set?
2. Does anyone have experience with, or is there good documentation on, the behaviour of the AVX512 scatter store?
3. Is there an easier or more performant way to interleave two vector registers?
Thanks for your help!
Sincerely