I'm searching for information on the divrem intrinsic sequences and their memory requirements (for the store).
These ones (filter the Intel Intrinsics Guide by SSE and SVML):
__m128i _mm_idivrem_epi32 (__m128i * mem_addr, __m128i a, __m128i b)
__m256i _mm256_idivrem_epi32 (__m256i * mem_addr, __m256i a, __m256i b)
__m128i _mm_udivrem_epi32 (__m128i * mem_addr, __m128i a, __m128i b)
__m256i _mm256_udivrem_epi32 (__m256i * mem_addr, __m256i a, __m256i b)
The Intel Intrinsics Guide states:
Divide packed 32-bit integers in a by packed elements in b, store the truncated results in dst, and store the remainders as packed 32-bit integers into memory at mem_addr.
FOR j := 0 to 3
i := 32*j
dst[i+31:i] := TRUNCATE(a[i+31:i] / b[i+31:i])
MEM[mem_addr+i+31:mem_addr+i] := REMAINDER(a[i+31:i] / b[i+31:i])
ENDFOR
dst[MAX:128] := 0
Does this mean mem_addr is expected to be aligned (as with store), unaligned (as with storeu), or is it really just a single register output (a __m128i on the stack)?
alignof(__m256i) == 32, so for portability to any other compilers that might implement this intrinsic (like clang-based ICX), you should point it at aligned memory, or just use a __m128i / __m256i temporary and then a normal store intrinsic (store or storeu) to tell the compiler where you want the result to go.

As Homer512 points out with an example in https://godbolt.org/z/9szzjEo7c , ICC stores it with movdqu. But we can see that ICC always uses unaligned loads/stores, including for dereferences of __m128i* input pointers. GCC and clang, by contrast, do use alignment-required loads/stores when you promise them alignment (e.g. by dereferencing a __m128i*).

The actual SVML function, call QWORD PTR [__svml_idivrem4@GOTPCREL+rip], returns in XMM0 and XMM1; the by-reference output operand is fortunately an invention of the intrinsics API, not part of the calling convention. So passing the address of a __m128i tmp and then storing it wherever you want will fully optimize away.
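A sketch of that recommended pattern: hand the intrinsic the address of a local __m128i (whose alignment the compiler controls), then place the remainders with an explicit storeu. Since _mm_idivrem_epi32 needs SVML (ICC/ICX), a scalar stand-in is used here so the sketch compiles on any SSE2 compiler; only the storage structure is the point.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Scalar stand-in for the SVML intrinsic: quotient returned by value,
   remainders written through the pointer.  With ICC/ICX you would call
   _mm_idivrem_epi32 directly instead of this helper. */
static __m128i idivrem_stand_in(__m128i *rem, __m128i a, __m128i b)
{
    int32_t av[4], bv[4], qv[4], rv[4];
    _mm_storeu_si128((__m128i *)av, a);
    _mm_storeu_si128((__m128i *)bv, b);
    for (int j = 0; j < 4; j++) {
        qv[j] = av[j] / bv[j];
        rv[j] = av[j] % bv[j];
    }
    *rem = _mm_loadu_si128((const __m128i *)rv);
    return _mm_loadu_si128((const __m128i *)qv);
}

/* rem_out may have any alignment: the local temporary absorbs the
   by-reference output, and _mm_storeu_si128 does the real store. */
static __m128i divrem_store_anywhere(int32_t *rem_out, __m128i a, __m128i b)
{
    __m128i tmp;  /* with SVML: __m128i q = _mm_idivrem_epi32(&tmp, a, b); */
    __m128i q = idivrem_stand_in(&tmp, a, b);
    _mm_storeu_si128((__m128i *)rem_out, tmp);
    return q;
}
```

With optimization enabled, the temporary and the extra store fold together, so nothing is lost versus having the intrinsic write to the final address directly.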