I'm searching for information on the divrem intrinsic sequences and their memory requirements (for the store).
These ones (filter the Intel Intrinsics Guide by SSE and SVML):
__m128i _mm_idivrem_epi32 (__m128i * mem_addr, __m128i a, __m128i b)
__m256i _mm256_idivrem_epi32 (__m256i * mem_addr, __m256i a, __m256i b)
__m128i _mm_udivrem_epi32 (__m128i * mem_addr, __m128i a, __m128i b)
__m256i _mm256_udivrem_epi32 (__m256i * mem_addr, __m256i a, __m256i b)
The Intel Intrinsics Guide states:
Divide packed 32-bit integers in a by packed elements in b, store the truncated results in dst, and store the remainders as packed 32-bit integers into memory at mem_addr.
FOR j := 0 to 3
i := 32*j
dst[i+31:i] := TRUNCATE(a[i+31:i] / b[i+31:i])
MEM[mem_addr+i+31:mem_addr+i] := REMAINDER(a[i+31:i] / b[i+31:i])
ENDFOR
dst[MAX:128] := 0
Does this mean mem_addr is expected to be aligned (as with store), unaligned (as with storeu), or is it really just a single register output (a __m128i on the stack)?
alignof(__m256i) == 32, so for portability to any other compilers that might implement this intrinsic (like clang-based ICX), you should point it at aligned memory, or just use a __m128i / __m256i temporary and then a normal store intrinsic (store or storeu) to tell the compiler where you want the result to go.

As Homer512 points out with an example in https://godbolt.org/z/9szzjEo7c , ICC stores it with movdqu. But we can see that ICC always uses unaligned loads/stores, including for dereferences of __m128i* input pointers. GCC and clang, by contrast, do use alignment-required loads/stores when you promise them alignment (e.g. by dereferencing a __m128i*).

The actual SVML function, call QWORD PTR [__svml_idivrem4@GOTPCREL+rip], returns in XMM0 and XMM1; the by-reference output operand is fortunately an invention of the intrinsics API, not part of the calling convention. So passing the address of a __m128i tmp and then storing it wherever you want will fully optimize away.
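A sketch of that recommended pattern: hand the intrinsic the address of a local __m128i (whose alignment the compiler controls), then place the remainders with an explicit storeu. Since _mm_idivrem_epi32 needs SVML (ICC/ICX), a scalar stand-in is used here so the sketch compiles on any SSE2 compiler; only the storage structure is the point.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Scalar stand-in for the SVML intrinsic: quotient returned by value,
   remainders written through the pointer.  With ICC/ICX you would call
   _mm_idivrem_epi32 directly instead of this helper. */
static __m128i idivrem_stand_in(__m128i *rem, __m128i a, __m128i b)
{
    int32_t av[4], bv[4], qv[4], rv[4];
    _mm_storeu_si128((__m128i *)av, a);
    _mm_storeu_si128((__m128i *)bv, b);
    for (int j = 0; j < 4; j++) {
        qv[j] = av[j] / bv[j];
        rv[j] = av[j] % bv[j];
    }
    *rem = _mm_loadu_si128((const __m128i *)rv);
    return _mm_loadu_si128((const __m128i *)qv);
}

/* rem_out may have any alignment: the local temporary absorbs the
   by-reference output, and _mm_storeu_si128 does the real store. */
static __m128i divrem_store_anywhere(int32_t *rem_out, __m128i a, __m128i b)
{
    __m128i tmp;  /* with SVML: __m128i q = _mm_idivrem_epi32(&tmp, a, b); */
    __m128i q = idivrem_stand_in(&tmp, a, b);
    _mm_storeu_si128((__m128i *)rem_out, tmp);
    return q;
}
```

With optimization enabled, the temporary and the extra store fold together, so nothing is lost versus having the intrinsic write to the final address directly.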