Are there aggregate operations in x86 AVX?


I am trying to write a simple game and need to learn some x86 assembly for vector operations. Using an xmm register as 4 packed single-precision floating-point values, are there any aggregate operations? For example:

A "MAXPS" that calculates the max of the 4 fp32 values (used for Chebyshev distance and so on).

A "SUMPS" that calculates the sum of the 4 fp32 values (used for a dot product or vector magnitude).

2

There are 2 answers

15
Simon Goater On

One non-looping, non-branching way to get the maximum float value of an SSE vector is something like the following.

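#include <xmmintrin.h>  // SSE intrinsics

static inline float _mm_hmax_ps(__m128 arg) {
  // Returns the maximum 32 bit float value in arg.
  // Requires SSE.
  __m128 temp, temp2;
  temp = _mm_shuffle_ps(arg, arg, 78);  // 78 = 01001110b: swap the two 64-bit halves
  temp = _mm_max_ps(arg, temp);         // element 0 = max(a0, a2), element 1 = max(a1, a3)
  temp2 = _mm_shuffle_ps(temp, temp, 165); // 165 = 10100101b: bring element 1 into element 0
  temp2 = _mm_max_ps(temp2, temp);      // element 0 now holds the overall maximum
  return _mm_cvtss_f32(temp2);
}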

...and an AVX version is as follows.

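#include <immintrin.h>  // AVX intrinsics (also pulls in SSE)

static inline float _mm256_hmax_ps(__m256 arg) {
  // Returns the maximum 32 bit float value in arg.
  // Requires AVX1 & SSE1.
  __m128 temp128u, temp128l;
  __m256 temp, temp2;
  // The 256-bit shuffles work within each 128-bit lane separately.
  temp = _mm256_shuffle_ps(arg, arg, 78);  // 78 = 01001110b: swap 64-bit halves per lane
  temp = _mm256_max_ps(arg, temp);
  temp2 = _mm256_shuffle_ps(temp, temp, 165); // 165 = 10100101b: bring element 1 into element 0
  temp = _mm256_max_ps(temp2, temp);       // element 0 of each lane = that lane's maximum
  temp128u = _mm256_extractf128_ps(temp, 1);  // upper lane
  temp128l = _mm256_extractf128_ps(temp, 0);  // lower lane
  temp128u = _mm_max_ps(temp128u, temp128l);  // combine the two lane maxima
  return _mm_cvtss_f32(temp128u);
}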
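
A minimal sketch of how a horizontal max like the ones above could serve the Chebyshev distance use case from the question. The helper names hmax4 and chebyshev4 are hypothetical and exist only for this illustration. It assumes SSE1 and two 4-element float arrays, and it writes the same shuffle constant 78 as _MM_SHUFFLE(1, 0, 3, 2).

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: maximum of the four floats in v (SSE1 only). */
static inline float hmax4(__m128 v) {
  /* (v2, v3, v0, v1): swap the two 64-bit halves; same as constant 78 */
  __m128 swapped = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 0, 3, 2));
  __m128 pairs = _mm_max_ps(v, swapped);
  /* (p1, p0, p3, p2): swap neighbouring elements */
  __m128 rot = _mm_shuffle_ps(pairs, pairs, _MM_SHUFFLE(2, 3, 0, 1));
  /* element 0 = max(p0, p1) = maximum of all four inputs */
  return _mm_cvtss_f32(_mm_max_ps(pairs, rot));
}

/* Hypothetical helper: Chebyshev distance max_i |a[i] - b[i]| for i = 0..3. */
static inline float chebyshev4(const float *a, const float *b) {
  __m128 diff = _mm_sub_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
  /* Clear the sign bit of every element to get |diff|. */
  __m128 absdiff = _mm_andnot_ps(_mm_set1_ps(-0.0f), diff);
  return hmax4(absdiff);
}
```

The absolute value is taken by masking off the sign bit, so the whole computation stays branch-free, like the functions above.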
0
chtz On

TL;DR: There are a few reduction instructions, but usually you should follow the approach in "Fastest way to do horizontal SSE vector sum (or other reduction)".


The only floating-point instructions (that I'm aware of) which do horizontal reductions are:

  • haddps/haddpd: These add up two adjacent values each from two input registers and stack the results into one output register (with the usual quirk that the 256-bit AVX versions behave like two separate 128-bit operations, one per lane). Do not use these if you want to reduce a single register to a scalar (see the linked answer above).
  • dpps/dppd: Calculate a dot product of up to 4 elements (dpps) or up to 2 elements (dppd). This is often only worth using if your primary goal is small binary size, or in special cases that need masking of the input/output. The dppd instruction apparently was never even extended to 256 bits. Also, never use these as a building block for large dot products (those should use only fma instructions in the main loop, with a horizontal reduction at the end).
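
A minimal sketch of the "vertical accumulation in the main loop, one horizontal reduction at the end" pattern described above. The helper names hsum4 and dot_ps are hypothetical. To stay within SSE1, the loop uses a separate multiply and add; on hardware with FMA, that pair would presumably be replaced by a single _mm_fmadd_ps. The shuffle-based sum is one of several possible sequences, not necessarily the fastest on every CPU.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: sum of the four floats in v using shuffles (SSE1 only). */
static inline float hsum4(__m128 v) {
  __m128 hi = _mm_movehl_ps(v, v);   /* (v2, v3, v2, v3) */
  __m128 sum2 = _mm_add_ps(v, hi);   /* element 0 = v0+v2, element 1 = v1+v3 */
  /* Broadcast element 1 so it can be added to element 0. */
  __m128 odd = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
  return _mm_cvtss_f32(_mm_add_ss(sum2, odd));
}

/* Hypothetical helper: dot product of two float arrays of length n. */
static inline float dot_ps(const float *a, const float *b, size_t n) {
  __m128 acc = _mm_setzero_ps();
  size_t i = 0;
  /* Main loop: purely vertical work, four partial sums kept in acc. */
  for (; i + 4 <= n; i += 4) {
    __m128 prod = _mm_mul_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i));
    acc = _mm_add_ps(acc, prod);
  }
  /* Single horizontal reduction after the loop. */
  float sum = hsum4(acc);
  /* Scalar tail for lengths that are not a multiple of 4. */
  for (; i < n; ++i) {
    sum += a[i] * b[i];
  }
  return sum;
}
```

For the vector magnitude mentioned in the question, the same function could be called with both pointers referring to the same array and the result passed to sqrtf.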