How does one efficiently perform horizontal addition of floats in a 512-bit AVX register (i.e. add the items from a single vector together)? For 128-bit and 256-bit registers this can be done using _mm_hadd_ps and _mm256_hadd_ps, but there is no _mm512_hadd_ps. The Intel intrinsics guide documents _mm512_reduce_add_ps. It doesn't correspond to a single instruction, but its existence suggests there is an optimal method. However, it doesn't appear to be defined in the header files that come with the latest snapshot of GCC, and I can't find a definition for it with Google.
I figure "hadd" can be emulated with _mm512_shuffle_ps and _mm512_add_ps, or I could use _mm512_extractf32x4_ps to break a 512-bit register into four 128-bit registers, but I want to make sure I'm not missing something better.
The Intel compiler defines intrinsics for horizontal sums, such as the _mm512_reduce_add_ps mentioned in the question.
However, as far as I can tell, these are broken into multiple instructions anyway, so I don't think you gain anything over doing a horizontal sum of the upper and lower halves of the AVX-512 register.
To get the horizontal sum, take the lower and upper 256-bit halves of the register as low and high and then do
sum = horizontal_add(low + high)
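As a rough illustration of the split-and-add approach described above, here is a minimal sketch in C. It is not taken verbatim from any library: the names horizontal_add_256, horizontal_add_512 and sum16 are hypothetical, and it assumes GCC or Clang, since it relies on the target function attribute and __builtin_cpu_supports so that it compiles without -mavx512f and falls back to a scalar loop on CPUs without AVX-512F.

```c
#include <immintrin.h>

/* Hypothetical helper: sum the 8 floats of a 256-bit register. */
__attribute__((target("avx512f")))
static inline float horizontal_add_256(__m256 a) {
    __m256 t1 = _mm256_hadd_ps(a, a);    /* pairwise sums within each 128-bit lane */
    __m256 t2 = _mm256_hadd_ps(t1, t1);  /* each lane now holds its own 4-element sum */
    __m128 hi = _mm256_extractf128_ps(t2, 1);
    __m128 s  = _mm_add_ss(_mm256_castps256_ps128(t2), hi); /* low lane + high lane */
    return _mm_cvtss_f32(s);
}

/* Hypothetical helper: sum the 16 floats of a 512-bit register by adding
   its lower and upper 256-bit halves, then reducing the 256-bit result. */
__attribute__((target("avx512f")))
static inline float horizontal_add_512(__m512 v) {
    __m256 low  = _mm512_castps512_ps256(v);
    /* Extract the upper half via the double-precision variant, which only
       needs AVX-512F (the float variant extractf32x8 needs AVX-512DQ). */
    __m256 high = _mm256_castpd_ps(
        _mm512_extractf64x4_pd(_mm512_castps_pd(v), 1));
    return horizontal_add_256(_mm256_add_ps(low, high));
}

__attribute__((target("avx512f")))
static float sum16_avx512(const float *p) {
    return horizontal_add_512(_mm512_loadu_ps(p));
}

/* Sum 16 floats, using AVX-512 only when the CPU reports support for it. */
float sum16(const float *p) {
    if (__builtin_cpu_supports("avx512f"))
        return sum16_avx512(p);
    float s = 0.0f;                      /* scalar fallback */
    for (int i = 0; i < 16; i++)
        s += p[i];
    return s;
}
```

Calling sum16 on an array of 16 floats returns their sum. Note that the vector path adds the elements in a different order than the scalar loop, so results can differ in the last bits for inputs that are not exactly representable.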
I got all this information and these functions from Agner Fog's Vector Class Library and the online Intel Intrinsics Guide.