AVX2 doesn't have any integer multiplications with sources larger than 32 bits. It does offer 32 x 32 -> 32 multiplies, as well as 32 x 32 -> 64 multiplies¹, but nothing with 64-bit sources.
Let's say I need an unsigned multiply with inputs larger than 32 bits, but no larger than 52 bits - can I simply use the floating-point DP multiply or FMA instructions, and will the output be bit-exact when the integer inputs and results can be represented in 52 or fewer bits (i.e., in the range [0, 2^52-1])?
How about the more general case where I want all 104 bits of the product? Or the case where the integer product takes more than 52 bits (i.e., the product has non-zero values in bit indexes > 52) - but I want only the low 52 bits? In this latter case, the MUL is going to give me higher bits and round away some of the lower bits (perhaps that's what IFMA helps with?).
Edit: in fact, perhaps it could do anything up to 2^53, based on this answer - I had forgotten that the implied leading 1 before the mantissa effectively gives you another bit.
¹ Interestingly, the 64-bit product PMULDQ operation has half the latency and twice the throughput of the 32-bit PMULLD version, as Mysticial explains in the comments.
Yes it's possible. But as of AVX2, it's unlikely to be better than the scalar approaches with MULX/ADCX/ADOX.
There's virtually an unlimited number of variations of this approach for different input/output domains. I'll only cover 3 of them, but they are easy to generalize once you know how they work.
Disclaimers:
Signed doubles in the range: [-2^51, 2^51]
This is the simplest one and the only one which is competitive with the scalar approaches. The final scaling is optional depending on what you want to do with the outputs. So this can be considered only 3 instructions. But it's also the least useful since both the inputs and outputs are floating-point values.
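The split into a high and a low part can be modeled one lane at a time in scalar C++. This is only a sketch under stated assumptions, not the original listing: the names `mul52_signed_model`, `Split`, `ROUND` and `SCALE` are made up for illustration, `std::fma` stands in for one lane of a vector FMA, and 3 * 2^103 is one constant that forces the first FMA to round the product to a multiple of 2^52.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical scalar model of a single SIMD lane (all names are illustrative).
// For integer-valued doubles a, b in [-2^51, 2^51] it returns lo and hi with
// a*b == lo + hi * 2^52 and lo in [-2^51, 2^51].
struct Split { double lo, hi; };

Split mul52_signed_model(double a, double b) {
    const double LIMIT = std::ldexp(1.0, 51);     // 2^51
    assert(std::fabs(a) <= LIMIT && std::fabs(b) <= LIMIT);

    const double ROUND = std::ldexp(3.0, 103);    // 3 * 2^103, whose ulp is 2^52
    const double SCALE = std::ldexp(1.0, -52);    // 2^-52

    // First FMA: adding ROUND rounds a*b to the nearest multiple of 2^52.
    double h = std::fma(a, b, ROUND);
    h -= ROUND;                      // exact: h is now the rounded high part

    // Second FMA: a*b - h is the low part, computed with a single rounding
    // that turns out to be exact because the residual fits in 52 bits.
    double l = std::fma(a, b, -h);

    return { l, h * SCALE };         // final scaling is optional
}
```

For example, (2^51 - 1)^2 = 2^102 - 2^52 + 1, so the model should return hi = 2^50 - 1 and lo = 1. The sketch must be compiled without fast-math style options.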
It is absolutely critical that both the FMAs stay fused. And this is where fast-math optimizations can break things. If the first FMA is broken up, then L is no longer guaranteed to be in the range [-2^51, 2^51]. If the second FMA is broken up, L will be completely wrong.
Signed integers in the range: [-2^51, 2^51]
Building off of the first example, we combine it with a generalized version of the fast double <-> int64 conversion trick.
This one is more useful since you're working with integers. But even with the fast conversion trick, most of the time will be spent doing conversions. Fortunately, you can eliminate some of the input conversions if you are multiplying by the same operand multiple times.
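A scalar model of the integer variant may make the conversion trick easier to follow. It is a sketch, not the original code: the helper names and `CONVERT_D` are assumptions, and the value 3 * 2^51 for `CONVERT_U` is one constant that works for this input range. The idea is that adding an integer to the bit pattern of a magic constant, then subtracting the constant as a double, converts int64 to double; the reverse direction adds the constant as a double and subtracts the bit patterns.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

static uint64_t bits_of(double d) { uint64_t u; std::memcpy(&u, &d, sizeof u); return u; }
static double double_of(uint64_t u) { double d; std::memcpy(&d, &u, sizeof d); return d; }

struct SplitI { int64_t lo, hi; };

// Hypothetical scalar model of one lane: A*B == lo + hi * 2^52
// for A, B in [-2^51, 2^51]. Names and constants are illustrative.
SplitI mul52_signed_int_model(int64_t A, int64_t B) {
    const int64_t LIMIT = int64_t(1) << 51;
    assert(-LIMIT <= A && A <= LIMIT && -LIMIT <= B && B <= LIMIT);

    const double CONVERT_U = std::ldexp(3.0, 51);  // 3 * 2^51, ulp is 1
    const double CONVERT_D = 1.5;                  // ulp is 2^-52

    // int64 -> double: integer add on the bit pattern, then a double subtract.
    double a = double_of(bits_of(CONVERT_U) + uint64_t(A)) - CONVERT_U;  // a == A
    double b = double_of(bits_of(CONVERT_D) + uint64_t(B)) - CONVERT_D;  // b == B * 2^-52

    // High half: the FMA rounds A*B / 2^52 to an integer on top of CONVERT_U.
    double h = std::fma(a, b, CONVERT_U);
    int64_t hi = static_cast<int64_t>(bits_of(h) - bits_of(CONVERT_U));
    h -= CONVERT_U;                                // undo the normalization

    // Low half: exact residual in [-0.5, 0.5], i.e. lo * 2^-52.
    double l = std::fma(a, b, -h);
    l += CONVERT_D;                                // double -> int64 via bit pattern
    int64_t lo = static_cast<int64_t>(bits_of(l) - bits_of(CONVERT_D));

    return { lo, hi };
}
```

If the same operand is reused across many multiplies, its converted double (`a` or `b` here) can be computed once and kept.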
Unsigned integers in the range: [0, 2^52)
Finally we get the answer to the original question. This builds off of the signed integer solution by adjusting the conversions and adding a correction step.
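As a rough illustration, the unsigned variant can be modeled per lane in scalar C++. This is a sketch under assumptions rather than the original listing: the helper names, `CONVERT_D` and `CONVERT_S` are made up here, the conversions use OR/XOR on the bit patterns because the inputs are non-negative, and the correction step fixes up the case where the rounded high half is one too large.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

static uint64_t to_bits(double d) { uint64_t u; std::memcpy(&u, &d, sizeof u); return u; }
static double from_bits(uint64_t u) { double d; std::memcpy(&d, &u, sizeof d); return d; }

struct SplitU { uint64_t lo, hi; };

// Hypothetical scalar model of one lane: A*B == lo + hi * 2^52
// for A, B in [0, 2^52), with lo and hi also in [0, 2^52).
SplitU mul52_unsigned_model(uint64_t A, uint64_t B) {
    const uint64_t MASK = (uint64_t(1) << 52) - 1;
    assert(A <= MASK && B <= MASK);

    const double CONVERT_U = std::ldexp(1.0, 52);  // 2^52, ulp is 1
    const double CONVERT_D = 1.0;                  // ulp is 2^-52
    const double CONVERT_S = 1.5;                  // used for the signed low half

    // uint64 -> double: OR the integer into the mantissa, then subtract.
    double a = from_bits(to_bits(CONVERT_U) | A) - CONVERT_U;  // a == A
    double b = from_bits(to_bits(CONVERT_D) | B) - CONVERT_D;  // b == B * 2^-52

    // High half, rounded to nearest; XOR strips the exponent bits.
    double h = std::fma(a, b, CONVERT_U);
    uint64_t hi = to_bits(h) ^ to_bits(CONVERT_U);
    h -= CONVERT_U;

    // Low half as a two's-complement value in [-2^51, 2^51].
    double l = std::fma(a, b, -h);
    l += CONVERT_S;
    uint64_t lo = to_bits(l) - to_bits(CONVERT_S);

    // Correction: if lo is negative, hi was rounded up; borrow 2^52 from it.
    hi -= lo >> 63;
    lo &= MASK;
    return { lo, hi };
}
```

For instance, with A = 2^52 - 1 and B = 1 the first FMA rounds the high half up to 1 and leaves a low half of -1; the correction then yields hi = 0 and lo = 2^52 - 1.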
But at this point, we're at 13 instructions - half of which are high-latency instructions, not counting the numerous FP <-> int bypass delays. So it's unlikely this will be winning any benchmarks. By comparison, a 64 x 64 -> 128-bit SIMD multiply can be done in 16 instructions (14 if you pre-process the inputs).
The correction step can be omitted if the rounding mode is round-down or round-to-zero. The only instruction where this matters is h = _mm256_fmadd_pd(a, b, CONVERT_U);. So on AVX512, you can override the rounding for that instruction and leave the rounding mode alone.
Final Thoughts:
It's worth noting that the 2^52 range of operation can be reduced by adjusting the magic constants. This may be useful for the first solution (the floating-point one) since it gives you extra mantissa to use for accumulation in floating-point. This lets you bypass the need to constantly convert back and forth between int64 and double like in the last 2 solutions.
While the 3 examples here are unlikely to be better than scalar methods, AVX512 will almost certainly tip the balance. Knights Landing in particular has poor throughput for ADCX and ADOX.
And of course all of this is moot when AVX512-IFMA comes out. That reduces a full 52 x 52 -> 104-bit product to 2 instructions and gives the accumulation for free.
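To make the "2 instructions" concrete, here is a scalar sketch of what one 64-bit lane of the two IFMA multiply-add instructions (VPMADD52LUQ and VPMADD52HUQ) is expected to compute, assuming the documented semantics: the low 52 bits of each source are multiplied into a 104-bit product, and either the low or the high 52 bits of that product are added to a 64-bit accumulator. The function names are hypothetical, and `unsigned __int128` is a GCC/Clang extension used here only to hold the full product.

```cpp
#include <cassert>
#include <cstdint>

const uint64_t MASK52 = (uint64_t(1) << 52) - 1;

// Full 104-bit product of the low 52 bits of b and c (illustrative helper).
static unsigned __int128 product52(uint64_t b, uint64_t c) {
    unsigned __int128 p = (unsigned __int128)(b & MASK52) * (c & MASK52);
    assert((p >> 104) == 0);   // two 52-bit values always fit in 104 bits
    return p;
}

// Model of the "low" instruction: acc + low 52 bits of the product.
uint64_t madd52lo_model(uint64_t acc, uint64_t b, uint64_t c) {
    return acc + (uint64_t)(product52(b, c) & MASK52);
}

// Model of the "high" instruction: acc + high 52 bits of the product.
uint64_t madd52hi_model(uint64_t acc, uint64_t b, uint64_t c) {
    return acc + (uint64_t)(product52(b, c) >> 52);
}
```

Calling both with a zero accumulator gives the two halves of the 104-bit product; passing a running sum instead is what makes the accumulation free.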