I'm trying to perform a float-by-integer multiplication without losing too much precision. In particular, my float, mean, is in [0, 1), and the integer, value, is an arbitrary uint32.
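To make the semantics concrete, here's a scalar Go version of the operation I want (a minimal sketch; the function name is just for illustration):

package main

import (
	"fmt"
	"math"
)

// scale widens both operands to float64 (which represents every uint32
// exactly), multiplies, and rounds back to uint32. Since mean < 1, the
// rounded product always fits in a uint32.
func scale(mean float32, value uint32) uint32 {
	return uint32(math.RoundToEven(float64(mean) * float64(value)))
}

func main() {
	fmt.Println(scale(0.5, math.MaxUint32)) // 2147483648
}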
Since float32 cannot represent every uint32 exactly (it has a 24-bit significand, so e.g. 16777217 rounds to 16777216), my approach was to zero-extend value to uint64 across two 256-bit registers, convert both to float64, multiply, then convert back to uint32. So far, I have something like this (using go-asm, but hopefully it still gets the idea across):
// first, widen the four float32 means to float64
VCVTPS2PD(mean, Y3)
// split the eight uint32 values into two 128-bit halves
VEXTRACTI128(U8(0), value, X1)
VEXTRACTI128(U8(1), value, X2)
// zero-extend each half's uint32s to uint64
VPMOVZXDQ(X1, Y1)
VPMOVZXDQ(X2, Y2)
// convert the uint64s to float64 (AVX-512VL+DQ; see the edit below)
VCVTUQQ2PD(Y1, Y1)
VCVTUQQ2PD(Y2, Y2)
// multiply by the means (the second group of means is elided here)
VMULPD(Y1, Y3, Y1)
VMULPD(Y2, Y3, Y2)
// convert back to uint32 using the current rounding mode (AVX-512VL)
VCVTPD2UDQ(Y1, X1)
VCVTPD2UDQ(Y2, X2)
But then I can't seem to figure out how to blend the two halves back into one 256-bit register. In other words, I can't seem to find the opposite of the extract operation.
Is there a better approach to this that I'm not considering?
EDIT: Some clarifications that came up in the comments:
- Each multiplication uses a distinct (mean, value) pair; this is not a broadcasted scalar multiplication.
- The mean is calculated from other float32s that are also in the range [0, 1). The comments suggested using fixed point for the computation; that is a viable option if it's helpful (see the sketch after this list). To be concrete, the mean is a probability, and the operation I'd like is to bias the value based on a mean of probabilities; how that mean is computed is beyond this question's scope.
- I have AVX2 and FMA instructions but not AVX512.
- Round-to-nearest is preferred: this operation will be done repeatedly, so I don't want to introduce bias. But if truncation is substantially faster, I'm interested in the tradeoff; this code is used in an estimation algorithm, so slight bias is not catastrophic as long as it's characterized.
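Regarding the fixed-point suggestion, here's a minimal scalar sketch of what I understand the idea to be (the 32.32 split, the rounding constant, and the names are my own illustration, not settled code):

package main

import (
	"fmt"
	"math"
)

// scaleFixed biases value by mean using 32.32 fixed point. m is mean
// scaled to 32 fractional bits; since mean < 1, m fits in 32 bits and
// value*m cannot overflow a uint64.
func scaleFixed(mean float32, value uint32) uint32 {
	m := uint64(math.Round(float64(mean) * (1 << 32)))
	// Adding 1<<31 before the shift rounds to nearest; exact halves
	// round up, a slight bias compared to round-half-even.
	return uint32((uint64(value)*m + (1 << 31)) >> 32)
}

func main() {
	fmt.Println(scaleFixed(0.25, 1000)) // 250
}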