When I convert a 32-bit float to a 64-bit unsigned integer in C++, everything works as expected. An overflow sets the FE_OVERFLOW flag (from cfenv) and the conversion returns the value 0.
std::feclearexcept(FE_ALL_EXCEPT);
float a = ...;
uint64_t b = a;
std::fexcept_t flags;
std::fegetexceptflag(&flags, FE_ALL_EXCEPT);
But when I convert a 32-bit float to a 32-bit unsigned integer like this:
std::feclearexcept(FE_ALL_EXCEPT);
float a = ...;
uint32_t b = a;
std::fexcept_t flags;
std::fegetexceptflag(&flags, FE_ALL_EXCEPT);
It behaves exactly the way the 64-bit conversion did, except that the upper 32 bits are truncated. It is equivalent to:
std::feclearexcept(FE_ALL_EXCEPT);
float a = ...;
uint64_t b2 = a;
uint32_t b = b2 & std::numeric_limits<uint32_t>::max();
std::fexcept_t flags;
std::fegetexceptflag(&flags, FE_ALL_EXCEPT);
So the overflow occurs only if the exponent is greater than or equal to 64; for exponents between 32 and 64, the conversion returns the lower 32 bits of the 64-bit result without setting the overflow flag. This is very strange, because you would expect it to overflow at exponent 32.
Is this the way it should be, or am I doing something wrong?
Compiler is: LLVM version 6.0 (clang-600.0.45.3) (based on LLVM 3.5svn)
Overflow in the conversion from floating-point to integer is undefined behavior. You cannot rely on it being done with a single assembly instruction or with an instruction that overflows for the exact set of values for which you would like the overflow flag to be set.
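Since the out-of-range conversion itself is undefined, one portable option is to test the range before converting rather than inspecting the floating-point flags afterwards. The following is only a minimal sketch of that idea; the helper name `checked_float_to_u32` is hypothetical and not part of any library.

```cpp
#include <cstdint>

// Hypothetical helper (not a standard function): converts f to uint32_t
// only when the truncated value is representable, so the cast is never
// executed on an out-of-range input.
bool checked_float_to_u32(float f, std::uint32_t& out) {
    // 4294967296.0f is 2^32, which is exactly representable as a float.
    // Values in (-1, 0) truncate toward zero and give 0, which is defined.
    // NaN fails both comparisons, so it is rejected as well.
    if (f > -1.0f && f < 4294967296.0f) {
        out = static_cast<std::uint32_t>(f);
        return true;
    }
    return false;  // out of range or NaN; out is left untouched
}
```

With this, the caller decides what "overflow" means (return 0, saturate, report an error) instead of depending on which instruction the compiler happened to emit.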
The assembly instruction cvttsd2si (or its single-precision counterpart cvttss2si for a float source), likely to have been generated, indeed sets flags when it overflows, but a 64-bit variant of the instruction may be generated when converting to a 32-bit integer type. A good reason to do so is when truncating a floating-point value to an unsigned 32-bit integer, as in your question: after executing the 64-bit signed instruction, all 32 low bits of the destination register are set correctly for every floating-point value for which the conversion is defined. There is no unsigned variant of the cvttsd2si instruction.
From the Intel manual:
This blog post, despite being for C, expands on this subject.