Converting float to uint64 and uint32 behaves strangely

Question

2.1k views Asked by Alexander Meißner At 01 January 2025 at 16:39

When I convert a 32-bit float to a 64-bit unsigned integer in C++, everything works as expected. An overflow sets the FE_OVERFLOW flag (from cfenv) and the conversion returns 0.

std::feclearexcept(FE_ALL_EXCEPT);
float a = ...;
uint64_t b = a;
std::fexcept_t flags;
std::fegetexceptflag(&flags, FE_ALL_EXCEPT);

But when I convert a 32-bit float to a 32-bit unsigned integer like this:

std::feclearexcept(FE_ALL_EXCEPT);
float a = ...;
uint32_t b = a;
std::fexcept_t flags;
std::fegetexceptflag(&flags, FE_ALL_EXCEPT);

It behaves exactly like the 64-bit conversion, except that the upper 32 bits are truncated. It is equivalent to:

std::feclearexcept(FE_ALL_EXCEPT);
float a = ...;
uint64_t b2 = a;
uint32_t b = b2 & std::numeric_limits<uint32_t>::max();
std::fexcept_t flags;
std::fegetexceptflag(&flags, FE_ALL_EXCEPT);

So the overflow is only reported when the exponent is greater than or equal to 64. For exponents from 32 up to 64, the conversion returns the lower 32 bits of the 64-bit result without setting the overflow flag. This is very strange, because you would expect it to overflow at exponent 32.

Is this the way it should be, or am I doing something wrong?

Compiler is: LLVM version 6.0 (clang-600.0.45.3) (based on LLVM 3.5svn)

**Pascal Cuoq** · Accepted Answer · 2015-06-06T07:53:16+00:00

Overflow in the conversion from floating-point to integer is undefined behavior. You cannot rely on it being done with a single assembly instruction or with an instruction that overflows for the exact set of values for which you would like the overflow flag to be set.

The assembly instruction cvttsd2si, which is likely what was generated, does set flags when it overflows, but the compiler may emit the 64-bit variant of the instruction even when converting to a 32-bit integer type. It has a good reason to do so when truncating a floating-point value to an unsigned 32-bit integer, as in your question: there is no unsigned variant of the cvttsd2si instruction, and after the 64-bit signed instruction executes, the low 32 bits of the destination register are correct for every floating-point value for which the conversion is defined.
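The effect described in the question can be modeled in well-defined C++ by spelling out the two steps explicitly. This is a sketch of what the generated code plausibly amounts to, not a statement about what any particular compiler emits. The function name `low32_via_signed64` is hypothetical, and the function is only defined for inputs whose truncated value fits in a signed 64-bit integer.

```cpp
#include <cstdint>

// Hypothetical model of "convert with the 64-bit signed instruction, then
// keep the low 32 bits". Precondition: trunc(x) fits in int64_t.
std::uint32_t low32_via_signed64(float x) {
    std::int64_t wide = static_cast<std::int64_t>(x);  // defined while in int64 range
    return static_cast<std::uint32_t>(wide);           // integer narrowing is modulo 2^32
}
```

For inputs below 2^32 this agrees with a direct conversion to `uint32_t`. For 6442450944.0f (2^32 + 2^31) it yields 2147483648, the low 32 bits, with nothing signaled, which matches the behavior observed in the question for exponents between 32 and 64.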

From the Intel manual:

CVTTSD2SI—Convert with Truncation Scalar Double-Precision FP Value to Signed Integer

…

If a converted result exceeds the range limits of signed doubleword integer (in non-64-bit modes or 64-bit mode with REX.W/VEX.W=0), the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value (80000000H) is returned.

If a converted result exceeds the range limits of signed quadword integer (in 64-bit mode and REX.W/VEX.W = 1), the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value (80000000_00000000H) is returned.
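Note that the manual speaks of the invalid exception, so on this hardware the flag to look for would be FE_INVALID rather than FE_OVERFLOW. If the goal is to have a flag set for exactly the values that do not fit in 32 bits, one approach is to raise it yourself after a range check. The sketch below assumes the platform defines FE_INVALID and that `float` is IEEE 754 single precision. The helper name `to_u32_flagged` and the choice of returning 0 on failure are illustrative assumptions.

```cpp
#include <cfenv>
#include <cstdint>

// Hypothetical helper: raises FE_INVALID and returns 0 when x does not fit
// in uint32_t after truncation, instead of relying on the hardware.
std::uint32_t to_u32_flagged(float x) {
    if (!(x > -1.0f && x < 4294967296.0f)) {  // also true for NaN
        std::feraiseexcept(FE_INVALID);
        return 0;
    }
    return static_cast<std::uint32_t>(x);
}
```

A caller would clear the flags with `std::feclearexcept(FE_ALL_EXCEPT)`, call the helper, and then query `std::fetestexcept(FE_INVALID)`, which is simpler than inspecting an `std::fexcept_t` object.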

This blog post, despite being for C, expands on this subject.
