# Round to IEEE 754 precision but keep binary format

If I convert the decimal number 3120.0005 to float (32-bit) representation, the number gets rounded down to 3120.00048828125.

Assuming we're using a fixed point number with a scale of 10^12 then 1000000000000 = 1.0 and 3120000500000000 = 3120.0005.

What would the formula/algorithm be to round down to the nearest IEEE 754 precision to get 3120000488281250? I would also need a way to get the result of rounding up (3120000732421875).

Best Solutions

If you divide by the decimal scaling factor and convert the result to `float`, you get the nearest representable float. To get the neighbour on the other side of the original value, `nextafterf` (`std::nextafter` in C++) can be used:

```c
#include <float.h>
#include <math.h>
#include <stdio.h>

/* Convert a float to the 1e12-scaled fixed-point representation. */
long long scale_to_fixed(float f)
{
    float intf = truncf(f);                 /* integer part */
    long long result = 1000000000000LL;
    result *= (long long)intf;
    result += round((f - intf) * 1.0e12);   /* fractional part, rounded */
    return result;
}

/* not needed, always good enough to use (float)(n / 1.0e12) */
float scale_from_fixed(long long n)
{
    float result = (n % 1000000000000LL) / 1.0e12;
    result += n / 1000000000000LL;
    return result;
}

int main(void)
{
    long long x = 3120000500000000;
    float x_reduced = scale_from_fixed(x);  /* nearest float to x */
    long long y1 = scale_to_fixed(x_reduced);
    long long yfloor = y1, yceil = y1;
    if (y1 < x) {
        /* nearest float is below x: the ceiling is the next float up */
        yceil = scale_to_fixed(nextafterf(x_reduced, FLT_MAX));
    }
    else if (y1 > x) {
        /* nearest float is above x: the floor is the next float down */
        yfloor = scale_to_fixed(nextafterf(x_reduced, -FLT_MAX));
    }

    printf("%lld\n%lld\n%lld\n", yfloor, x, yceil);
    return 0;
}
```

Results:

3120000488281250

3120000500000000

3120000732421875


To handle the values as `float` scaled by `1e12` and compute the result of "rounding up (3120000732421875)", the key is understanding that you are looking for the next larger representable value after the 32-bit `float` nearest to `x / 1.0e12`. For a positive, finite float, that is the value whose stored bit pattern, read as an unsigned integer, is larger by one. While you can arrive at this value mathematically, a `union` between `float` and `unsigned` (or `uint32_t`) provides a direct way to interpret the stored 32-bit value of the floating-point number as an unsigned value (see footnote 1).
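The "mathematical" route mentioned above can be sketched as follows. This is only an illustration, assuming a positive, normal, finite `float` with a 24-bit significand (IEEE 754 binary32); the helper names `ulp_of` and `next_up_by_ulp` are hypothetical and do not come from the answer.

```c
#include <assert.h>
#include <float.h>
#include <math.h>

/* Hypothetical helper: spacing between f and the next larger float.
   Assumes f is positive, normal and finite (IEEE 754 binary32). */
float ulp_of(float f)
{
    int e;

    assert(isfinite(f) && f >= FLT_MIN);
    (void)frexpf(f, &e);            /* f = m * 2^e with 0.5 <= m < 1 */
    return ldexpf(1.0f, e - 24);    /* 24 significand bits: spacing is 2^(e-24) */
}

/* Hypothetical helper: next float above f, reached by adding one spacing step. */
float next_up_by_ulp(float f)
{
    return f + ulp_of(f);
}
```

For 3120.00048828125 the exponent `e` is 12 (2048 <= f < 4096), so the spacing is 2^-12 = 0.000244140625, and adding it gives 3120.000732421875, the rounded-up value from the question.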

A simple example using the union `prev` to hold the reduced value of `x`, and a second instance `next` holding that unsigned value plus one (`+1`), can be:

```c
#include <stdio.h>
#include <inttypes.h>

int main (void) {

    uint64_t x = 3120000500000000;
    union {                         /* union between float and uint32_t */
        float f;
        uint32_t u;
    } prev = { .f = x / 1.0e12 },   /* x reduced to float, bit pattern as .u */
      next = { .u = prev.u + 1u };  /* 2nd union, bit pattern incremented by 1 */

    printf ("prev : %" PRIu64 "\n   x : %" PRIu64 "\nnext : %" PRIu64 "\n",
            (uint64_t)(prev.f * 1e12), x, (uint64_t)(next.f * 1e12));
}
```

Example Use/Output

```
$ ./bin/pwr2_prev_next
prev : 3120000488281250
   x : 3120000500000000
next : 3120000732421875
```
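The example above relies on the conversion to `float` rounding down, so that the converted value is the floor. When the nearest float lies above `x`, the converted value is already the ceiling and the bit pattern has to be decremented instead. A minimal sketch of handling both cases with the same `union` idea follows; the names `float_bits`, `to_fixed` and `float_bracket` are hypothetical, and the code assumes `x / 1e12` is a positive, finite value well inside the normal `float` range.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

#define SCALE 1.0e12    /* fixed-point scale from the question */

typedef union { float f; uint32_t u; } float_bits;

/* Hypothetical helper: float back to fixed point (positive values only). */
static uint64_t to_fixed(float f)
{
    return (uint64_t)((double)f * SCALE + 0.5);   /* round to nearest */
}

/* Hypothetical helper: store in *lo and *hi the fixed-point images of the
   floats just below and just above x (both equal x's image on an exact hit). */
void float_bracket(uint64_t x, uint64_t *lo, uint64_t *hi)
{
    float_bits nearest = { .f = (float)((double)x / SCALE) };
    float_bits other = nearest;
    uint64_t n;

    assert(nearest.f > 0.0f && isfinite(nearest.f));
    n = to_fixed(nearest.f);

    *lo = *hi = n;
    if (n < x) {                    /* rounded down: step up for the ceiling */
        other.u = nearest.u + 1u;
        *hi = to_fixed(other.f);
    } else if (n > x) {             /* rounded up: step down for the floor */
        other.u = nearest.u - 1u;
        *lo = to_fixed(other.f);
    }
}
```

Calling `float_bracket(3120000500000000, &lo, &hi)` should give the same floor and ceiling as the output above; an input such as `3120000700000000`, whose nearest float is the upper neighbour, should give the same pair.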

Footnotes:

1. As an alternative, you can use a pointer to `char` to hold the address of the floating point type and interpret the 4-byte value stored at that location as `unsigned` without running afoul of C11 Standard - §6.5 Expressions (p6,7) (the "Strict Aliasing Rule"), but the use of a `union` is preferred.
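A byte-wise copy can be sketched with `memcpy`, which is swapped in here for the pointer-to-`char` approach the footnote describes: it also accesses the object as bytes, so it stays within the aliasing rules. The helper names are hypothetical, and the sketch assumes a 32-bit IEEE 754 `float` and a positive, finite input.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy the float's bytes into a 32-bit integer. */
static uint32_t float_to_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Hypothetical helper: copy a 32-bit pattern back into a float. */
static float bits_to_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Next representable float above f.
   Assumes sizeof(float) == 4 and that f is positive and finite. */
float next_up_bits(float f)
{
    assert(sizeof(float) == sizeof(uint32_t));
    assert(f > 0.0f && isfinite(f));
    return bits_to_float(float_to_bits(f) + 1u);
}
```

With `f = 3120.0005f` (stored as 3120.00048828125), `next_up_bits(f)` yields 3120.000732421875, matching `next` in the `union` example.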