If I convert the decimal number 3120.0005 to float (32-bit) representation, the number gets rounded down to 3120.00048828125.

Assuming we're using a fixed point number with a scale of 10^12 then 1000000000000 = 1.0 and 3120000500000000 = 3120.0005.

What would the formula/algorithm be to round down to the nearest IEEE 754 precision to get 3120000488281250? I would also need a way to get the result of rounding up (3120000732421875).

2 Answers

Ben Voigt On Best Solutions

If you divide by the decimal scaling factor and convert the result to float, you get the nearest representable float. To get the neighboring float on the other side of your value, nextafterf can be used (std::nextafter in C++):

#include <float.h>
#include <math.h>
#include <stdio.h>

long long scale_to_fixed(float f)
{
    float intf = truncf(f);
    long long result = 1000000000000LL;
    result *= (long long)intf;
    result += round((f - intf) * 1.0e12);
    return result;
}

/* not needed, always good enough to use (float)(n / 1.0e12) */
float scale_from_fixed(long long n)
{
    float result = (n % 1000000000000LL) / 1.0e12;
    result += n / 1000000000000LL;
    return result;
}

int main()
{
    long long x = 3120000500000000;
    float x_reduced = scale_from_fixed(x);
    long long y1 = scale_to_fixed(x_reduced);
    long long yfloor = y1, yceil = y1;
    if (y1 < x) {
        yceil = scale_to_fixed(nextafterf(x_reduced, FLT_MAX));
    }
    else if (y1 > x) {
        yfloor = scale_to_fixed(nextafterf(x_reduced, -FLT_MAX));
    }

    printf("%lld\n%lld\n%lld\n", yfloor, x, yceil);
}

Results:

3120000488281250

3120000500000000

3120000732421875

David C. Rankin On

In order to handle the values as float scaled by 1e12 and compute the next larger representable value, e.g. "rounding up (3120000732421875)", the key is understanding that you are looking for the float whose 32-bit representation immediately follows that of x / 1.0e12. For a positive, finite float, adding 1 to its bit pattern (read as an unsigned integer) gives exactly that next float. While you can arrive at this value mathematically, a union between float and unsigned (or uint32_t) provides a direct way to interpret the stored 32-bit value of the floating-point number as an unsigned value (see footnote 1).

A simple example, using the union prev to hold the reduced value of x and a separate instance next to hold its unsigned value plus one, can be:

#include <stdio.h>
#include <inttypes.h>

int main (void) {

    uint64_t x = 3120000500000000;
    union {                         /* union between float and uint32_t */
        float f;
        uint32_t u;
    } prev = { .f = x / 1.0e12 },   /* x reduced to float, bit pattern as .u */
      next = { .u = prev.u + 1u };  /* 2nd union, bit pattern + 1 = next float up */

    printf ("prev : %" PRIu64 "\n   x : %" PRIu64 "\nnext : %" PRIu64 "\n", 
            (uint64_t)(prev.f * 1e12), x, (uint64_t)(next.f * 1e12));
}

Example Use/Output

$ ./bin/pwr2_prev_next
prev : 3120000488281250
   x : 3120000500000000
next : 3120000732421875
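Note that the example relies on the conversion to float having rounded down. For an input such as 3120000700000000 the conversion rounds up instead, so the first value would already be the upper neighbor. A minimal sketch that checks the direction before stepping could look like the following. The names step_up, step_down and bracket are hypothetical, and the code assumes IEEE 754 binary32 floats, positive finite values, and x < 2^53 so that (double)x is exact.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers: move a positive, finite float by one bit pattern. */
static float step_up(float f)
{
    union { float f; uint32_t u; } v = { .f = f };
    assert(f > 0.0f && v.u < 0x7f7fffffu);    /* positive and below FLT_MAX */
    v.u += 1u;                                /* next representable float up */
    return v.f;
}

static float step_down(float f)
{
    union { float f; uint32_t u; } v = { .f = f };
    assert(f > 0.0f && v.u <= 0x7f7fffffu);   /* positive and finite */
    v.u -= 1u;                                /* next representable float down */
    return v.f;
}

/* Store in *lo and *hi the floats bracketing x / 10^12 (equal if x is exact). */
static void bracket(uint64_t x, float *lo, float *hi)
{
    float f = (float)(x / 1.0e12);
    double p = f * 1.0e12;                    /* computed in double */

    *lo = *hi = f;
    if (p < (double)x)                        /* f is below x: hi is one step up */
        *hi = step_up(f);
    else if (p > (double)x)                   /* f is above x: lo is one step down */
        *lo = step_down(f);
}
```

With this, bracket(3120000500000000, &lo, &hi) and bracket(3120000700000000, &lo, &hi) should both yield the pair 3120.00048828125 and 3120.000732421875.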

Footnotes:

1. As an alternative, you can use a pointer to char to hold the address of the floating point type and interpret the 4-byte value stored at that location as unsigned without running afoul of C11 Standard - §6.5 Expressions (p6,7) (the "Strict Aliasing Rule"), but the use of a union is preferred.
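A related byte-wise option, not the one named in the footnote, is to copy the bytes with memcpy, which also avoids the aliasing problem. The sketch below assumes float and uint32_t are both 4 bytes and that float is IEEE 754 binary32; the function names float_bits and bits_to_float are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Return the stored bit pattern of f as an unsigned 32-bit integer. */
static uint32_t float_bits(float f)
{
    uint32_t u;
    assert(sizeof u == sizeof f);   /* assumption: 32-bit float */
    memcpy(&u, &f, sizeof u);       /* byte copy, no aliasing violation */
    return u;
}

/* Rebuild a float from a 32-bit pattern. */
static float bits_to_float(uint32_t u)
{
    float f;
    assert(sizeof f == sizeof u);
    memcpy(&f, &u, sizeof f);
    return f;
}
```

For example, bits_to_float(float_bits(prev) + 1u) gives the next float above a positive, finite prev, mirroring what the second union does in the answer.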