Largest number a floating point number can hold while still retaining a certain amount of decimal precision


I would like to know the largest positive number a 32-bit float can hold while still being able to represent a decimal resolution of approximately 1/1000.

So, for example, if the float represents kilowatts, how big can the kilowatt value get before I lose the ability to convert it to watts without significant loss of precision (say, a few watts)?


There are 2 answers

aka.nice On BEST ANSWER

I assume that you want the distance between two consecutive floats to be less than 1/1000, so that a value in kilowatts has a precision of 1 watt or better.

This is related to the unit of least precision (ulp) of the float, also known as the unit in the last place.

In binary formats, the magnitude of a (normal) float has the general form 1.fractionBits * 2^exponent.

If the float has a precision p,

  • its significand, composed of the leading one and the fraction bits, has p bits,
  • there are p-1 fraction bits,
  • the leading 1 represents a quantity 2^exp,
  • the first fraction bit represents a quantity 2^(exp-1),
  • the last fraction bit represents a quantity 2^(exp-(p-1)); this is the ulp of the float.

Now the requirement is ulp < 1/1000, that is 2^(exp+1-p) < 1/1000.

If we tighten the requirement slightly to ulp <= 1/1024, that is 2^-10:

 exp+1-p <= -10

So the float exponent must be

exp <= p-11

For IEEE 754:

  • single precision, p=24, exp<=13, the float magnitude must be < 2^14, that is 16384.0.
  • double precision, p=53, exp<=42, the float magnitude must be < 2^43, that is about 8.8 * 10^12.

Now, if you want a precision of a few watts, just do the arithmetic: each doubling of the tolerated error doubles the limit. About 2 watts raises the limit to 2^15, 4 watts to 2^16, 8 watts to 2^17, and so on.

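We can generalize the formulation: if you want a precision of 10^-n, express it as a power of two: 10^-n = 2^(log(10^-n)/log(2)) = 2^(-n*log2(10)).

The requirement ulp <= 10^-n then becomes exp+1-p <= -n*log2(10), so the exponent must satisfy exp <= p - 1 - n*log2(10). Since exp is an integer, this means exp <= p - 1 - ceil(n*log2(10)).

The limit is then abs(float) < 2^(exp+1), that is abs(float) < 2^(p-ceil(n*log2(10))).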

chux - Reinstate Monica On

Brute force is neither elegant nor practical for double, yet with 32-bit float it is easy enough to step through the float values and check.

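#include <float.h>
#include <math.h>
#include <stdio.h>
#define LIMIT (1/1000.0f)  // required resolution: 0.001

int main(void) {
  // Walk upward one float at a time, starting from LIMIT.
  float previous = LIMIT;
  float next = LIMIT;
  do {
    previous = next;
    next = nextafterf(previous, FLT_MAX);
  } while (next - previous <= LIMIT);  // stop at the first gap wider than LIMIT
  // Alternative loop condition (left disabled):
  //} while (previous + LIMIT > previous);
  // Print the float just below the boundary, the boundary, and the float just above.
  printf("%.9g %a\n", nextafterf(previous,0), nextafterf(previous,0));
  printf("%.9g %a\n", previous, previous);
  printf("%.9g %a\n", next, next);
  puts("Done");
}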

Output

16383.999 0x1.fffffep+13
16384 0x1p+14
16384.002 0x1.000002p+14
Done

So the float just below 16384.0 is within a 1/1000 step of it (the spacing there is 2^-10, about 0.00098), while the float just above 16384.0 is more than a 1/1000 step away (spacing 2^-9, about 0.00195).