Implementation of strtof(), floating-point multiplication and mantissa rounding issues

326 views Asked by At

This question is not so much about the C as about the algorithm. I need to implement strtof() function, which would behave exactly the same as GCC one - and do it from scratch (no GNU MPL etc.).

Let's skip checks, consider only correct inputs and positive numbers, e.g. 345.6e7. My basic algorithm is:

  1. Split the number into fraction and integer exponent, so for 345.6e7 fraction is 3.456e2 and exponent is 7.
  2. Create a floating-point exponent. To do this, I use these tables:
static const float powersOf10[] = {
   1.0e1f,
   1.0e2f,
   1.0e4f,
   1.0e8f,
   1.0e16f,
   1.0e32f
};

static const float minuspowersOf10[] = {
   1.0e-1f,
   1.0e-2f,
   1.0e-4f,
   1.0e-8f,
   1.0e-16f,
   1.0e-32f
};

and get float exponent as a product of corresponding bits in integer exponent, e.g. 7 = 1+2+4 => float_exponent = 1.0e1f * 1.0e2f * 1.0e4f.

  1. Multiply fraction by floating exponent and return the result.

And here comes the first problem: since we do a lot of multiplications, we get a somewhat big error becaule of rounding multiplication result each time. So, I decided to dive into floating point multiplication algorithm and implement it myself: a function takes a number of floats (in my case - up to 7) and multiplies them on bit level. Consider I have uint256_t type to fit mantissas product.

Now, the second problem: round mantissas product to 23 bits. I've tried several rounding methods (round-to-even, Von Neumann rounding - a small article about them), but no of them can give the correct result for all the test numbers. And some of them really confuse me, like this one:

7038531e-32. GCC's strtof() returns 0x15ae43fd, so correct unbiased mantissa is 2e43fd. I go for multiplication of 7.038531e6 (biased mantissa d6cc86) and 1e-32 (b.m. cfb11f). The resulting unbiased mantissa in binary form is

( 47)0001 ( 43)0111 ( 39)0010 ( 35)0001
( 31)1111 ( 27)1110 ( 23)1110 ( 19)0010
( 15)1011 ( 11)0101 (  7)0001 (  3)1101

which I have to round to 23 bits. However, by all rounding methods I have to round it up, and I'll get 2e43fe in result - wrong! So, for this number the only way to get correct mantissa is just to chop it - but chopping does not work for other numbers.

Having this worked on countless nights, my questions are:

  1. Is this approach to strtof() correct? (I know that GCC uses GNU MPL for it, and tried to see into it. However, trying to copy MPL's implementation would require porting the entire library, and this is definitely not what I want). Maybe this split-then-multiply algorithm is inevitably prone to errors? I did some other small tricks, (e.g. create exponent tables for all integer exponents in float range), but they led to even more failed conversions.

  2. If so, did I miss something while rounding? I thought so for long time, but this 7038531e-32 number completely confused me.

1

There are 1 answers

0
Spektre On BEST ANSWER

If I want to be as precise as I can I usually do stuff like this (however I usually do the reverse operation float -> text):

  1. use only integers (no floats what so ever)

    as you know float is integer mantissa bit-shifted by integer exponent so no need for floats.

    For constructing the final float datatype you can use simple union with float and 32 bit unsigned integer in it ... or pointers to such types pointing to the same address.

    This will avoid rounding errors for numbers that fit completely and shrink error for those that don't fit considerably.

  2. use hex numbers

    You can convert your text of decadic number on the run into its hex counterpart (still as text) from there creating mantissa and exponent integers is simple.

    Here:

    is C++ implementation example of dec2hex and hex2dec number conversions done on text

  3. use more bits for mantissa while converting

    for task like this and single precision float I usually use 2 or 3 32 bit DWORDs for the 24 bit mantissa to still hold some precision after the multiplications If you want to be precise you have to deal with 128+24 bits for both integer and fractional part of number so 5x32 bit numbers in sequence.

For more info and inspiration see (reverse operation):

Your code will be just inverse of that (so many parts will be similar)

Since I post that I made even more advanced version that recognize formatting just like printf , supports much more datatypes and more without using any libs (however its ~22.5 KByte of code). I needed it for MCUs as GCC implementation of prints are not very good there ...