How to generate an IEEE 754 Single-precision float using only integer arithmetic?


Assuming a low-end microprocessor with no floating-point arithmetic, I need to generate a number in IEEE 754 single-precision floating-point format to write out to a file.

I need to write a function that takes three integers (the sign, the whole part and the fractional part) and returns a 4-byte array holding the IEEE 754 single-precision representation.

Something like:

// Convert 75.65 to 4 byte IEEE 754 single precision representation
unsigned char *bytes = convert(0, 75, 65);

Does anybody have any pointers or example C code please? I'm particularly struggling to understand how to convert the mantissa.

There are 5 answers

chux - Reinstate Monica

The basic approach is:

  1. Target the binary32 float format.
  2. Form a binary fixed-point representation of the combined whole and fractional (hundredths) parts. This code uses a structure that holds the whole and hundredths fields separately. It is important that the whole field is at least 32 bits wide.
  3. Shift left/right (*2 and /2) until the MSbit is in the implied bit position, counting the shifts. A robust solution also notes any non-zero bits shifted out.
  4. Form a biased exponent.
  5. Round the mantissa and drop the implied bit.
  6. Form the sign (not done here).
  7. Combine the results of the above 3 steps to form the answer.
  8. Sub-normals, infinities and Not-a-Number cannot result from whole/hundredths input, so generating those float special cases is not addressed here.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>  // printf() in test_covert()
#define IMPLIED_BIT 0x00800000L

typedef struct {
  int_least32_t whole;
  int hundreth;
} x_xx;

int_least32_t covert(int whole, int hundreth) {
  assert(whole >= 0 && hundreth >= 0 && hundreth < 100);
  if (whole == 0 && hundreth == 0) return 0;
  x_xx x = { whole, hundreth };
  int_least32_t expo = 0;
  int sticky_bit = 0; // Note any 1 bits shifted out
  while (x.whole >= IMPLIED_BIT * 2) {
    expo++;
    sticky_bit |= x.hundreth % 2;
    x.hundreth /= 2;
    x.hundreth += (x.whole % 2)*(100/2);
    x.whole /= 2;
  }
  while (x.whole < IMPLIED_BIT) {
    expo--;
    x.hundreth *= 2;
    x.whole *= 2;
    x.whole += x.hundreth / 100;
    x.hundreth %= 100;
  }
  int32_t mantissa = x.whole;
  // Round to nearest - ties to even
  if (x.hundreth >= 100/2 && (x.hundreth > 100/2 || x.whole%2 || sticky_bit)) {
    mantissa++;
  }
  if (mantissa >= (IMPLIED_BIT * 2)) {
    mantissa /= 2;
    expo++;
  }
  mantissa &= ~IMPLIED_BIT;  // Toss MSbit as it is implied in final
  expo += 23 + 127; // 23 fraction bits below the implied bit + binary32 exponent bias
  expo <<= 23; // Offset
  return expo | mantissa;
}

void test_covert(int whole, int hundreths) {
  union {
    uint32_t u32;
    float f;
  } u;
  u.u32 = covert(whole, hundreths);
  volatile float best = whole + hundreths / 100.0;
  printf("%10d.%02d --> %15.6e %15.6e Same:%d\n", whole, hundreths, u.f, best,
      best == u.f);
}

#include <limits.h>
int main(void) {
  test_covert(75, 65);
  test_covert(0, 1);
  test_covert(INT_MAX, 99);
  return 0;
}

Output

        75.65 -->    7.565000e+01    7.565000e+01 Same:1
         0.01 -->    1.000000e-02    1.000000e-02 Same:1
2147483647.99 -->    2.147484e+09    2.147484e+09 Same:1

Known issues: sign not applied.
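
The missing sign can be added afterwards, because it occupies bit 31 of the result on its own. A minimal sketch, assuming the magnitude bits have already been computed; `apply_sign` is a hypothetical helper name, not part of the code above:

```c
#include <stdint.h>

/* Hypothetical helper: set the binary32 sign bit (bit 31) on an
   already-encoded magnitude. Any nonzero 'sign' means negative. */
uint32_t apply_sign(uint32_t magnitude_bits, int sign)
{
    uint32_t sign_bit = sign ? UINT32_C(0x80000000) : 0;
    /* Clear bit 31 first so a stale sign cannot leak through. */
    return (magnitude_bits & UINT32_C(0x7FFFFFFF)) | sign_bit;
}
```

For example, `apply_sign(0x42974CCD, 1)` turns the encoding of 75.65 into `0xC2974CCD`, the encoding of -75.65.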

Daniel

You can use a software floating point compiler/library.
See https://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html
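
With such a library in place, ordinary `float` code compiles into calls to integer routines, so the conversion can be written directly. A rough sketch, assuming the toolchain's `float` is IEEE 754 binary32 (typical, but not guaranteed by the C standard); the function name is hypothetical. Because the division and the addition each round, the result may differ from the nearest float in the last bit for some inputs:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: let the compiler's (soft) float arithmetic do
   the work, then copy out the raw bits.
   Assumes 'float' is a 32-bit IEEE 754 binary32 type. */
uint32_t float_bits_via_float_math(int sign, uint32_t whole, unsigned hundredths)
{
    float f = (float)whole + (float)hundredths / 100.0f;
    if (sign) f = -f;

    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);  /* well-defined way to read the bits */
    return bits;
}
```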

David R Tribble

You will need to generate the sign (1 bit), the exponent (8 bits, a biased power of 2), and the fraction/mantissa (23 bits).

Bear in mind that the 24-bit significand has an implicit leading '1' bit, which means that its most significant bit (2^23) is not stored in the IEEE format. For example, given a significand of 0xB55555 (24 bits), the actual bits stored would be 0x355555 (23 bits).

Also bear in mind that the significand is normalized so that the binary point is immediately to the right of the implicit leading '1' bit. So a stored 23-bit fraction of 011 0101 0101... represents the 24-bit binary value 1.011 0101 0101... This means that the exponent has to be adjusted accordingly.
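
The three fields can be packed with shifts and masks. A minimal sketch, assuming the caller has already normalized the value and computed a biased exponent in the range 1 to 254; `pack_binary32` is a hypothetical name:

```c
#include <stdint.h>

/* Hypothetical helper: pack sign (1 bit), biased exponent (8 bits) and a
   normalized 24-bit significand (bit 23 set) into binary32 layout. */
uint32_t pack_binary32(unsigned sign, unsigned biased_exp, uint32_t significand24)
{
    /* Drop the implicit leading 1 (bit 23); only 23 bits are stored. */
    uint32_t fraction = significand24 & UINT32_C(0x007FFFFF);

    return ((uint32_t)(sign & 1u) << 31)
         | ((uint32_t)(biased_exp & 0xFFu) << 23)
         | fraction;
}
```

For 75.65, which is about 1.18203127 * 2^6, the fields are sign 0, biased exponent 6 + 127 = 133 and significand 0x974CCD, giving `0x42974CCD`.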

too honest for this site

Does the value have to be written big endian or little endian? Reversed bit ordering?

If you are free to choose the format, consider writing the value as a decimal string. That way the conversion is easy: just write the integer part followed by "e0" as the exponent (or omit the exponent and write ".0").
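
A text representation needs only integer formatting. A small sketch, extended here to include the hundredths from the question; the function name and parameter layout are assumptions for illustration:

```c
#include <stdio.h>

/* Hypothetical helper: format sign/whole/hundredths as decimal text,
   e.g. "75.65". Returns what snprintf returns (characters needed). */
int format_decimal(char *buf, size_t size, int negative,
                   unsigned long whole, unsigned hundredths)
{
    return snprintf(buf, size, "%s%lu.%02u",
                    negative ? "-" : "", whole, hundredths);
}
```

The `%02u` conversion keeps the leading zero, so 5 hundredths is written as ".05" rather than ".5".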

For the binary representation, have a look at Wikipedia. It is best to first assemble the bit fields into a uint32_t; the structure is given in the linked article. Note that you may have to round if the integer has more than 24 significant bits. Remember to normalize the generated value.

The second step is to serialize the uint32_t to a uint8_t array. Mind the endianness of the result!

Also, use uint8_t for the result if you really want 8-bit values; in any case it should be an unsigned type. For the intermediate representation, uint32_t is recommended, as it guarantees that you operate on exactly 32-bit values.
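
The serialization step can be written with shifts so that it does not depend on the byte order of the processor doing the work. A minimal sketch; both function names are hypothetical, and which one to use depends on the file format being written:

```c
#include <stdint.h>

/* Hypothetical helper: most significant byte first (big endian). */
void store_u32_be(uint8_t out[4], uint32_t v)
{
    out[0] = (uint8_t)(v >> 24);
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)v;
}

/* Hypothetical helper: least significant byte first (little endian). */
void store_u32_le(uint8_t out[4], uint32_t v)
{
    out[0] = (uint8_t)v;
    out[1] = (uint8_t)(v >> 8);
    out[2] = (uint8_t)(v >> 16);
    out[3] = (uint8_t)(v >> 24);
}
```

For the bits of 75.65, `0x42974CCD`, the big-endian bytes are 42 97 4C CD and the little-endian bytes are CD 4C 97 42.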

Persixty

You haven't had a go yet, so no giveaways.

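Remember that two 32-bit integers a and b, read as the number a.b with b holding the fraction in binary (b / 2^32), can be regarded as a single 64-bit integer with an exponent of 2^-32 (where ^ is exponentiation). A decimal fraction such as hundredths has to be converted to that binary fraction first.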

So without doing anything you've got it in the form:

s * m * 2^e

The only problem is your mantissa is too long and your number isn't normalized.

A bit of shifting and adding/subtracting with a possible rounding step and you're done.
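
One way to fill in those steps is sketched below. It assumes the compiler provides `uint64_t` (possibly emulated on a small processor) and that the fraction arrives as hundredths in the range 0 to 99; both function names are hypothetical. The low bit of the fixed-point value is used as a sticky flag so that an inexact division by 100 cannot produce a false tie during rounding:

```c
#include <stdint.h>

/* Hypothetical helper: whole.hundredths -> unsigned 32.32 fixed point
   (value = result / 2^32). Requires hundredths < 100. */
uint64_t to_fixed_32_32(uint32_t whole, unsigned hundredths)
{
    uint64_t scaled = (uint64_t)hundredths << 32;  /* hundredths * 2^32 */
    uint64_t frac = scaled / 100;                  /* fits in 32 bits */
    if (scaled % 100 != 0) frac |= 1;              /* sticky: bits were lost */
    return ((uint64_t)whole << 32) | frac;
}

/* Hypothetical helper: 32.32 fixed point -> binary32 bits,
   round to nearest, ties to even. Never yields subnormal, inf or NaN. */
uint32_t fixed_32_32_to_binary32(uint64_t fx, int sign)
{
    uint32_t sign_bit = sign ? UINT32_C(0x80000000) : 0;
    if (fx == 0) return sign_bit;

    /* Normalize: move the leading 1 up to bit 63, which has weight 2^31. */
    int exp2 = 31;
    while ((fx >> 63) == 0) { fx <<= 1; exp2--; }

    uint32_t sig = (uint32_t)(fx >> 40);             /* top 24 bits */
    uint64_t rest = fx & ((UINT64_C(1) << 40) - 1);  /* 40 discarded bits */
    uint64_t half = UINT64_C(1) << 39;
    if (rest > half || (rest == half && (sig & 1u))) sig++;
    if (sig == UINT32_C(0x01000000)) { sig >>= 1; exp2++; }  /* rounding carry */

    return sign_bit
         | ((uint32_t)(exp2 + 127) << 23)            /* biased exponent */
         | (sig & UINT32_C(0x007FFFFF));             /* drop implicit bit */
}
```

With these, `fixed_32_32_to_binary32(to_fixed_32_32(75, 65), 0)` yields `0x42974CCD`, the binary32 encoding of 75.65.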