C++ floating-point addition with bit shift error issues

Question

99 views Asked by Bobascotch At 14 December 2023 at 06:30

I am trying to implement floating-point addition using only bit shifts and bitwise AND (&) and OR (|). I am aware that built-in floating-point addition is more practical, but this is needed for a class and I am lost. The code passes most of the cases my professor provided, except 1000000000.5 + 0.008. For that input the function returns 1.03436e+09, which is not correct.

#include <iostream>
// Function to reinterpret integer as float
float int_as_float(int value) {
    return *((float*)&value);
}

// Function to reinterpret float as integer
int float_as_int(float value) {
    return *((int*)&value);
}

// Function to add two floats without using floating-point instructions
float add_without_float(float a, float b) {
    // Check if one of the values is zero
    if (a == 0.0) {
        return b;
    }
    else if (b == 0.0) {
        return a;
    }

    // Reinterpret floats as integers for bit manipulation
    int aInt = float_as_int(a);
    int bInt = float_as_int(b);

    // Extracting sign, exponent, and mantissa components
    int signA = (aInt >> 31) & 1; // Extract sign bit of float 'a'
    int signB = (bInt >> 31) & 1; // Extract sign bit of float 'b'

    int expA = ((aInt >> 23) & 0xFF) - 127; // Extract exponent of float 'a' and adjust bias
    int expB = ((bInt >> 23) & 0xFF) - 127; // Extract exponent of float 'b' and adjust bias

    int mantissaA = ((aInt & 0x007FFFFF) | 0x00800000); // extracting the mantissa in the bottom 23 bits
    int mantissaB = ((bInt & 0x007FFFFF) | 0x00800000); // extracting the mantissa in the bottom 23 bits

    int shift = expA - expB;
    if (expA > expB) {
        mantissaB >>= shift; // Shift mantissa of 'b' to align exponents if 'expA > expB'
        expB = expA; // Update exponent of 'b'
    }
    else if (expA < expB) {
        mantissaA >>= -shift; // Shift mantissa of 'a' to align exponents if 'expA < expB'
        expA = expB; // Update exponent of 'a'
    }

    int resultSign;
    int resultMantissa;
    if (signA == signB) {
        resultMantissa = mantissaA + mantissaB;
        resultSign = signA;
    }
    else if (signA == 1) {
        resultMantissa = mantissaB - mantissaA;
        if (resultMantissa < 0) {
            resultSign = 1;
            resultMantissa = -resultMantissa;
        }
        else {
            resultSign = 0;
        }
        mantissaA = 0 - mantissaA; // twos compliment
    }
    else if (signB == 1) {
        resultMantissa = mantissaA - mantissaB;
        if (resultMantissa < 0) {
            resultSign = 1;
            resultMantissa = -resultMantissa;
        }
        else {
            resultSign = 0;
        }
    }

    //resultSign = (resultMantissa >> 31) & 0x1; // if result sign is 1, reverse it.. & 0x1 isolates one bit of 1
    unsigned resultExp = expA; // Adjust the biased exponent for the result
    int result;

    if (resultSign == 1) {
        resultMantissa = 0 - resultMantissa; // twos compliment
    }

    // Normalize the result mantissa in a fixed number of steps
    if (resultMantissa == 0) {
        result = 0;
        return result;
    }
    else if ((resultMantissa & 0x010000000) != 0) {  // 25th bit == 1
        resultMantissa = resultMantissa >> 1;
        resultExp++;
    }
    else {
        while ((resultMantissa & 0x00800000) == 0) { // 24th bit == 1

            resultMantissa = resultMantissa << 1;
            resultExp--;
        }
    }
    // Assemble the result by combining sign, exponent, and mantissa parts
    //std::cout << (resultSign << 31) << std::endl;
    //std::cout << ((resultExp + 127) << 23) << std::endl;
    //std::cout << (resultMantissa & 0x007FFFFF) << std::endl;

    result = (resultSign << 31) | ((resultExp + 127) << 23) | (resultMantissa & 0x007FFFFF);

    return int_as_float(result); // Reinterpret result as a float and return it
}

int main() {
    float a = 1000000000.5;
    float b = 0.008;

    // Perform addition without using floating-point instructions
    float result = add_without_float(a, b);

    std::cout << "Result: " << result << std::endl; // Display the result

    return 0;
}

Any guidance would be greatly appreciated :)

Original Q&A

There is 1 answer

**nielsen** · Accepted Answer · 2023-12-14T10:00:20+00:00

Disregarding the potential problems with type-punning, the problem is in the bit shifting of the mantissa. More precisely here:

   mantissaB >>= shift; // Shift mantissa of 'b' to align exponents if 'expA > expB'

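In the test case, shift is 36, which is larger than the bit width of the int type (presumably 32 on your platform). Shifting by a count greater than or equal to the width of the type is undefined behavior. In your case the result simply turns out wrong: 1.03436e+09 is consistent with the hardware using only the low 5 bits of the count, i.e. shifting by 4 instead of 36.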
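To see where the 36 comes from, the exponent difference can be cross-checked with the standard library. This is only a diagnostic sketch: `alignment_shift` is a hypothetical helper that is not in the original code, and it uses `std::ilogb`, which the assignment itself would not allow. For normal floats, `std::ilogb` returns the same unbiased exponent as `((bits >> 23) & 0xFF) - 127` in the question.

```cpp
#include <cmath>

// Hypothetical diagnostic helper, not part of the original code.
// Returns the exponent difference that the question's code stores in `shift`.
int alignment_shift(float a, float b) {
    // std::ilogb yields the unbiased binary exponent of its argument.
    return std::ilogb(a) - std::ilogb(b);
}
```

For the failing inputs the exponents are 29 and -7, so `alignment_shift(1000000000.5f, 0.008f)` is 36.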

A fix could be to set mantissaB to 0 if shift is too big:

   mantissaB = (shift < 8*sizeof(int)) ? mantissaB >> shift : 0;

A similar fix must be made for mantissaA, which is shifted by -shift in the other branch.
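Both branches could share one guarded shift. The following is a minimal sketch, assuming the mantissas are non-negative as in the question; `shift_right_or_zero` and `align_mantissas` are hypothetical names introduced here for illustration, not part of the original code.

```cpp
#include <cassert>

// Hypothetical helper: right-shift a non-negative mantissa, yielding 0 once
// the count reaches the width of int (where a plain >> would be undefined).
int shift_right_or_zero(int mantissa, int shift) {
    assert(shift >= 0);  // callers pass the positive exponent difference
    const int int_bits = static_cast<int>(8 * sizeof(int));  // 8 bits per byte, as in the one-line fix
    return (shift < int_bits) ? (mantissa >> shift) : 0;
}

// The alignment step from the question, rewritten with the guard.
// The operand with the smaller exponent is shifted and takes the larger exponent.
void align_mantissas(int& mantissaA, int& expA, int& mantissaB, int& expB) {
    const int shift = expA - expB;
    if (shift > 0) {
        mantissaB = shift_right_or_zero(mantissaB, shift);
        expB = expA;
    } else if (shift < 0) {
        mantissaA = shift_right_or_zero(mantissaA, -shift);
        expA = expB;
    }
}
```

With the exponents from the failing test case (29 and -7), the smaller operand's mantissa becomes 0, so it no longer disturbs the larger one.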

Note: If you add #include <climits>, then "8" can be replaced by the more correct CHAR_BIT, which is not necessarily 8 (though it will be hard to find a system where it is not).
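As for the type-punning concern set aside at the start of this answer: dereferencing a casted pointer, as int_as_float and float_as_int do, violates the strict aliasing rules. A common alternative is to copy the bytes with `std::memcpy`. The sketch below assumes a 32-bit IEEE 754 float; the names `float_to_bits` and `bits_to_float` are hypothetical, and the switch to `std::uint32_t` is my choice, not something from the original code.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "this sketch assumes a 32-bit float");

// Copy the object representation instead of dereferencing a casted pointer.
std::uint32_t float_to_bits(float value) {
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}

// Inverse of float_to_bits: rebuild a float from its 32-bit pattern.
float bits_to_float(std::uint32_t bits) {
    float value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

An unsigned type also sidesteps right shifts of negative values, which were implementation-defined before C++20. The sign bit can then be read with `bits >> 31` without masking.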
