Handling loss of precision in subtracting two doubles that are close to each other

3.3k views Asked by At

I have a project to do where we are to solve the matrix equation AX=B for x, given that A is a tridiagonal matrix. I did this project in C++, got the program to produce the right Matrix X, but when trying to report the error back to the user, A*X-B, I get an erroneous error!! It is due to the fact that I am subtracing A*X and B, whose entries are arbitrarily close to each other. I had two ideas on how to handle this, element-by-element:

  1. According to this article, http://en.wikipedia.org/wiki/Loss_of_significance, there could be as many as -log2(1-y/x) bits lost in straight subtraction x-y. Let's scale both x and y by pow(2,bitsLost), subtract the two, and then scale them back down by dividing by pow(2,bitsLost)
  2. Stressed so much in the numeric methods course this is for: take the arithmetic conjugate! Instead of double difference = x-y; use double difference = (x*x-y*y)/(x+y);

OK, so why haven't you chose a method and moved on?

I tried all three methods (including straight subtraction) here: http://ideone.com/wfkEUp . I would like to know two things:

  1. Between the "scaling and descaling" method (for which I intentionally chose a power of two) and the arithmetic conjugate method, which one produces less error (in terms of subtracting the large numbers)?
  2. Which method is computationally more efficient? /*For this, I was going to say the scaling method was going to be more efficient with a linear complexity versus the seemed quadratic complexity of the conjugate method, but I don't know the complexity of log2()*/

Any and all help would be welcomed!!

P.S.: All three methods seem to return the same double in the sample code...


Let's see some of your code No problem; here is my Matrix.cpp code

#include "ExceptionType.h"
#include "Matrix.h"
#include "MatrixArithmeticException.h"
#include <iomanip>
#include <iostream>
#include <vector>

Matrix::Matrix()
{
    //default size for Matrix is 1 row and 1 column, whose entry is 0
    std::vector<long double> rowVector(1,0);
    this->matrixData.assign(1, rowVector);
}

Matrix::Matrix(const std::vector<std::vector<long double> >& data)
{
    this->matrixData = data;
    //validate matrixData
    validateData();
}

//getter functions
//Recall that matrixData is a vector of a vector, whose elements should be accessed like matrixData[row][column].
//Each rowVector should have the same size.
unsigned Matrix::getRowCount() const { return matrixData.size(); }

unsigned Matrix::getColumnCount() const { return matrixData[0].size(); }

//matrix validator should just append zeroes into row vectors that are of smaller dimension than they should be...
void Matrix::validateData()
{
    //fetch the size of the largest-dimension rowVector
    unsigned largestSize = 0;
    for (unsigned i = 0; i < getRowCount(); i++)
    {
        if (largestSize < matrixData[i].size())
            largestSize = matrixData[i].size();
    }
    //make sure that all rowVectors are of that dimension
    for (unsigned i = 0; i < getRowCount(); i++)
    {
        //if we find a rowVector where this isn't the case
        if (matrixData[i].size() < largestSize)
        {
            //add zeroes to it so that it becomes the case
            matrixData[i].insert(matrixData[i].end(), largestSize-matrixData[i].size(), 0);
        }
    }

}
//operators
//+ and - operators should check to see if the size of the first matrix is exactly the same size as that of the second matrix
Matrix Matrix::operator+(const Matrix& B)
{
    //if the sizes coincide
    if ((getRowCount() == B.getRowCount()) && (getColumnCount() == B.getColumnCount()))
    {
        //declare the matrixData
        std::vector<std::vector<long double> > summedData = B.matrixData;    //since we are in the scope of the Matrix, we can access private data members
        for (unsigned i = 0; i < getRowCount(); i++)
        {
            for (unsigned j = 0; j < getColumnCount(); j++)
            {
                summedData[i][j] += matrixData[i][j];   //add the elements together
            }
        }
        //return result Matrix
        return Matrix(summedData);
    }
    else
        throw MatrixArithmeticException(DIFFERENT_DIMENSIONS);
}

Matrix Matrix::operator-(const Matrix& B)
{
    //declare negativeB
    Matrix negativeB = B;
    //negate all entries
    for (unsigned i = 0; i < negativeB.getRowCount(); i++)
    {
        for (unsigned j = 0; j < negativeB.getColumnCount(); j++)
        {
            negativeB.matrixData[i][j] = 0-negativeB.matrixData[i][j];
        }
    }
    //simply add the negativeB
    try
    {
        return ((*this)+negativeB);
    }
    catch (MatrixArithmeticException& mistake)
    {
        //should exit or do something similar
        std::cout << mistake.what() << std::endl;
    }
}

Matrix Matrix::operator*(const Matrix& B)
{
    //the columnCount of the left operand must be equal to the rowCount of the right operand
    if (getColumnCount() == B.getRowCount())
    {
        //if it is, declare data with getRowCount() rows and B.getColumnCount() columns
        std::vector<long double> zeroVector(B.getColumnCount(), 0);
        std::vector<std::vector<long double> > data(getRowCount(), zeroVector);
        for (unsigned i = 0; i < getRowCount(); i++)
        {
            for (unsigned j = 0; j < B.getColumnCount(); j++)
            {
                long double sum = 0; //set sum to zero
                for (unsigned k = 0; k < getColumnCount(); k++)
                {
                    //add the product of matrixData[i][k] and B.matrixData[k][j] to sum
                    sum += (matrixData[i][k]*B.matrixData[k][j]);
                }
                data[i][j] = sum;   //assign the sum to data
            }
        }
        return Matrix(data);
    }
    else
    {
        throw MatrixArithmeticException(ROW_COLUMN_MISMATCH); //dimension mismatch
    }
}

std::ostream& operator<<(std::ostream& outputStream, const Matrix& theMatrix)
{
    //Here, you should use the << again, just like you would for ANYTHING ELSE.
    //first, print a newline
    outputStream << "\n";
    //setting precision (optional)
    outputStream.precision(11);
    for (unsigned i = 0; i < theMatrix.getRowCount(); i++)
    {
        //print '['
        outputStream << "[";
        //format stream(optional)
        for (unsigned j = 0; j < theMatrix.getColumnCount(); j++)
        {
            //print numbers
            outputStream << std::setw(17) << theMatrix.matrixData[i][j];
            //print ", "
            if (j < theMatrix.getColumnCount() - 1)
                outputStream << ", ";
        }
        //print ']'
        outputStream << "]\n";
    }
    return outputStream;
}
6

There are 6 answers

4
leemes On BEST ANSWER

You computed two numbers x and y which are of a limited precision floating point type. This means that they are already rounded somehow, meaning loss of precision while computing the result. If you subtract those numbers afterwards, you compute the difference between those two already rounded numbers.

The formula you wrote gives you the maximum error for computing the difference, but this error is with regard to the stored intermediate results x and y (again: rounded). No method other than x-y will give you a "better" result (in terms of the complete computation, not only the difference). To put it in a nutshell: the difference can't be more accurate using any formula other than x-y.

I'd suggest taking a look at arbitrary precision arithmetic math libraries like GMP or Eigen. Use such libraries for computing your equation system. Don't use double for the matrix computations. This way you can make sure that the intermediate results x and y (or the matrices Ax and B) are as precise as you want them to be, for example 512 bits, which should definitely be enough for most cases.

2
V-X On

replace equality by a check for difference bigger than some given epsilon (a constant with meaning as minimal distinguishable difference).

1
Silas On

You cannot expect to get infinite precision with floating point numbers. You should consider what precision is needed, and then choose the simplest method that satisfies your needs. So if you get the same result then stick with normal subtraction and use an epsilon as suggested in V-X's answer.

How do you end up with a O(n^2) complexity for the conjugate method? You have a fixed set of operations, two additions, one subtraction and one division. Assuming all three operations are O(1) then you have get O(n) for applying it to n numbers.

1
David Heffernan On

Finite precision floating point data types cannot represent all possible real values. There are an infinite number of different values, and so it is easy to see that not all values are representable in a type of finite size.

So it's perfectly plausible that your true solution will be a non-representable value. No amount of trickery can get you an exact solution in the finite data type.

You need to re-calibrate your expectations to match the reality of finite precision floating point data types. The starting point is What Every Computer Scientist Should Know About Floating-Point Arithmetic.

2
Riot On

While this may not help you choose a method, a while ago I wrote a tool that may help you choose a precision based on the sorts of values you're expecting:

http://riot.so/floatprecision.html

As other answers have said, you can't expect to get infinite precision with floating point, but you can use tools such as this to obtain the minimum increment and decrement size of a given number, and work out what is the optimal precision to use to get the accuracy you need.

4
Mike Warren On

To all the people answering the question: I knew, and figured out by accident, that the cardinality of the set of all possible doubles was finite. I suppose I have no choice but to either try a higher-precision number, or create my own class that represents a HugeDecimal.