Determining regression coefficients for data - MATLAB


I am working on a project involving scientific computing. After some experiments, I obtained the following values for three variables x, y and z:

x       y       z
9.98    7.9     7.1
8.3     7.5     5.6
8.0     7.4     5.9
7       6.09    5.8
1       0.9    -1.8
12.87   11.23   10.8

There is also an equation with three unknowns, a, b and c:

x=(a+0.98)/y+(b+0.7)/z+c

How do I get the values of a, b and c from this data? Is this possible in MATLAB?

1 Answer

rayryeng (accepted answer):

This sounds like a regression problem. Assuming that the unexplained measurement errors are Gaussian distributed, you can find the parameters via least squares. Basically, you rewrite the equation into the form ma + nb + oc = p; with one such equation per data point, that gives you 6 equations in the 3 unknowns (a, b, c), which can then be solved in the least-squares sense. With some algebra, we get:

za + yb + yzc = xyz - 0.98z - 0.7z

As such, m = z, n = y, o = yz, p = xyz - 0.98z - 0.7z. I'll leave it to you as an exercise to verify the algebra. You can then form the matrix equation:

Ax = d
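To make this concrete, the first data point (x, y, z) = (9.98, 7.9, 7.1) contributes the row

[m n o] = [z y yz] = [7.1 7.9 56.09]

to A, and the entry

p = xyz - 0.98z - 0.7z = 559.7782 - 6.958 - 4.97 = 547.8502

to d.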

Stacking all six data points this way gives 6 equations, and we want to solve for the vector x = [a b c]^{T} (not to be confused with the data variable x). To do so, you can employ what is known as the pseudoinverse, which retrieves the parameters that minimize the squared error between the true output and the output generated by those parameters on the same input data.

In other words:

x = A^{+}d

Here, A^{+} is the pseudoinverse of the matrix A, and it is matrix-vector multiplied with the vector d.
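For illustration, this is what the pseudoinverse route looks like in MATLAB, as a sketch that assumes A and d have been built as shown below (pinv is the built-in Moore-Penrose pseudoinverse):

%// Solve via the pseudoinverse; for full-rank A this gives
%// the same least-squares solution as the backslash operator
params = pinv(A)*d;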

To put this into code, we define our input data, form the A matrix and d vector where each row shared between them is one equation, and then solve for the parameters. You can use the mldivide (\) operator to do the job, as it solves the least-squares problem directly:

%// Define x y and z
x = [9.98; 8.3; 8.0; 7; 1; 12.87];
y = [7.9; 7.5; 7.4; 6.09; 0.9; 11.23];
z = [7.1; 5.6; 5.9; 5.8; -1.8; 10.8];

%// Define A matrix
A = [z y y.*z];
%// Define d vector
d = x.*y.*z - 0.98*z - 0.7*z;

%// Find parameters via least-squares
params = A\d;

params stores the parameters a, b and c, and we get:

params =

  -37.7383
  -37.4008
   19.5625

If you want to double-check how close the values are, you can plug the parameters back into the expression from your post and compare against each value in x:

a = params(1); b = params(2); c = params(3);
out = (a+0.98)./y+(b+0.7)./z+c;
disp([x out])

9.9800    9.7404
8.3000    8.1077
8.0000    8.3747
7.0000    7.1989
1.0000   -0.8908
12.8700   12.8910

You can see that the fit is not exact, but these parameters are the best possible in a least-squares error sense (note that the error is minimized for the rearranged linear equation, not directly on x).


Bonus - Fitting with RANSAC

You can see that some of the predicted values (right column in the output) are further off than others. This is because we used all of the points in your data to estimate the model. One technique used to reduce error and increase the robustness of the model estimation is RANSAC, or RANdom SAmple Consensus. The basic methodology is: for a certain number of iterations, randomly sample the smallest number of points necessary to fit a model, fit that model, then measure the overall error it produces if you use those parameters to describe the full data set. You keep randomly choosing points, fitting models and measuring errors, and the iteration that produces the least error supplies the parameters you keep for the overall model.

One error measure we can use is the sum of absolute differences between the true x points and the predicted x points. There are many other measures, such as the sum of squared errors, but let's stick with something simple for now (see the sketch below). Looking at the formulation above, we need a minimum of three equations to determine a, b and c, so for each iteration we randomly select three points (without replacement, I might add), fit the model, determine the error, and keep iterating, retaining the parameters that produce the least error.
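In MATLAB, the two error measures mentioned above would look like this (a sketch, where x holds the true values and out the predictions as in the verification step earlier; err_sq is just an illustrative name):

%// Sum of absolute differences (the measure used in the loop below)
err = sum(abs(x - out));
%// Sum of squared errors would be an alternative measure
err_sq = sum((x - out).^2);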

Therefore, you could write a RANSAC algorithm like so:

%// Define cost and number of iterations
cost = Inf;
iterations = 50;

%// Set seed for reproducibility
rng(123);

%// Define x y and z
x = [9.98; 8.3; 8.0; 7; 1; 12.87];
y = [7.9; 7.5; 7.4; 6.09; 0.9; 11.23];
z = [7.1; 5.6; 5.9; 5.8; -1.8; 10.8];

for idx = 1 : iterations
    %// Determine where we would need to sample
    ind = randperm(numel(x), 3);

    xs = x(ind); ys = y(ind); zs = z(ind); %// Sample

    %// Define A matrix
    A = [zs ys ys.*zs];
    %// Define d vector
    d = xs.*ys.*zs - 0.98*zs - 0.7*zs;

    %// Find parameters via least-squares
    params = A\d;

    %// Determine error
    a = params(1); b = params(2); c = params(3);
    out = (a+0.98)./y+(b+0.7)./z+c;
    err = sum(abs(x - out));

    %// If error produced is less than current error
    %// then save parameters
    if err < cost
        cost = err;
        final_params = params;
    end
end

When I run the above code, I get for my parameters:

final_params =

  -38.1519
  -39.1988
   19.7472

Comparing this with our x, we get:

a = final_params(1); b = final_params(2); c = final_params(3);
out = (a+0.98)./y+(b+0.7)./z+c;
disp([x out])

9.9800    9.6196
8.3000    7.9162
8.0000    8.1988
7.0000    7.0057
1.0000   -0.1667
12.8700   12.8725

As you can see, most of the values are improved, especially the fourth and sixth points. Compare this with the previous version:

9.9800    9.7404
8.3000    8.1077
8.0000    8.3747
7.0000    7.1989
1.0000   -0.8908
12.8700   12.8910

You can see that the first and second values are slightly worse than in the previous version, but the other numbers are much closer to the true values.
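To put a number on the comparison, here is a quick sketch that totals the absolute errors of both fits, using the predicted columns printed above (out_ls and out_ransac are just illustrative names, with x defined as before):

%// Predictions copied from the two output tables above
out_ls     = [9.7404; 8.1077; 8.3747; 7.1989; -0.8908; 12.8910];
out_ransac = [9.6196; 7.9162; 8.1988; 7.0057; -0.1667; 12.8725];
sum(abs(x - out_ls))     %// about 2.9173
sum(abs(x - out_ransac)) %// about 2.1179

So the RANSAC fit roughly cuts the total absolute error from about 2.92 down to about 2.12, even though a couple of individual points get slightly worse.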


Have fun!