Linear regression with near singular matrix inversion


I have a regression problem: estimate the slope a of y = a*x + b. I tried two different methods to estimate a. Method 1 takes the mean of each of the two data clusters as two points and computes a from those points. Method 2 uses the standard regression (normal-equation) formula.

import numpy as np
import statistics

# find the slope a of y = a*x + b
x = "28.693756 28.850006 28.662506 28.693756 28.756256 28.662506 28.787506 \
    28.818756 28.818756 28.787506 28.787506 28.787506 28.693756 28.787506 \
    28.818756 28.725006 28.725006 28.850006 28.756256 28.725006 28.881256 \
    28.818756 28.756256 28.693756 28.756256 28.787506 28.693756 28.662506 \
    28.662506 28.787506 28.850006 28.756256 28.725006 28.818756 28.600006 \
    28.725006 28.725006 28.850006 28.881256 28.881256 28.818756 28.756256 \
    28.756256 28.787506 28.787506 28.787506 28.756256 28.787506 28.725006 \
    28.725006 28.725006 28.756256 28.818756 28.756256 28.693756 28.818756 \
    28.756256 28.756256 28.693756 28.850006 28.631256 28.693756 28.693756 \
    28.850006 28.756256 28.725006 28.693756 28.756256 28.850006 28.787506 \
    28.600006 28.631256"
x = [float(t) for t in x.split()]
y = [33.8]*36 + [38.7]*36

print(" ")
print("Method 1 ")
x1, x2 = statistics.mean(x[:36]), statistics.mean(x[36:])
y1, y2 = statistics.mean(y[:36]), statistics.mean(y[36:])
slope = (y1-y2)/(x1-x2)
print(f"a = {slope}")

print(" ")
print('Method 2')
x = np.array(x)
y = np.array(y)
X = np.c_[np.ones(x.shape), x]  # design matrix with intercept column: [1, x]

# normal equations: beta = (X'X)^-1 X' y
XXinv = np.linalg.inv(X.transpose().dot(X)).dot(X.transpose())
_beta = XXinv.dot(y)
iv = np.linalg.inv(X.transpose().dot(X)).tolist()
print(f"a = {_beta[1]}")

# check the conditioning of X'X via its singular values
xx = X.transpose().dot(X)
svd = np.linalg.svd(xx)[1]
print(f"SVD(XX) = {svd}")

Results of the code are:

Method 1
a = 1128.9599999997959

Method 2
a = 1.2136744782028899
SVD(XX) = [5.96125150e+04 3.80959618e-04]

From the data plots, the fitted line should be close to vertical, so the Method 1 result makes more sense than Method 2. Even the shallowest line through the data (shown in the figure) has a slope of 17.5. In normal cases Method 2 works well, but here it gives a slope of only 1.21, which doesn't make sense.

[figure: scatter plot of the data points with the shallowest line through them]

The only cause I can think of is the near singularity shown by the singular values. But why does this happen, and is there a fix?

1 Answer

David M. (Best Answer)

Your system of linear equations is overdetermined (there are more equations than unknowns), so there is no exact solution. The Method 2 solution is a "best fit" that minimises the squared errors between predictions and actual values.
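As a minimal sketch (my addition, assuming the x and y lists built in the question's code): the same least-squares solution can be obtained with np.linalg.lstsq, which works directly on the design matrix via an SVD and avoids forming and inverting the near-singular X'X. It should reproduce essentially the same slope of about 1.21, which suggests the small slope is the genuine least-squares answer rather than a numerical artifact.

import numpy as np

# assumes x and y are the lists built in the question's code
X = np.c_[np.ones(len(x)), np.asarray(x, dtype=float)]   # design matrix [1, x]
beta, residuals, rank, sing_vals = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)

print(f"intercept = {beta[0]}, slope = {beta[1]}")   # slope should again come out ~1.21
print(f"singular values of X = {sing_vals}")

Note that lstsq also reports the singular values of X itself; their squares are the singular values of X'X printed in the question.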

A line obtained with Method 1 visually appears to be a better fit but, mathematically speaking, does not minimise the squared errors. The reason is that some points (e.g. (28.600006, 38.7)) lie very far from that line's prediction, and those errors, once squared, dominate the sum of squared errors (SSE) that regression minimises.

Conversely, by fitting a line "in the middle" with a slope of 1.21367, regression avoids very large errors and produces mid-sized errors which, when squared, give a smaller SSE. From a visual perspective, though, the resulting line does not appear to fit the data points as well as Method 1's, as the sketch below illustrates.
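To make the comparison concrete, here is a rough sketch (assuming the x, y, slope and _beta variables from the question's code) that computes the SSE of both lines; Method 1's line passes through the two cluster-mean points, while Method 2's line is the regression fit:

import numpy as np

# assumes x, y (lists), slope (Method 1) and _beta (Method 2) from the question's code
x_arr = np.asarray(x, dtype=float)
y_arr = np.asarray(y, dtype=float)

# Method 1: the line through the two cluster means, anchored at the first cluster's mean point
x1, y1 = x_arr[:36].mean(), y_arr[:36].mean()
pred1 = y1 + slope * (x_arr - x1)

# Method 2: the normal-equation fit
pred2 = _beta[0] + _beta[1] * x_arr

print("SSE, Method 1:", np.sum((y_arr - pred1) ** 2))
print("SSE, Method 2:", np.sum((y_arr - pred2) ** 2))
# the regression line (Method 2) should give the smaller SSE,
# even though it looks like a worse fit to the eye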