I'm trying to use linear regression to fit a polynomial to a set of points from a sinusoidal signal with some noise added, using linear_model.LinearRegression from sklearn.
As expected, the training and validation scores increase as the degree of the polynomial increases, but after some degree around 20 things start getting weird: the scores start going down, and the model returns polynomials that don't look at all like the data I used to train it.
Below are some plots where this can be seen, as well as the code that generated both the regression models and the plots:
Up to degree=17 it works well. Original data vs. predictions:
After that it just gets worse:
Validation curve, increasing the degree of the polynomial:
from sklearn.pipeline import make_pipeline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import validation_curve  # sklearn.learning_curve was removed in newer scikit-learn versions
def make_data(N, err=0.1, rseed=1):
    rng = np.random.RandomState(rseed)
    x = 10 * rng.rand(N)
    X = x[:, None]
    # note: some noise is added here and more is added below when err > 0
    y = np.sin(x) + 0.1 * rng.randn(N)
    if err > 0:
        y += err * rng.randn(N)
    return X, y

def PolynomialRegression(degree=4):
    return make_pipeline(PolynomialFeatures(degree),
                         LinearRegression())
X, y = make_data(400)
X_test = np.linspace(0, 10, 500)[:, None]
degrees = np.arange(0, 40)
plt.figure(figsize=(16, 8))
plt.scatter(X.flatten(), y)
for degree in degrees:
    y_test = PolynomialRegression(degree).fit(X, y).predict(X_test)
    plt.plot(X_test, y_test, label='degree={0}'.format(degree))
plt.title('Original data VS predicted values for different degrees')
plt.legend(loc='best');
degree = np.arange(0, 40)
train_score, val_score = validation_curve(PolynomialRegression(), X, y,
                                          param_name='polynomialfeatures__degree',
                                          param_range=degree, cv=7)
plt.figure(figsize=(12, 6))
plt.plot(degree, np.median(train_score, 1), marker='o',
color='blue', label='training score')
plt.plot(degree, np.median(val_score, 1), marker='o',
color='red', label='validation score')
plt.legend(loc='best')
plt.ylim(0, 1)
plt.title('Validation curve, increasing the degree of the polynomial')
plt.xlabel('degree')
plt.ylabel('score');
I know the expected thing is that the validation score goes down when the complexity of the model increases, but why does the training score go down as well? What am I missing here?
The training score is expected to go down as well once the model overfits the training data. The validation error goes down because of the sine function's Taylor series expansion: as you increase the degree of the polynomial, the model gets better at fitting the sine curve.
In an ideal scenario, where the target is not a function whose series expansion has infinitely many terms, you would see the training error go down (not monotonically, but in general) and the validation error go up after some degree (high for lower degrees -> low for some higher degree -> increasing after that).
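To make the Taylor-series point concrete, here is a minimal sketch (not part of the original answer, and assuming the same [0, 10] interval as the question) that measures how far a truncated Taylor expansion of sin(x) strays from the true function; the approximation only becomes accurate across the whole interval at fairly high degree, which is consistent with the scores in the question still improving at degrees where one might otherwise already expect trouble.

import numpy as np
from math import factorial

x = np.linspace(0, 10, 500)
true = np.sin(x)

for degree in (5, 11, 17, 25, 31):
    # truncated Taylor series: sin(x) = x - x^3/3! + x^5/5! - ...
    # keep the odd-power terms up to the requested degree
    approx = sum((-1) ** k * x ** (2 * k + 1) / factorial(2 * k + 1)
                 for k in range((degree - 1) // 2 + 1))
    max_err = np.max(np.abs(true - approx))
    print('degree {0:2d}: max |sin(x) - Taylor(x)| = {1:.3g}'.format(degree, max_err))

Note that the degree here is that of the Taylor truncation, not of a least-squares fit (which would do somewhat better over the interval); the point is only that sin(x) over a ten-unit-wide range genuinely needs a high-degree polynomial to be approximated well.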