I am trying to fit a multiple linear regression model in two different ways. The first is the simple approach shown below:

from sklearn import linear_model, metrics
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X = df[['geo','age','v_age']]
y = df['freq']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Fitting model
regr2 = linear_model.LinearRegression()
regr2.fit(X_train, y_train)
ypred = regr2.predict(X_test)  # predictions on the held-out test set
print(metrics.mean_squared_error(y_test, ypred))
print(r2_score(y_test, ypred))

The above code gives me an MSE of 0.46 and an R2 score of 0.0012, which is a really bad fit. Meanwhile, when I use:

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=1) #Degree = 1 should give the same equation as above code block
X_ = poly.fit_transform(X)
y = y.values.reshape(-1, 1)
predict_ = poly.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X_, predict_, test_size=0.33, random_state=42)

# Fitting model
regr2 = linear_model.LinearRegression()
regr2.fit(X_train, y_train)
ypred = regr2.predict(X_test)  # predictions on the held-out test set
print(metrics.mean_squared_error(y_test, ypred))
print(r2_score(y_test, ypred))

Using PolynomialFeatures gives me an MSE of 0.23 and an R2 score of 0.5, which is much better. I don't understand how two methods that use the same regression equation can give such different results. Everything else is the same.
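One likely explanation, sketched here under assumptions: with its default include_bias=True, PolynomialFeatures(degree=1) prepends a column of ones, so applying it to the target turns y into two columns, [1, y]. LinearRegression then fits two outputs, and the metrics average over both. The constant column is trivial to predict, which would halve the MSE and pull the averaged R2 toward 0.5. The data below is synthetic (the original df is not available), and fit_and_score is a hypothetical helper introduced only for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for df (assumption): three features, weak signal, lots of noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = 0.05 * X[:, 0] + rng.normal(size=1000)


def fit_and_score(features, target):
    """Hypothetical helper: split, fit LinearRegression, return (MSE, R2) on the test set."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, target, test_size=0.33, random_state=42
    )
    pred = LinearRegression().fit(X_tr, y_tr).predict(X_te)
    return mean_squared_error(y_te, pred), r2_score(y_te, pred)


poly = PolynomialFeatures(degree=1)
X_poly = poly.fit_transform(X)

# Transforming the target as well adds a bias column: y becomes [1, y].
y_poly = poly.fit_transform(y.reshape(-1, 1))
print(y_poly.shape)

mse_plain, r2_plain = fit_and_score(X, y)            # first method in the question
mse_bug, r2_bug = fit_and_score(X_poly, y_poly)      # second method, target transformed
mse_fixed, r2_fixed = fit_and_score(X_poly, y)       # second method, target left alone

print("plain:", mse_plain, r2_plain)
print("target transformed:", mse_bug, r2_bug)
print("target untouched:", mse_fixed, r2_fixed)
```

In this sketch the MSE with the transformed target comes out at half the plain MSE, because the default multi-output averaging includes the perfectly predicted constant column. That matches the pattern in the question (0.46 versus 0.23, and an R2 near the average of 0.0012 and 1.0). Leaving y untransformed should bring the two methods back into agreement.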
