scikit-learn linear regression K fold cross validation


I want to run linear regression with K-fold cross-validation using the sklearn library on my training data to obtain the best regression model. I then plan to use the predictor with the lowest mean error on my test set.

For example, the piece of code below gives me an array of 20 results with different negative mean absolute errors. I am interested in finding the predictor that gives the least error and then using that predictor on my test set.

sklearn.model_selection.cross_val_score(LinearRegression(), trainx, trainy, scoring='neg_mean_absolute_error', cv=20)
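For reference, a minimal runnable sketch of this setup, with `make_regression` used as a synthetic stand-in for `trainx` / `trainy`:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for trainx / trainy
trainx, trainy = make_regression(n_samples=200, n_features=5,
                                 noise=1.0, random_state=0)

# Returns one negative MAE per fold: an array of 20 scores
scores = cross_val_score(LinearRegression(), trainx, trainy,
                         scoring='neg_mean_absolute_error', cv=20)
print(scores.shape)   # (20,)
print(scores.mean())  # average negative MAE across the folds
```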


1 Answer

Sergey Bushmanov (best answer):

There is no such thing as a "predictor which gives me this (least) error" in cross_val_score: all 20 estimators fitted in

sklearn.model_selection.cross_val_score(LinearRegression(), trainx, trainy, scoring='neg_mean_absolute_error', cv=20)

are clones of the same unfitted LinearRegression. The 20 scores differ only because each fold validates on different data, not because the models themselves differ.
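That said, if the goal is to inspect the model fitted on each fold, cross_validate (as opposed to cross_val_score) can return the per-fold estimators via return_estimator=True. A minimal sketch on synthetic data; note that picking the fold with the best score mostly reflects which validation fold happened to be easiest, which is exactly the caveat above:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

X, y = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=0)

cv_results = cross_validate(LinearRegression(), X, y,
                            scoring='neg_mean_absolute_error',
                            cv=20, return_estimator=True)

# One fitted LinearRegression per fold
fold_models = cv_results['estimator']

# Highest (least negative) MAE across folds
best_fold = np.argmax(cv_results['test_score'])
best_fold_model = fold_models[best_fold]
```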

You may wish to check GridSearchCV, which will indeed search through different sets of hyperparameters and return the best estimator:

from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

X, y = datasets.make_regression()
lr_model = LinearRegression()
# 'normalize' was removed from LinearRegression in scikit-learn 1.2;
# 'fit_intercept' is searched over here instead
parameters = {'fit_intercept': [True, False]}
clf = GridSearchCV(lr_model, parameters, refit=True, cv=5)
best_model = clf.fit(X, y)

Note the refit=True parameter, which ensures the best model is refit on the whole dataset and returned.
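Once fit, the returned search object exposes the winning hyperparameters and can score or predict directly on held-out data. A sketch of that final step; the train/test split and the parameter grid here are illustrative, not part of the original question:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=0)
trainx, testx, trainy, testy = train_test_split(X, y, test_size=0.25,
                                                random_state=0)

clf = GridSearchCV(LinearRegression(),
                   {'fit_intercept': [True, False]},  # illustrative grid
                   scoring='neg_mean_absolute_error',
                   refit=True, cv=5)
clf.fit(trainx, trainy)

print(clf.best_params_)               # winning hyperparameter combination
test_score = clf.score(testx, testy)  # best model, evaluated on the test set
```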