I'm trying to get a generalized model for the Mercedes Greener Manufacturing dataset using an XGBoost regressor. I have used a loop over seeds from 1 to 100 for the train/test split so as to get better sampling, and I have used PCA to reduce the dimensions to 8.
How can I fine-tune XGBoost so that I don't end up with an overfitting model?
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

for i in range(100):
    X_train, X_test, y_train, y_test = train_test_split(X_pc,
                                                         y,
                                                         test_size=0.2,
                                                         random_state=i)
    model = XGBRegressor()
    model.fit(X_train, y_train)
    train = model.score(X_train, y_train)
    test = model.score(X_test, y_test)
    print("TEST:", test, "TRAIN:", train, "RS:", i)
Output:
TEST: 0.28278595203767265 TRAIN: 0.9041892366322192 RS: 0
TEST: 0.3803514386218507 TRAIN: 0.9099759411069458 RS: 1
TEST: 0.3357132066270113 TRAIN: 0.9113739827130357 RS: 2
TEST: 0.3003256802391573 TRAIN: 0.901560899846001 RS: 3
TEST: 0.3769044561739856 TRAIN: 0.9034886060173257 RS: 4
TEST: 0.3449160536081909 TRAIN: 0.9092295020552124 RS: 5
TEST: 0.43083817087609166 TRAIN: 0.8957931397175393 RS: 6
TEST: 0.27375366705147564 TRAIN: 0.912349291318306 RS: 7
TEST: 0.39315883169376264 TRAIN: 0.9090768492254802 RS: 8
TEST: 0.38714220182913905 TRAIN: 0.9089864030990132 RS: 9
TEST: 0.37089065589124093 TRAIN: 0.9099379400411342 RS: 10
TEST: 0.3785854487827084 TRAIN: 0.9080405667805768 RS: 11
TEST: 0.29249852154319345 TRAIN: 0.9057747080596891 RS: 12
TEST: 0.34881642748048425 TRAIN: 0.9077565004654295 RS: 13
The random_state argument is for ensuring reproducibility of the splits, so that someone else running your experiments can recreate your results.

There are a number of ways to effectively train a model and reduce the chances of overfitting. One such strategy is to use k-fold cross-validation along with a grid search to determine the best hyperparameters for your model. Here's how that would look with your model.
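A minimal sketch using scikit-learn's GridSearchCV; the parameter grid below is illustrative, not tuned for this dataset, so adjust the values and ranges to your problem:

from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# Illustrative hyperparameter grid (assumed values, not tuned)
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1],
    "subsample": [0.8, 1.0],
}

grid_search = GridSearchCV(
    estimator=XGBRegressor(),
    param_grid=param_grid,
    scoring="r2",   # same metric as model.score() for a regressor
    cv=5,           # 5-fold cross-validation
    n_jobs=-1,
    verbose=1,
)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)
print(grid_search.best_score_)

Because each candidate is scored on held-out folds rather than the training data, the cross-validated score gives a much more honest picture of generalization than the TRAIN scores in your output.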
To retrieve the resulting best estimator, i.e. the version of the XGBRegressor that performed best during training, you can do the following:
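For example (GridSearchCV refits the best estimator on the full training set by default):

best_model = grid_search.best_estimator_

# Evaluate the refit best estimator on the held-out test split
print("TEST:", best_model.score(X_test, y_test))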