XGBoost Classifier overfitting


I trained an XGBClassifier on a few hundred samples, about 90% of which belong to one class. The train and test sets have the same 9:1 class distribution, but the model is overfitting:

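import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from xgboost import XGBClassifier

# Scale data (scale_data is a helper defined elsewhere)
X_train_scaled, X_test_scaled = scale_data(X_train, X_test)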

# Eval metrics and early stopping
XGBC = XGBClassifier(random_state = 42, eval_metric = ['auc', 'logloss'],
                      early_stopping_rounds = 5)

# A parameter grid for XGBoost
params = { 
    'learning_rate': [ 0.1, 0.02, 0.03, 0.04], 
    'n_estimators': [300, 400, 500],
    'max_depth': [2, 3, 6, 8],
    'random_state': [42], 
    'scale_pos_weight': [9.0], # total negative/positive classes
    'lambda': [1, 5, 7],
    'alpha':[0, 5, 7],
    'max_delta_step': [0, 1, 3, 5],
    'gamma': [0, 2, 4, 5 ],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree':[0.4, 0.5],
    'colsample_bylevel':[0.4, 0.5], 
    'colsample_bynode':[0.4, 0.5],
    'min_child_weight': [3, 5, 7]
}

# Parameters to fit
fit_parms = {'eval_set': [(X_train_scaled, y_train), (X_test_scaled, y_test)],
             'verbose': False}

# Stratified cv
skf = StratifiedKFold(5)

# Call grid search cv
grid = GridSearchCV(
    estimator=XGBC,
    param_grid=params,
    scoring='roc_auc',
    n_jobs=32,
    cv=skf,
    verbose=1,
    refit=True
)

# Fit model
grid.fit(X_train_scaled, y_train, **fit_parms)


# Get the best estimator
best_model = grid.best_estimator_

# Get the evaluation results
best_evals_result = best_model.evals_result()

metrics = ['auc', 'logloss']
ylabs = ['AUC', 'log loss']
titles = ['XGBoost AUC', 'XGBoost log loss']
plt.figure(figsize=(8,3), dpi= 300)
for i in range(len(metrics)):
    plt.subplot(1, 2, i+1)
    basic_plot(yvals=best_evals_result, metric = metrics[i],
               xlab = 'Iterations', ylab = ylabs[i], title = titles[i])
    
plt.tight_layout()

[Figure: two panels, "XGBoost AUC" and "XGBoost log loss", each plotted against iterations for the train and test sets]
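To put a number on the gap seen in the plot, the per-iteration difference between the train and test metric can be read straight from `evals_result()`. This is a minimal sketch, assuming the usual layout in which XGBoost names the eval sets `validation_0` and `validation_1` in the order they were passed to `eval_set`. The helper name `train_test_gap` and the toy values are hypothetical and are not results from this model.

```python
def train_test_gap(evals_result, metric,
                   train_key="validation_0", test_key="validation_1"):
    """Return the per-iteration difference (train - test) for one metric.

    evals_result is the nested dict returned by model.evals_result():
    {eval_set_name: {metric_name: [value_per_iteration, ...]}}.
    """
    train = evals_result[train_key][metric]
    test = evals_result[test_key][metric]
    return [tr - te for tr, te in zip(train, test)]


# Toy values for illustration only (NOT taken from the model in the question).
toy_evals = {
    "validation_0": {"auc": [0.80, 0.90, 0.97]},  # train set
    "validation_1": {"auc": [0.75, 0.78, 0.77]},  # test set
}

gaps = train_test_gap(toy_evals, "auc")
print([round(g, 2) for g in gaps])
```

With the real model this would be called as `train_test_gap(best_model.evals_result(), 'auc')`. A gap that keeps widening while the test curve stays flat is the pattern usually described as overfitting.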

I tried to control overfitting through parameters such as max_depth, scale_pos_weight, max_delta_step, and colsample_bytree, but it is not working. Here are the best parameters from the search above. Any ideas would be very helpful.

{'alpha': 5,
 'colsample_bylevel': 0.4,
 'colsample_bynode': 0.4,
 'colsample_bytree': 0.5,
 'gamma': 2,
 'lambda': 5,
 'learning_rate': 0.1,
 'max_delta_step': 0,
 'max_depth': 3,
 'min_child_weight': 5,
 'n_estimators': 300,
 'random_state': 42,
 'scale_pos_weight': 9.0,
 'subsample': 0.8}
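One thing that may be worth checking, offered as a possibility rather than a diagnosis, is how large the search is relative to a few hundred samples. The sketch below only counts the candidates in the grid from the question and the number of fits implied by 5-fold cross-validation. The helper name `n_candidates` is hypothetical, and `math.prod` assumes Python 3.8 or later.

```python
from math import prod

# Grid copied from the question; only the number of values per key matters here.
params = {
    'learning_rate': [0.1, 0.02, 0.03, 0.04],
    'n_estimators': [300, 400, 500],
    'max_depth': [2, 3, 6, 8],
    'random_state': [42],
    'scale_pos_weight': [9.0],
    'lambda': [1, 5, 7],
    'alpha': [0, 5, 7],
    'max_delta_step': [0, 1, 3, 5],
    'gamma': [0, 2, 4, 5],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.4, 0.5],
    'colsample_bylevel': [0.4, 0.5],
    'colsample_bynode': [0.4, 0.5],
    'min_child_weight': [3, 5, 7],
}


def n_candidates(grid):
    """Number of parameter combinations an exhaustive grid search will try."""
    return prod(len(values) for values in grid.values())


n_folds = 5  # matches StratifiedKFold(5) in the question
candidates = n_candidates(params)
total_fits = candidates * n_folds
print(candidates, total_fits)  # 497664 2488320
```

Picking the best of roughly half a million candidates on a small, imbalanced dataset can itself select a configuration that fits the cross-validation folds by chance, so a much smaller grid or a randomized search may be a fairer comparison.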