Validation score not matching predicted score on XGBoost script

I've been learning how to use scikit-learn by working through the Santander Customer Satisfaction competition on Kaggle:

https://www.kaggle.com/c/santander-customer-satisfaction

I've run a grid search to tune the parameters of an XGBoost model and get a cross-validated roc_auc score of 0.83. But when I test the winning model against the hold-out set, the model seems to have no predictive power at all and gives a score of 0.50. I must be making an error in my script, but I can't find what's going wrong and can't work out where to look.

My training script is as follows:

import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from scipy.stats import randint, uniform
from sklearn.metrics import roc_auc_score


# reproducibility
seed = 342
np.random.seed(seed)

train_data = pd.read_csv("./train.csv")
test_data = pd.read_csv("./test.csv")

array = train_data.values

# Split-out validation dataset
X = array[:,0:369].astype(float)
Y = array[:,370].astype(int)
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=validation_size, random_state=seed)

params_grid = {
    'max_depth': [1, 2, 3],
    'n_estimators': [5, 10, 25, 50],
    'learning_rate': np.linspace(1e-16, 1, 3)
}
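# note: np.linspace(1e-16, 1, 3) gives learning rates of roughly 1e-16, 0.5 and 1,
# so this grid covers 3 * 4 * 3 = 36 parameter combinations in total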

# params fixed
params_fixed = {
    'objective': 'binary:logistic',
    'silent': 1
}

# grid search
grid_search = GridSearchCV(
    estimator=XGBClassifier(seed=seed, nthread=-1, **params_fixed),
    param_grid=params_grid,
    cv=10,
    verbose=1,
    scoring='roc_auc'
)

grid_search.fit(X_train, Y_train)

print grid_search.grid_scores_
print grid_search.best_score_
print grid_search.best_estimator_

This gives the following output (I've omitted the long list of models):

0.83303461644
XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.5, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=25, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=7, silent=True, subsample=1)
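
As an aside (not something I ran as part of the output above), I believe the winning model could also be scored against the hold-out split directly from the fitted search object, rather than copying the printed parameters into a second script. A sketch, using the same predict()/roc_auc_score approach as the script below:

# sketch only: GridSearchCV refits the best parameters on all of X_train by default,
# so the best estimator can be scored on the hold-out split in the same script
best_model = grid_search.best_estimator_
holdout_predictions = best_model.predict(X_validation)
print roc_auc_score(Y_validation, holdout_predictions)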

Here's the script used to calculate the score on the hold-out data:

import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from scipy.stats import randint, uniform
from sklearn.metrics import roc_auc_score


# reproducibility
seed = 342
np.random.seed(seed)

train_data = pd.read_csv("./train.csv")
test_data = pd.read_csv("./test.csv")

array = train_data.values

# Split-out validation dataset
X = array[:,0:369].astype(float)
Y = array[:,370].astype(int)
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=validation_size, random_state=seed)

watchlist  = [(X_train, Y_train), (X_validation, Y_validation)]

model = XGBClassifier(
        base_score=0.5, 
        colsample_bylevel=1, 
        colsample_bytree=1,
        gamma=0, 
        learning_rate=0.5, 
        max_delta_step=0, 
        max_depth=3,
        min_child_weight=1, 
        missing=None, 
        n_estimators=25, 
        nthread=-1,
        objective='binary:logistic', 
        reg_alpha=0, 
        reg_lambda=1,
        scale_pos_weight=1, 
        seed=7, 
        silent=True, 
        subsample=1
)

model.fit(X_train, Y_train, eval_metric="auc", eval_set=watchlist, verbose=True)

predictions = model.predict(X_validation)

print roc_auc_score(Y_validation, predictions)

This outputs 0.503777444213, which is contrary to my expectations: I expected the hold-out score to be somewhat lower than 0.83, but still reasonably close to it.
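
As a quick sanity check (a sketch I haven't folded into the script above), the distribution of the hold-out predictions and labels could be inspected like this:

# sketch: count how many of each class appear in the predictions and in the actual labels
print np.bincount(predictions.astype(int))
print np.bincount(Y_validation)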

Can anyone spot where I'm going wrong?

Update following suggestion to plot learning curves

Plotting the learning curves (assuming I've interpreted correctly) produces the following chart:

[Chart: train & test learning curves]

The values come from the watchlist defined above; I've edited the code above to show where I added it. The plotting code is included at the end of this update.

From what I can tell, this suggests that over-fitting isn't to blame. I suspect the error is in the way I calculate the validation score in the first place, but that's just a feeling; I don't yet fully understand what I'm working with.

For completeness, here's what I used to plot the learning curves, for better or worse. I adapted it from example code, and I've tidied up the labels that didn't apply:

from matplotlib import pyplot

# evaluation history recorded by the watchlist during model.fit() above
results = model.evals_result()
epochs = len(results['validation_0']['auc'])
x_axis = range(0, epochs)

# plot train vs. hold-out AUC per boosting round
fig, ax = pyplot.subplots()
ax.plot(x_axis, results['validation_0']['auc'], label='Train')
ax.plot(x_axis, results['validation_1']['auc'], label='Test')
ax.legend()
pyplot.ylabel('AUC')
pyplot.title('XGBoost AUC')
pyplot.show()