Should I use the training or validation set for parameter optimization?


I am training a model using a Decision Tree and performing parameter optimization.

I read that the objective of the validation set is to assess model performance during training and help tune parameters.

With this in mind, shouldn't I be using the validation set in grid_search.fit instead of my training set?

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

param_grid = {
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

clf = DecisionTreeClassifier(random_state=42)
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
print("Best Parameters:", best_params)
print("\n")

#Validation
best_clf = grid_search.best_estimator_
val_accuracy = best_clf.score(X_val, y_val)
print("Validation Accuracy with Best Model:", val_accuracy)
print("\n")

#Test
y_test_pred = best_clf.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)
test_precision = precision_score(y_test, y_test_pred)
test_recall = recall_score(y_test, y_test_pred)
test_f1 = f1_score(y_test, y_test_pred)
print("Decision Tree Measurements on Test Set with Best Model:")
print("Accuracy:", test_accuracy)
print("Precision:", test_precision)
print("Recall:", test_recall)
print("F1 Score:", test_f1)
print("-------------------------------------------------------")

There is 1 answer

Answered by Alexander Kalian

According to the scikit-learn documentation for GridSearchCV(), the data you feed into the function is automatically split into folds and cross-validation is performed. You should therefore provide all of your data except the held-out test set, and not worry about splitting off a separate validation set yourself.

To do this, you may wish to combine the training and validation datasets:

import numpy as np

# Merge the training and validation datasets, for use in the GridSearchCV() function.
X_opt = np.vstack((X_train, X_val))
y_opt = np.hstack((y_train, y_val))

param_grid = {
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

clf = DecisionTreeClassifier(random_state=42)
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')  # This uses 5-fold cross-validation.
grid_search.fit(X_opt, y_opt)  # Fit to the merged datasets.
best_params = grid_search.best_params_
print("Best Parameters:", best_params)
print("\n")

With cv=5, your grid search trains each candidate model on roughly 80% of the fitted data and scores it on the remaining ~20%, which acts as the validation data; this split is then rotated across the five folds. By using the code as modified above, you make full use of the training and validation data you have, while still avoiding optimisation against the testing data.
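To make the fold rotation concrete, here is a minimal sketch of roughly what the grid search does internally for a single parameter combination, assuming X_opt and y_opt are the merged NumPy arrays from above (scikit-learn defaults to stratified folds when you pass an integer cv with a classifier):

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Sketch only: score one parameter combination (e.g. max_depth=5) with 5-fold CV,
# rotating which ~20% of the merged data serves as the validation fold.
skf = StratifiedKFold(n_splits=5)
fold_scores = []
for train_idx, val_idx in skf.split(X_opt, y_opt):
    model = DecisionTreeClassifier(max_depth=5, random_state=42)
    model.fit(X_opt[train_idx], y_opt[train_idx])  # ~80% used for training
    fold_scores.append(model.score(X_opt[val_idx], y_opt[val_idx]))  # ~20% used for validation
print("Mean CV accuracy for this parameter combination:", np.mean(fold_scores))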

You are correct in your understanding that optimisation is ideally scored against validation data; the fitting itself, however, must always happen on training data. GridSearchCV() does exactly this, just with k-fold cross-validation instead of a single fixed split.
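If you really do want every parameter combination to be evaluated on your existing, fixed validation set rather than on rotating folds, one option is scikit-learn's PredefinedSplit. A minimal sketch, assuming X_train/X_val/y_train/y_val are NumPy arrays and reusing the param_grid from above:

import numpy as np
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.tree import DecisionTreeClassifier

# -1 marks samples that always stay in the training split;
# 0 marks samples that form the single, fixed validation fold.
test_fold = np.concatenate([np.full(len(X_train), -1), np.zeros(len(X_val), dtype=int)])
ps = PredefinedSplit(test_fold)

X_opt = np.vstack((X_train, X_val))
y_opt = np.hstack((y_train, y_val))

grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                           param_grid, cv=ps, scoring='accuracy')
grid_search.fit(X_opt, y_opt)  # each candidate is trained on X_train and scored on X_val

Note that with the default refit=True, the returned best_estimator_ is refitted on the combined training and validation data afterwards.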

You would then want to analyse the grid search, using the actual folds it processed:

# Analyse grid search.
best_clf = grid_search.best_estimator_
results = grid_search.cv_results_  # Access fold-specific results.
num_folds = grid_search.cv  # Number of folds used by the grid search (5 here).

# For each fold, find the best score achieved by any parameter combination.
best_scores_per_fold = [np.max(results[f"split{i}_test_score"]) for i in range(num_folds)]

# Print the best scores per fold.
for i, score in enumerate(best_scores_per_fold, 1):
    print(f"Best score for fold {i}: {score}")

# Test your model, on the testing data.
y_test_pred = best_clf.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)
test_precision = precision_score(y_test, y_test_pred)
test_recall = recall_score(y_test, y_test_pred)
test_f1 = f1_score(y_test, y_test_pred)
print("Decision Tree Measurements on Test Set with Best Model:")
print("Accuracy:", test_accuracy)
print("Precision:", test_precision)
print("Recall:", test_recall)
print("F1 Score:", test_f1)
print("-------------------------------------------------------")