Toolset Versions:
python: 3.10.13 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:24:38) [MSC v.1916 64 bit (AMD64)]
pycaret: 3.1.0
First post - please be gentle if I don't provide all the info needed. ;)
I'm using pycaret to perform a binary classification. Here is my setup:
from pycaret.classification import ClassificationExperiment

exp = ClassificationExperiment()
exp.setup(filtered_train, target='diseaseweek', fix_imbalance=True, log_experiment=True,
          experiment_name='exp1 - full feature set, PCA',
          normalize=True, remove_multicollinearity=True, pca=True, pca_components=0.95,
          session_id=123)
And here are the results of compare_models(). I've left the holdout split at the PyCaret default.
This is the resulting ROC curve:
I finalize the model: [image: Model finalization]
And get this resulting ROC curve: [image: ROC curve after finalization]
Any guidance on the levers I can pull to reduce the overfitting I'm seeing? Thanks!
I ran a PyCaret classification model using the default 10-fold stratified cross-validation and the default 70:30 holdout split, and saw good cross-validation performance. But when I finalized the model on the entire dataset, the performance of the final model was greatly reduced. I'm not sure what actions I can take to reduce the overfitting.
I should add that I get very similar results with a tuned version of the same model (tuned with 50 iterations).
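For anyone who wants to reproduce the kind of comparison I'm describing without my data, here is a minimal sketch of the evaluation scheme (70:30 stratified holdout, 10-fold stratified cross-validation on the training part) written directly in scikit-learn. This is an assumption-laden stand-in, not what PyCaret does internally: the data is synthetic (make_classification), the model is a plain LogisticRegression, the helper name cv_vs_holdout_auc is made up for illustration, and the preprocessing from my setup (imbalance fix, normalization, multicollinearity removal, PCA) is left out.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split


def cv_vs_holdout_auc(X, y, seed=123):
    """Hypothetical helper: return (mean 10-fold CV AUC, holdout AUC)."""
    # 70:30 stratified split; the 30% part is never seen during fitting.
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )

    # Placeholder model; swap in whichever estimator is being evaluated.
    model = LogisticRegression(max_iter=1000)

    # 10-fold stratified cross-validation on the training part only.
    cv = StratifiedKFold(n_splits=10)
    cv_auc = cross_val_score(
        model, X_train, y_train, cv=cv, scoring="roc_auc"
    ).mean()

    # Refit on the whole training part and score the untouched holdout.
    model.fit(X_train, y_train)
    hold_auc = roc_auc_score(y_hold, model.predict_proba(X_hold)[:, 1])
    return cv_auc, hold_auc


# Synthetic, imbalanced binary data standing in for the real dataset.
X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.85, 0.15], random_state=123
)
cv_auc, hold_auc = cv_vs_holdout_auc(X, y)
print(f"mean CV AUC: {cv_auc:.3f}  holdout AUC: {hold_auc:.3f}")
```

A large gap between the two numbers is the symptom I'm calling overfitting. Note that once a model has been refit on all of the data, the original holdout rows are no longer unseen, so a score computed on them after that point is not comparable to the holdout AUC above.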