I'm thoroughly enjoying using PyCaret to handle much of the legwork in my analysis. I'm making heavy use of the setup() function to handle normalization, target transformation, and feature selection in my data. After creating and validating my model using the train/test sets that PyCaret generates, I'm aiming to run the model on an unseen dataset to mimic a real-world application. It would be nice to have PyCaret's preprocessing handle the unseen dataset as well, just as it did for train/test.
Towards Data Science has a great tutorial on analysis with PyCaret, but after using a variety of transformations in setup(), they appear to just feed the raw data_unseen set into the predict_model() function without any obvious preparation. Is there a way to use PyCaret's preprocessor on subsequent datasets that aren't train/test splits? Or do we need to do it without PyCaret?
Here is their code:
import pandas as pd
df = pd.read_csv('source/heart.csv')
df.head()
data = df.sample(frac=0.95, random_state=42)
data_unseen = df.drop(data.index)
data.reset_index(inplace=True, drop=True)
data_unseen.reset_index(inplace=True, drop=True)
print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))
Data for Modeling: (288, 14)
Unseen Data For Predictions: (15, 14)
from pycaret.classification import *
from imblearn.over_sampling import RandomOverSampler
model = setup(data = data, target = 'output', normalize = True, normalize_method='minmax', train_size = 0.8,fix_imbalance = True, fix_imbalance_method=RandomOverSampler(), session_id=123)
best = compare_models()
tuned_best = tune_model(best)
plot_model(tuned_best, plot = 'pr')
final_best = finalize_model(tuned_best)
predict_model(final_best)
predict_model(final_best, data = data_unseen)
PyCaret preprocesses (cleans) the data automatically. After calling setup(), you can retrieve the cleaned data with the get_config() function.
To get the cleaned dataset (all of it) without the target variable column, use
X = get_config('X')
To get the target variable column, use
y = get_config('y')
The whole dataset = X + y.
Also, the whole dataset = X_train + X_test + y_train + y_test.
Similarly, you can use
X_train = get_config('X_train')
and
y_train = get_config('y_train')
The accessible variables are listed in the get_config() documentation on pycaret.gitbook.io.
I think what you need is this one:
dataset_transformed = get_config('dataset_transformed')
Note: You can send the whole dataset to the setup() function without splitting it first (or you can set the splitting parameter to 1.0: train_size=1.0).
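For intuition about what reusing the preprocessing on data_unseen would mean with normalize_method='minmax', here is a small library-independent sketch. It is not PyCaret's implementation; the helper names fit_minmax and apply_minmax and the sample values are made up for illustration. The idea is that the minimum and maximum are learned from the modeling data only and then applied unchanged to unseen rows:

```python
# Plain-Python illustration (not PyCaret's code): min-max scaling where the
# min and max are learned from the modeling data only, then reused unchanged
# on unseen rows.

def fit_minmax(column):
    """Learn the scaling parameters (min, max) from the modeling data."""
    return min(column), max(column)

def apply_minmax(column, params):
    """Scale values using previously learned (min, max) parameters."""
    lo, hi = params
    span = hi - lo
    if span == 0:
        # A constant column carries no range; map everything to 0.0.
        return [0.0 for _ in column]
    return [(v - lo) / span for v in column]

# Hypothetical values for a single numeric feature.
modeling_ages = [29, 41, 54, 63, 77]
unseen_ages = [35, 80]

params = fit_minmax(modeling_ages)                      # learned once
scaled_modeling = apply_minmax(modeling_ages, params)   # modeling data
scaled_unseen = apply_minmax(unseen_ages, params)       # same params reused

print(scaled_modeling)
print(scaled_unseen)
```

An unseen value outside the range seen during fitting (80 here) ends up outside [0, 1]. That is expected when the learned parameters are reused rather than refit on the new data.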