Using pycaret's outlier and normalization features outside of model comparison


I'm thoroughly enjoying pycaret to handle much of the legwork in my analysis. I'm making heavy use of the setup() method in preprocessing to handle normalization, target transformation, and feature selection in my data. After creating and validating my model, using the train/test sets that pycaret generates, I'm aiming to run the model on an unseen dataset to mimic a real world application. It would be nice to make use of the pycaret preprocessing to handle the legwork on the unseen dataset, just as I did for train/test.

Towards Data Science has a great tutorial on analysis with PyCaret here, but after applying a variety of transformations in the setup() preprocessing step, they appear to feed the raw data_unseen set straight into the predict_model() method without any obvious preparation. Is there a way to use PyCaret's preprocessor on subsequent datasets that aren't train/test splits? Or do we need to do it without PyCaret?

Here is their code:

import pandas as pd

df = pd.read_csv('source/heart.csv')
df.head()

data = df.sample(frac=0.95, random_state=42)
data_unseen = df.drop(data.index)

data.reset_index(inplace=True, drop=True)
data_unseen.reset_index(inplace=True, drop=True)

print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))

Data for Modeling: (288, 14)
Unseen Data For Predictions: (15, 14)

from pycaret.classification import *
from imblearn.over_sampling import RandomOverSampler
model = setup(data = data, target = 'output', normalize = True,
              normalize_method = 'minmax', train_size = 0.8,
              fix_imbalance = True, fix_imbalance_method = RandomOverSampler(),
              session_id = 123)

best = compare_models()
tuned_best = tune_model(best)

plot_model(tuned_best, plot = 'pr')

final_best = finalize_model(tuned_best)

predict_model(final_best)                      # score on the hold-out set

predict_model(final_best, data = data_unseen)  # score on the unseen data

There are 2 answers

Alper Yilmaz

The PyCaret library performs preprocessing (cleaning) of the data automatically. Using the get_config() function, you can retrieve the cleaned data:

PyCaret provides a get_config() function (e.g. pycaret.classification.get_config()). The get_config function retrieves the global variables created when the setup function is initialized.

After calling the setup() function, you can use X = get_config('X') to get the whole cleaned dataset without the target variable column.

Whole dataset = X + y.

Also whole dataset = X_train + X_test + y_train + y_test.

To get the target variable column, use y = get_config('y'). Similarly, you can use X_train = get_config('X_train') and y_train = get_config('y_train').

Some accessible variables are:

  • X: Transformed dataset (X)
  • y: Transformed dataset (y)
  • X_train: Transformed train dataset (X)
  • X_test: Transformed test/holdout dataset (X)
  • y_train: Transformed train dataset (y)
  • y_test: Transformed test/holdout dataset (y)

PyCaret - get_config()

Here is a better link for get_config(): pycaret.gitbook.io get_config

I think what you need is this one: dataset_transformed = get_config('dataset_transformed')

Note: You can pass the whole dataset to the setup() function without splitting it (or set the split parameter to train_size = 1.0).
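The identities above ("whole dataset = X + y" and "whole dataset = X_train + X_test + y_train + y_test") can be sketched with plain pandas and scikit-learn. This is toy data standing in for PyCaret's internal frames; in real use, get_config() returns the actual transformed versions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the dataset PyCaret holds internally after setup()
df = pd.DataFrame({"age": [29, 54, 41, 63, 35, 48],
                   "chol": [204, 239, 226, 254, 192, 275],
                   "output": [0, 1, 0, 1, 0, 1]})

# Conceptually, get_config('X') is the dataset minus the target column,
# and get_config('y') is the target column itself.
X = df.drop(columns="output")
y = df["output"]

# Whole dataset = X + y
assert pd.concat([X, y], axis=1).equals(df)

# Likewise X_train/X_test/y_train/y_test partition the same rows,
# mirroring get_config('X_train'), get_config('y_test'), etc.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=123)
assert len(X_train) + len(X_test) == len(df)
assert len(y_train) + len(y_test) == len(df)
```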

Alper Yilmaz

Clarification needed: First, your question needs more detail. Your goal is not clear, nor is it clear which parts you want to automate (to avoid manual work).

Here is the general flow for classification: https://pycaret.gitbook.io/docs/

# Classification Functional API Example

# loading sample dataset
from pycaret.classification import *
from pycaret.datasets import get_data
data = get_data('juice')

# init setup
s = setup(data, target = 'Purchase', session_id = 123)

# model training and selection
best = compare_models()

# evaluate trained model
evaluate_model(best)

# predict on hold-out/test set
pred_holdout = predict_model(best)

# predict on new data
new_data = data.copy().drop('Purchase', axis = 1)
predictions = predict_model(best, data = new_data)

# save model
save_model(best, 'best_pipeline')

You need to call compare_models() after setup(). You can then use the selected model, which is named best here.
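The reason feeding raw new_data straight into predict_model() works is that the trained model PyCaret hands back sits inside a scikit-learn Pipeline with the fitted preprocessing steps included, so they are re-applied to any new data automatically. A minimal sketch of that idea in plain scikit-learn (toy data and a min-max scaler standing in for whatever setup() configured):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

# Training data: the min-max scaler is fit on this
X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0], [4.0, 800.0]])
y_train = np.array([0, 0, 1, 1])

# Scaler + model in one pipeline, analogous to PyCaret's setup() + model
pipe = Pipeline([("scale", MinMaxScaler()),
                 ("clf", LogisticRegression())])
pipe.fit(X_train, y_train)

# Raw, unscaled "unseen" data goes straight in; the pipeline applies the
# fitted scaling before predicting, just as predict_model(best, data=...)
# applies PyCaret's preprocessing to new data.
X_unseen = np.array([[1.5, 300.0], [3.5, 700.0]])
preds = pipe.predict(X_unseen)
print(preds)
```

This is why no manual preparation of data_unseen is needed: the preprocessing learned at setup() time travels with the model.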

Tutorial: Also PyCaret's binary classification tutorial can be of help to you: Colab - Binary Classification

Here is the github link of this tutorial: Github - Binary Classification

For more PyCaret tutorials: PyCaret Tutorials