What is the correct way to use standardization/normalization in combination with K-Fold Cross Validation?

I have always learned that standardization or normalization should be fit only on the training set, and then be used to transform the test set. So what I'd do is:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit on the training set only
X_test = scaler.transform(X_test)        # reuse the training-set statistics

Now if I were to use this model on new data, I could just save 'scaler' and load it in any new script.
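For example (joblib is just my choice for persistence here, and X_new stands for hypothetical new data):

import joblib

joblib.dump(scaler, "scaler.joblib")   # save the fitted scaler

# later, in another script:
scaler = joblib.load("scaler.joblib")
X_new_scaled = scaler.transform(X_new)  # X_new: hypothetical new data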

I'm having trouble, though, understanding how this works for k-fold CV. Is it best practice to re-fit and transform the scaler on every fold? I can see how that works while building the model, but what if I want to use the model later on? Which scaler should I save?

Further, I want to extend this to time-series data. I understand how k-fold CV works for time series, but again, how do I combine it with scaling? In that case I would suggest saving the very last scaler, since it would be fit on 4/5 of the data (in the case of k=5), i.e., on the most data and the most recent data. Would that be the correct approach?
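To illustrate what I mean, here is a rough sketch of that last-scaler idea, assuming scikit-learn's TimeSeriesSplit as the time-series splitter (X is a NumPy feature array):

from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X[train_idx])  # re-fit per fold
    X_test_scaled = scaler.transform(X[test_idx])
    # ... train and evaluate the model on this fold ...

# After the loop, 'scaler' is the one fit on the last (largest, most
# recent) training split -- the one I'm considering saving.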

1 Answer

C8H10N4O2 (accepted answer):

Is it best practice to re-fit and transform the scaler on every fold?

Yes. You might want to read scikit-learn's doc on cross-validation:

Just as it is important to test a predictor on data held-out from training, preprocessing (such as standardization, feature selection, etc.) and similar data transformations similarly should be learnt from a training set and applied to held-out data for prediction.
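In practice, the easiest way to honor this in scikit-learn is to wrap the scaler and the predictor in a pipeline and pass it to cross_val_score, which re-fits the preprocessing on each fold's training split automatically. A minimal sketch (LogisticRegression is just a stand-in predictor; X and y are your features and labels):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = make_pipeline(StandardScaler(), LogisticRegression())
# cross_val_score clones the pipeline for each fold, so the scaler is
# fit on that fold's training split only and applied to its held-out split.
scores = cross_val_score(pipe, X, y, cv=5)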

Which scaler should I save?

Save the scaler (and any other preprocessing, ideally bundled in a pipeline) and the predictor, trained on all of your training data, not just the (k-1)/k of it seen in each cross-validation fold or the 70% from a single train/test split.

  • If you're doing a regression model, it's that simple.

  • If your model training requires a hyperparameter search using cross-validation (e.g., a grid search over xgboost learning parameters), then you have already gathered information from across the folds, so you need a separate test set to estimate true out-of-sample performance. (Once you have made that estimate, you can retrain yet again on the combined train+test data; this final step is not always done for neural networks that are parameterized for a particular sample size.) A sketch of this workflow follows below.
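A rough sketch of that workflow, assuming scikit-learn's GridSearchCV with Ridge as a stand-in model (the alpha grid, test-set fraction, and file name are placeholders):

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
import joblib

# Hold out a test set for the final out-of-sample estimate.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

pipe = Pipeline([("scaler", StandardScaler()), ("model", Ridge())])
search = GridSearchCV(pipe, {"model__alpha": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)            # scaler is re-fit inside every fold
print(search.score(X_test, y_test))     # out-of-sample performance estimate

# Optionally retrain the winning pipeline on all of the data, then save it.
final_model = search.best_estimator_.fit(X, y)
joblib.dump(final_model, "model.joblib")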