Apologies if this is obvious, but I couldn't find a clear answer to this:
Say I've used a pretty typical pipeline:
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import RandomizedLogisticRegression
from sklearn.pipeline import Pipeline

feat_sel = RandomizedLogisticRegression()
clf = RandomForestClassifier()
pl = Pipeline([('preprocessing', preprocessing.StandardScaler()),
               ('feature_selection', feat_sel),
               ('classification', clf)])
pl.fit(X, y)
Now when I apply pl to a new set,

pl.predict(X_classify)

is RandomizedLogisticRegression going to be reapplied, or are the columns that were selected during training going to be used on the new data? If not, is there a way for the pipeline to differentiate between feature selectors and feature extractors/scalers/other transforms that should be applied to the new input? Until I'm sure, I'm skipping the pipeline feature and just doing each step manually while maintaining state, as sketched below.
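For reference, here is roughly what I mean by doing it manually. This is only a sketch using the same variable names as above, and it assumes the selector exposes transform the same way it does inside the pipeline:

# Manual version: fit each step myself and keep the fitted objects around.
scaler = preprocessing.StandardScaler().fit(X)
X_scaled = scaler.transform(X)

feat_sel.fit(X_scaled, y)                     # learn which columns to keep
clf.fit(feat_sel.transform(X_scaled), y)      # train on the selected columns

# At prediction time: reuse the fitted scaler and selector, no refitting.
X_new = feat_sel.transform(scaler.transform(X_classify))
predictions = clf.predict(X_new)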
Thanks!
The pipeline calls transform on the preprocessing and feature selection steps when you call pl.predict. That means the features selected during training will be selected from the test data (the only thing that makes sense here). It is unclear what you mean by "apply": nothing new will be learned when calling predict, but all steps will be used via transform.