I'm trying to figure out how to use RFE for regression problems, and I was reading some tutorials.
I found an example on how to use RFECV to automatically select the ideal number of features, and it goes something like:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV
# X (feature matrix) and target (class labels) are assumed to be defined already
rfecv = RFECV(estimator=RandomForestClassifier(random_state=101), step=1, cv=StratifiedKFold(10), scoring='accuracy')
rfecv.fit(X, target)
# indices of the features that were eliminated
print(np.where(~rfecv.support_)[0])
which I find pretty straightforward.
However, I was checking how to do the same thing using a RFE object, but in order to include cross-validation I only found solutions involving the use of pipelines, like:
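from numpy import mean
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor
# create a synthetic regression dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# create pipeline: feature selection step ('s') followed by the model ('m')
rfe = RFE(estimator=DecisionTreeRegressor(), n_features_to_select=5)
model = DecisionTreeRegressor()
pipeline = Pipeline(steps=[('s',rfe),('m',model)])
# evaluate model
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(pipeline, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
# report performance (the scores are negated MAE values, so the mean is negative)
print(f'MAE: {mean(n_scores):.3f}')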
I'm not sure about what precisely is happening here. The pipeline is used to queue the RFE algorithm and the second DecisionTreeRegressor (model). If I'm not wrong, the idea is that for every iteration in the cross-validation, the RFE is executed, the desired number of best features is selected, and then the second model is run using only those features. But how/when did the RFE pass the information about which features have been selected to the DecisionTreeRegressor? Did it even happen, or is the code missing this part?
Well, first, let's point out that RFECV and RFE are doing two separate jobs in your scripts: the former selects the optimal number of features, while the latter selects the five most important features (or, the best combination of 5 features, given their importance for the DecisionTreeRegressor).
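To make the difference concrete for a regression problem, here is a minimal sketch under a few assumptions: the synthetic dataset, the plain KFold splitter (StratifiedKFold is meant for class labels, so it is swapped out here), and the neg_mean_absolute_error scoring are illustrative choices, not taken from your tutorial.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, RFECV
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative sizes).
X, y = make_regression(n_samples=300, n_features=10, n_informative=5, random_state=1)

# RFECV: the number of features to keep is chosen by cross-validation.
rfecv = RFECV(
    estimator=DecisionTreeRegressor(random_state=1),
    step=1,
    cv=KFold(n_splits=5, shuffle=True, random_state=1),
    scoring="neg_mean_absolute_error",
)
rfecv.fit(X, y)
print("RFECV kept", rfecv.n_features_, "features")

# RFE: the number of features to keep is fixed in advance.
rfe = RFE(estimator=DecisionTreeRegressor(random_state=1), n_features_to_select=5)
rfe.fit(X, y)
print("RFE kept", rfe.n_features_, "features")
```

In both cases the boolean mask of the retained columns is available afterwards in the support_ attribute.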
Back to your question: "When did the RFE pass the information about which features have been selected to the Decision Tree?" It is worth noting that the RFE does not explicitly tell the Decision Tree which features are selected. It simply takes a matrix as input (the training set) and transforms it into a matrix of N columns, based on the n_features_to_select=N parameter. That matrix (i.e., the transformed training set) is passed as input to the Decision Tree, along with the target variable, and fitting returns a model that can be used to predict unseen instances.
Let's dive into an example for classification:
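A minimal sketch of such a setup could look like the following; keeping 5 features and fixing random_state=0 are assumptions made only for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Load the breast_cancer dataset as a feature matrix X and a target vector y.
X, y = load_breast_cancer(return_X_y=True)

# RFE wrapped around a decision tree; n_features_to_select=5 is an illustrative choice.
rfe = RFE(estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=5)
print(X.shape)
```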
We have now loaded the breast_cancer dataset and instantiated an RFE object (I used a DecisionTreeClassifier, but other algorithms can be used as well).
To see how the training data is handled within a pipeline, let's start with a manual example that shows how a pipeline would work if decomposed into its "basic steps":
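One way to write these basic steps by hand is sketched below. The helper name run_once, the 70/30 split, the seeds, and n_features_to_select=5 are all hypothetical choices for this illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rfe = RFE(estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=5)

def run_once(X, y, seed):
    """Hypothetical helper: one train/test round of selection, fitting and prediction."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    # fit_transform runs the elimination on the training data and stores
    # the resulting column mask in rfe.support_.
    X_train_sel = rfe.fit_transform(X_train, y_train)
    # transform only applies the stored mask; nothing is refitted on the test data.
    X_test_sel = rfe.transform(X_test)
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_train_sel, y_train)
    y_pred = model.predict(X_test_sel)
    return y_test, y_pred

# Repeat the procedure 3 times with different splits and collect the precision.
precisions = []
for seed in range(3):
    y_test, y_pred = run_once(X, y, seed)
    precisions.append(precision_score(y_test, y_pred))

print(f"Average precision: {np.mean(precisions):.3f}")
```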
In the above script, we created a function that takes a dataset X and a target variable y and splits them into a training and a testing set. When fit_transform is called on the RFE, it runs the Recursive Feature Elimination and saves information about the selected features in its object state. To know which features were selected, inspect rfe.support_. Note: on the testing set only transform is executed, so that the features in rfe.support_ are used to filter out the other features from the testing set.
The y_test and y_pred can be used to analyze the performance of the model, e.g., its precision. The precision is saved in an array, and the procedure is repeated 3 times. Finally, we print the average precision.
We simulated a cross-validation procedure by splitting the original data 3 times into training and testing sets, fitting a model on each split, and computing and averaging its performance (i.e., precision) across the three folds. This process can be simplified using a RepeatedKFold validation:
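A sketch of that simplification, where the splitter produces the train/test indices and the loop body stays the same. The values n_splits=3 and n_repeats=2 are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.metrics import precision_score
from sklearn.model_selection import RepeatedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rfe = RFE(estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=5)

# 3 folds repeated 2 times gives 6 train/test splits.
cv = RepeatedKFold(n_splits=3, n_repeats=2, random_state=0)

precisions = []
for train_idx, test_idx in cv.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Select features on the training fold, then apply the same mask to the test fold.
    X_train_sel = rfe.fit_transform(X_train, y_train)
    X_test_sel = rfe.transform(X_test)
    model = DecisionTreeClassifier(random_state=0).fit(X_train_sel, y_train)
    precisions.append(precision_score(y_test, model.predict(X_test_sel)))

print(f"Average precision: {np.mean(precisions):.3f}")
```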
and even further with Pipeline:
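With a Pipeline, the loop disappears entirely, because cross_val_score handles the splitting, fitting, and scoring. This is a sketch under the same illustrative assumptions (5 features, 3 folds, 2 repeats); the step names 's' and 'm' mirror the ones in your snippet.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Step 's' selects the features, step 'm' is the final model.
rfe = RFE(estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=5)
pipeline = Pipeline(steps=[("s", rfe), ("m", DecisionTreeClassifier(random_state=0))])

cv = RepeatedKFold(n_splits=3, n_repeats=2, random_state=0)
# One precision value per split; the pipeline is cloned and refitted for each one.
scores = cross_val_score(pipeline, X, y, scoring="precision", cv=cv)

print(f"Average precision: {np.mean(scores):.3f}")
```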
In summary, when the original data is passed to the Pipeline, each cross-validation split runs:
- RFE.fit_transform() on the training data;
- RFE.transform() on the testing data so that it consists of the same features;
- estimator.fit() on the training data to fit (i.e., train) a model;
- estimator.predict() on the testing data to predict it;
- the evaluation of the predictions (set by the scoring parameter) internally.
At the end of the procedure, you can access the performance results and average them across the folds.
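To check directly that the selection really is applied inside the pipeline, you can fit it once and compare its predictions with the manual chain of transform followed by predict. This is only a sketch: the single 70/30 split and n_features_to_select=5 are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

pipeline = Pipeline(steps=[
    ("s", RFE(estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=5)),
    ("m", DecisionTreeClassifier(random_state=0)),
])
pipeline.fit(X_train, y_train)

# The fitted steps can be retrieved by name.
selector = pipeline.named_steps["s"]
model = pipeline.named_steps["m"]

# Column indices the selector kept, and the number of columns the model was trained on.
print("Selected feature indices:", np.where(selector.support_)[0])
print("Features seen by the model:", model.n_features_in_)

# pipeline.predict is the same as transforming with the selector and then predicting.
manual_pred = model.predict(selector.transform(X_test))
print("Same predictions:", np.array_equal(manual_pred, pipeline.predict(X_test)))
```

The final model only ever receives the reduced matrix, which is why no explicit hand-off of feature names is needed.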