I'm using the RFECV class in sklearn to find the optimal number of features, i.e. the subset size that yields the highest cross-validation score on 2 folds. I am using a Ridge regressor as my estimator.
rfecv = RFECV(estimator=ridge, step=1, cv=KFold(n_splits=2))
rfecv.fit(df, y)
I have 5 features in my dataset, which I have standardized using StandardScaler.
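For reference, here is roughly what my setup looks like end to end (the random data below is just a stand-in for my real df and y):

import numpy as np
import pandas as pd
from sklearn.feature_selection import RFECV
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

# Stand-in data: 5 features, as in my real dataset
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 5)), columns=[f"x{i}" for i in range(5)])
y = rng.normal(size=100)

# Standardize the features, then run RFECV with a Ridge estimator
df[:] = StandardScaler().fit_transform(df)

ridge = Ridge()
rfecv = RFECV(estimator=ridge, step=1, cv=KFold(n_splits=2))
rfecv.fit(df, y)
print(rfecv.n_features_)  # reported optimal number of features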
When I run RFECV on my data, it says that 2 features is optimal. But when I remove the feature with the lowest regression coefficient and rerun RFECV, it now says that 3 features is optimal.
When I step through the features one at a time like this (which is what the recursive elimination is supposed to do for me), I find that 3 is in fact optimal.
I've tested this with other datasets, and have found that the optimal number of features changes as I remove features one at a time and rerun RFECV.
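Concretely, the manual experiment looks roughly like this (a sketch assuming df is a DataFrame; at each step I drop whichever column has the smallest absolute Ridge coefficient and rerun RFECV):

# Re-run RFECV while manually dropping the weakest feature each time,
# recording what it reports as the optimal number of features.
remaining = df.copy()
while remaining.shape[1] > 1:
    rfecv = RFECV(estimator=Ridge(), step=1, cv=KFold(n_splits=2))
    rfecv.fit(remaining, y)
    print(remaining.shape[1], "features in ->", rfecv.n_features_, "reported optimal")

    coefs = Ridge().fit(remaining, y).coef_
    weakest = remaining.columns[np.argmin(np.abs(coefs))]
    remaining = remaining.drop(columns=weakest)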
I might be missing something, but isn't this exactly the problem RFECV is supposed to solve? Any additional insights on RFECV are appreciated.
This actually makes sense. RFECV recommends a number of features based on the data you give it. When you remove a feature, you change the data being scored, so the cross-validation scores, and therefore the recommended number of features, change as well.
from the docs:
...
n_features_to_select
is used to determine how many features should be used by RFE at any particular iteration (under the hood of RFECV). So this is directly tied to the number of features you include in your initial rfecv.fit() step.
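You can see this directly by inspecting the per-subset CV scores that RFECV computed; in recent sklearn versions (>= 1.0) they live in cv_results_, while older versions expose grid_scores_ instead. For example:

# After rfecv.fit(df, y): one mean CV score per candidate number of features.
# The "optimal" number is simply the argmax of this curve, so it is recomputed
# for whatever set of columns you pass into fit().
scores = rfecv.cv_results_["mean_test_score"]
for n, score in enumerate(scores, start=rfecv.min_features_to_select):
    print(f"{n} features: mean CV score = {score:.4f}")
print("reported optimum:", rfecv.n_features_)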
Also, removing the feature with the lowest regression coefficient is not the best way to trim features. A coefficient reflects that feature's impact on the dependent variable, not necessarily its contribution to the model's accuracy.
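If you do want to trim features, it is usually better to rely on the elimination's own output, e.g. the support_ and ranking_ attributes RFECV exposes after fitting, rather than dropping columns by raw coefficient size. A quick sketch (assuming df is a DataFrame):

# support_ marks the columns RFECV kept; ranking_ gives each column's
# elimination rank (1 = selected, larger = eliminated earlier).
rfecv.fit(df, y)
kept = df.columns[rfecv.support_]
print("selected features:", list(kept))
print("rankings:", dict(zip(df.columns, rfecv.ranking_)))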