Apply MinMaxScaler() to RFECV() with a pipeline


I'm trying to do feature selection using RFECV with LogisticRegression. I need to scale the data, because the regression does not converge otherwise. However, I think that scaling the full dataset first would bias the results, since information from the test folds would leak into training.

This is my code so far:

from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline

cv = StratifiedKFold(5)
scaler = MinMaxScaler()
reg = LogisticRegression(max_iter=1000, solver="newton-cg")
# Scale inside the pipeline so the scaler is fit on the training folds only
pipeline = Pipeline(steps=[("scale", scaler), ("lr", reg)])
visualizer = RFECV(pipeline, cv=cv, scoring="f1_weighted")

but it gives me this error:

Traceback (most recent call last):
  File "<ipython-input-267-0073ead26d52>", line 1, in <module>
    visualizer.fit(x_6, y_6)        # Fit the data to the visualizer
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_selection\_rfe.py", line 550, in fit
    scores = parallel(
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_selection\_rfe.py", line 551, in <genexpr>
    func(rfe, self.estimator, X, y, train, test, scorer)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_selection\_rfe.py", line 33, in _rfe_single_fit
    return rfe._fit(
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_selection\_rfe.py", line 204, in _fit
    raise RuntimeError('The classifier does not expose '
RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes

The error is raised by the visualizer.fit(x_6, y_6) call shown in the traceback.

I tried searching but I couldn't find anything useful. Any ideas what might be failing?

1 Answer

Answered by afsharov

This is a fairly common issue with Pipeline objects. By default, they do not expose the intrinsic feature importance measures (or other fitted attributes) of the estimators they contain. RFECV looks for coef_ or feature_importances_ on the object it is given, so you have to define a custom pipeline class that provides them.
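To see the problem directly, here is a minimal sketch on synthetic data. The make_classification dataset and its sizes are assumptions for illustration, not the asker's x_6 and y_6. After fitting, the coefficients live on the final step, not on the Pipeline itself:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption; not the asker's x_6, y_6)
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

pipe = Pipeline(steps=[("scale", MinMaxScaler()),
                       ("lr", LogisticRegression(max_iter=1000))])
pipe.fit(X, y)

# The Pipeline object itself has no coef_ attribute ...
pipeline_has_coef = hasattr(pipe, "coef_")
# ... but the fitted final step does.
step_coef_shape = pipe.named_steps["lr"].coef_.shape

print(pipeline_has_coef)  # False
print(step_coef_shape)    # (1, 5)
```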

Another answer has already provided a solution that exposes the feature importance measures:

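from sklearn.pipeline import Pipeline

class MyPipeline(Pipeline):
    """Pipeline that forwards the final estimator's importance attributes."""

    @property
    def coef_(self):
        # Coefficients of the last step (e.g. LogisticRegression)
        return self._final_estimator.coef_

    @property
    def feature_importances_(self):
        # Importances of the last step (e.g. tree-based models)
        return self._final_estimator.feature_importances_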

Using this, you would create your pipeline object like:

pipeline = MyPipeline(steps=[("scale",scaler),("lr",reg)])

Now the RFECV object can access the coefficients of the fitted LogisticRegression model with no issues.
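As a rough end-to-end check, the sketch below runs RFECV with such a subclass on synthetic data. The dataset, the make_steps helper, and the variable names are assumptions made for illustration, and the subclass here reads the last step through the public steps attribute instead of _final_estimator. It also shows an alternative that avoids subclassing, assuming scikit-learn 0.24 or later, where RFECV accepts an importance_getter argument that can point at an attribute of a named pipeline step:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler


class MyPipeline(Pipeline):
    """Pipeline that forwards coef_ from its final step."""

    @property
    def coef_(self):
        return self.steps[-1][1].coef_


def make_steps():
    # Hypothetical helper: fresh scaler and model, step names as in the question
    return [("scale", MinMaxScaler()),
            ("lr", LogisticRegression(max_iter=1000, solver="newton-cg"))]


# Synthetic stand-in data (assumption; not the asker's x_6, y_6)
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=0)
cv = StratifiedKFold(5)

# Option 1: the custom subclass, which exposes coef_ on the pipeline
rfecv_subclass = RFECV(MyPipeline(steps=make_steps()), cv=cv,
                       scoring="f1_weighted")
rfecv_subclass.fit(X, y)

# Option 2: plain Pipeline plus importance_getter (scikit-learn >= 0.24)
rfecv_getter = RFECV(Pipeline(steps=make_steps()), cv=cv,
                     scoring="f1_weighted",
                     importance_getter="named_steps.lr.coef_")
rfecv_getter.fit(X, y)

# Number of selected features and the boolean mask over the input columns
print(rfecv_subclass.n_features_, rfecv_subclass.support_)
# Whether both options picked the same columns
print(np.array_equal(rfecv_subclass.support_, rfecv_getter.support_))
```

In both options the scaler is refit on the training portion of each fold, so no test-fold information reaches the scaling step. Both read the same coefficients, so they would be expected to select the same features.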