Can't predict new data after training RFE and model in a pipeline

I'm brand new to Python and machine learning and I'm surely missing something.

I'm training a RandomForest model with nested CV for hyperparameter tuning and RFECV for feature selection, combined in a pipeline. When I retrieve best_estimator_.n_features it still shows the 17 original features, even though RFECV narrowed them down to 3.

X
1182 rows × 17 columns

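from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline

# Inner loop: grid search over the random forest hyperparameters
cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
clf = RandomForestClassifier(random_state=42, n_jobs=-1, criterion='entropy', bootstrap=False)
space = {'n_estimators': [900, 1000, 1100],
         'max_depth': [25, 50, 100],
         'min_samples_split': [500, 750, 1000],
         'min_samples_leaf': [32, 64]
        }

search = GridSearchCV(clf, space, scoring='accuracy', n_jobs=1, cv=cv_inner, refit=True)
# Feature selection runs first; the grid search is the final pipeline step
rfe = RFECV(estimator=RandomForestClassifier())
ppln = Pipeline(steps=[('rfe', rfe), ('grid', search)])
# Outer loop: score the whole pipeline, then fit it on all the data
cv_outer = KFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(ppln, X, y.ravel(), scoring='accuracy', cv=cv_outer, n_jobs=-1)
ppln.fit(X, y.ravel())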

After fitting the pipeline I tried to predict on new data (fixt) with the original 17 features. However, the error message was: "ValueError: Number of features of the model must match the input. Model n_features is 17 and input n_features is 3."

import pandas as pd

fixtureXLS = pd.read_excel('aaafixtures.xlsx')
fixtureXLS.to_csv('bbbfixtures.csv', encoding='utf-8')
fixt = pd.read_csv('bbbfixtures.csv')
# Drop the index column written by to_csv, and the target column if present
fixt = fixt.loc[:, ~fixt.columns.str.contains('^Unnamed')]
if 'Result' in fixt.columns:
    fixt = fixt.drop(['Result'], axis=1)

fixt
287 rows × 17 columns
fixt['Predicted'] = ppln.predict(fixt)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-164-e54f4c6f6e05> in <module>
----> 1 temp = ppln.predict(fixt)

~\anaconda3\lib\site-packages\sklearn\utils\metaestimators.py in <lambda>(*args, **kwargs)
    117 
    118         # lambda, but not partial, allows help() to work with update_wrapper
--> 119         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
    120         # update the docstring of the returned function
    121         update_wrapper(out, self.fn)

~\anaconda3\lib\site-packages\sklearn\pipeline.py in predict(self, X, **predict_params)
    406         for _, name, transform in self._iter(with_final=False):
    407             Xt = transform.transform(Xt)
--> 408         return self.steps[-1][-1].predict(Xt, **predict_params)
    409 
    410     @if_delegate_has_method(delegate='_final_estimator')

~\anaconda3\lib\site-packages\sklearn\utils\metaestimators.py in <lambda>(*args, **kwargs)
    117 
    118         # lambda, but not partial, allows help() to work with update_wrapper
--> 119         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
    120         # update the docstring of the returned function
    121         update_wrapper(out, self.fn)

~\anaconda3\lib\site-packages\sklearn\model_selection\_search.py in predict(self, X)
    485         """
    486         self._check_is_fitted('predict')
--> 487         return self.best_estimator_.predict(X)
    488 
    489     @if_delegate_has_method(delegate=('best_estimator_', 'estimator'))

~\anaconda3\lib\site-packages\sklearn\ensemble\_forest.py in predict(self, X)
    627             The predicted classes.
    628         """
--> 629         proba = self.predict_proba(X)
    630 
    631         if self.n_outputs_ == 1:

~\anaconda3\lib\site-packages\sklearn\ensemble\_forest.py in predict_proba(self, X)
    671         check_is_fitted(self)
    672         # Check data
--> 673         X = self._validate_X_predict(X)
    674 
    675         # Assign chunk of trees to jobs

~\anaconda3\lib\site-packages\sklearn\ensemble\_forest.py in _validate_X_predict(self, X)
    419         check_is_fitted(self)
    420 
--> 421         return self.estimators_[0]._validate_X_predict(X, check_input=True)
    422 
    423     @property

~\anaconda3\lib\site-packages\sklearn\tree\_classes.py in _validate_X_predict(self, X, check_input)
    394         n_features = X.shape[1]
    395         if self.n_features_ != n_features:
--> 396             raise ValueError("Number of features of the model must "
    397                              "match the input. Model n_features is %s and "
    398                              "input n_features is %s "

ValueError: Number of features of the model must match the input. Model n_features is 17 and input n_features is 3 
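
For comparison, here is a minimal sketch of how a freshly fitted pipeline of this shape is expected to behave. It runs on synthetic data: make_classification stands in for the real X and y, and the small forests, the two-value grid and all the *_demo names are assumptions chosen only to keep it quick. Pipeline.predict runs the fitted selector first, so the pipeline accepts all 17 columns and the final forest only ever sees the selected ones. If the step objects were also fitted on their own elsewhere in the notebook, their state may differ from what this shows.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the real data: 17 features, 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=17, n_informative=3, n_redundant=0,
    random_state=0,
)

# Same layout as in the question, with much smaller models (assumption).
rfe_demo = RFECV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=0),
    step=2,
    cv=3,
)
search_demo = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    {"max_depth": [3, 5]},
    scoring="accuracy",
    cv=KFold(n_splits=3, shuffle=True, random_state=1),
)
ppln_demo = Pipeline(steps=[("rfe", rfe_demo), ("grid", search_demo)])
ppln_demo.fit(X_demo, y_demo)

# Number of features kept by the selector vs. seen by the tuned forest.
n_selected = ppln_demo.named_steps["rfe"].n_features_
n_seen_by_forest = ppln_demo.named_steps["grid"].best_estimator_.n_features_in_

# The pipeline itself is fed the full 17-column matrix.
pred_demo = ppln_demo.predict(X_demo)
print(n_selected, n_seen_by_forest, pred_demo.shape)
```

The two counts should agree, which makes this a quick check to run against the real ppln before calling predict.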

I then transformed fixt down to the 3 selected features and passed the result to the pipeline:

X_new = rfe.transform(fixt)
print(X_new.shape[1])
fixt['Predicted'] = ppln.predict(X_new)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-161-02280f45be5a> in <module>
----> 1 fixt['Predicted'] = ppln.predict(X_new)

~\anaconda3\lib\site-packages\sklearn\utils\metaestimators.py in <lambda>(*args, **kwargs)
    117 
    118         # lambda, but not partial, allows help() to work with update_wrapper
--> 119         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
    120         # update the docstring of the returned function
    121         update_wrapper(out, self.fn)

~\anaconda3\lib\site-packages\sklearn\pipeline.py in predict(self, X, **predict_params)
    405         Xt = X
    406         for _, name, transform in self._iter(with_final=False):
--> 407             Xt = transform.transform(Xt)
    408         return self.steps[-1][-1].predict(Xt, **predict_params)
    409 

~\anaconda3\lib\site-packages\sklearn\feature_selection\_base.py in transform(self, X)
     82             return np.empty(0).reshape((X.shape[0], 0))
     83         if len(mask) != X.shape[1]:
---> 84             raise ValueError("X has a different shape than during fitting.")
     85         return X[:, safe_mask(X, mask)]
     86 

ValueError: X has a different shape than during fitting.
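
This second traceback is what happens when the selector is applied twice: once by hand and once more inside Pipeline.predict. A small sketch of that behaviour follows, again on synthetic data. RFE with a fixed n_features_to_select=3 stands in for RFECV, and the *_toy names are illustrative assumptions. Data that has already been reduced has to go to the final step directly, not back through the whole pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.pipeline import Pipeline

# Synthetic stand-in data with 17 features (assumption).
X_toy, y_toy = make_classification(
    n_samples=200, n_features=17, n_informative=3, n_redundant=0,
    random_state=0,
)

pipe_toy = Pipeline(steps=[
    ("rfe", RFE(RandomForestClassifier(n_estimators=20, random_state=0),
                n_features_to_select=3, step=2)),
    ("clf", RandomForestClassifier(n_estimators=20, random_state=0)),
])
pipe_toy.fit(X_toy, y_toy)

# Reduce the data by hand with the fitted selector: 17 columns -> 3 columns.
X_reduced = pipe_toy.named_steps["rfe"].transform(X_toy)

# Passing the reduced data back into the pipeline applies the selector again,
# which rejects the 3-column input.
try:
    pipe_toy.predict(X_reduced)
    double_transform_failed = False
except ValueError:
    double_transform_failed = True

# Either give the pipeline the full data, or give the final step the reduced data.
pred_full = pipe_toy.predict(X_toy)
pred_manual = pipe_toy.named_steps["clf"].predict(X_reduced)
same_predictions = np.array_equal(pred_full, pred_manual)
print(X_reduced.shape, double_transform_failed, same_predictions)
```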


Can you help shed some light on this, please?

1 Answer

Tiago Makoto answered:

I don't know if there is an automated way to do it, but I created a new pipeline with a RandomForestClassifier built from the parameters of the best estimator in the previous pipeline, then fitted it and predicted. I had to apply RFE before it, though.

Instead of ppln.fit(X, y.ravel()), the final code was:

params = search.best_estimator_.get_params()
rfc = RandomForestClassifier(**params)
ppln_new = Pipeline(steps=[('rfe',rfe),('pred',rfc)])
ppln_new.fit(X, y.ravel())
fixt['Predicted'] = ppln_new.predict(fixt)
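
One possibly more automated layout, offered as a sketch and not as what was actually used above, is to turn the nesting around: put the selector and the classifier in one pipeline and let GridSearchCV search over that pipeline. With refit=True the search object refits the best pipeline on all the data and can predict straight from the raw feature matrix. The synthetic data, the small grid, the use of RFE with a fixed 3 features in place of RFECV, and the names inner_pipe and search_all are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline

# Synthetic stand-in data with 17 features (assumption).
X_syn, y_syn = make_classification(
    n_samples=200, n_features=17, n_informative=3, n_redundant=0,
    random_state=0,
)

# Selector and classifier live in one pipeline ...
inner_pipe = Pipeline(steps=[
    ("rfe", RFE(RandomForestClassifier(n_estimators=10, random_state=0),
                n_features_to_select=3, step=2)),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=42)),
])

# ... and the grid addresses the classifier through the "clf__" prefix.
param_grid = {"clf__max_depth": [3, 5], "clf__min_samples_leaf": [1, 4]}

search_all = GridSearchCV(
    inner_pipe, param_grid, scoring="accuracy",
    cv=KFold(n_splits=3, shuffle=True, random_state=1), refit=True,
)
search_all.fit(X_syn, y_syn)

# The refitted best pipeline selects features and predicts in one call.
pred_syn = search_all.predict(X_syn)
n_kept = search_all.best_estimator_.named_steps["rfe"].n_features_
print(search_all.best_params_, n_kept, pred_syn.shape)
```

Note that this reruns feature selection inside every grid-search fold, so it costs more than selecting once. In return, no parameters need to be copied between objects by hand.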