Feature mismatch: Prediction through scikit-learn Pipeline

Question

326 views Asked by eager_learner At 07 June 2021 at 16:14

I implemented the following scikit-learn pipeline in a file called build.py and later pickled it.

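# Imports assumed from the calls below (the encoder arguments match category_encoders).
import xgboost as xgb
from category_encoders import OneHotEncoder, OrdinalEncoder, TargetEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

preprocessor = ColumnTransformer(transformers=[
    ('target', TargetEncoder(), COL_TO_TARGET),
    ('one_hot', OneHotEncoder(drop_invariant=False, handle_missing='value',
                              handle_unknown='value', return_df=True,
                              use_cat_names=True, verbose=0), COL_TO_DUM),
    ('construction', OrdinalEncoder(mapping=mapping), ['ConstructionPeriod']),
], remainder='passthrough')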

test_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('std_scale', StandardScaler()),
    ('XGB_model', xgb.XGBRegressor(
        booster='gbtree', colsample_bylevel=0.75, colsample_bytree=0.75,
        max_depth=20, grow_policy='depthwise', learning_rate=0.1,
    )),
])
test_pipeline.fit(X_train, y_train)

import pickle

# pickle.dump takes the object first, then the open file
with open('final_pipeline.pkl', 'wb') as f:
    pickle.dump(test_pipeline, f)

The pickled pipeline is then loaded in a different file, app.py, which accepts user data and makes predictions with the unpickled pipeline.

import pickle
import pandas as pd

with open('final_pipeline.pkl', 'rb') as f:
    pipeline = pickle.load(f)

# data comes from the user via the frontend
input_df = pd.DataFrame(data.dict(), index=[0])

# use the pipeline to predict
prediction = pipeline.predict(input_df)

The problem I am running into is that the unpickled pipeline expects the incoming data to have the same column structure as the data used to train it (X_train).

To solve this, I need to reorder the columns of the incoming data to match those of X_train.

A dirty solution: export the X_train column names to a file, then read that file inside app.py to rearrange the columns of the incoming data.
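
The workaround described above could be sketched roughly as follows. This is only an illustration using the standard library: the JSON file name, the helper functions (save_columns, load_columns, order_record) and every column name except ConstructionPeriod are hypothetical, and in build.py the list would come from list(X_train.columns).

```python
import json
import tempfile
from pathlib import Path


def save_columns(columns, path):
    """Write the training column names, in order, to a JSON file."""
    Path(path).write_text(json.dumps(list(columns)))


def load_columns(path):
    """Read the training column names back in their original order."""
    return json.loads(Path(path).read_text())


def order_record(record, columns):
    """Return a dict whose keys follow `columns`; fail loudly on a mismatch."""
    missing = [c for c in columns if c not in record]
    extra = [k for k in record if k not in columns]
    if missing or extra:
        raise ValueError(f"missing columns: {missing}, unexpected columns: {extra}")
    return {c: record[c] for c in columns}


# build.py side: hypothetical stand-in for list(X_train.columns).
train_columns = ["LivingArea", "ConstructionPeriod", "City"]
columns_file = Path(tempfile.mkdtemp()) / "train_columns.json"  # hypothetical location
save_columns(train_columns, columns_file)

# app.py side: the user payload (data.dict() in the question) arrives in arbitrary key order.
payload = {"City": "Utrecht", "LivingArea": 85, "ConstructionPeriod": "1990-2000"}
ordered = order_record(payload, load_columns(columns_file))
print(list(ordered))
```

Because pandas keeps dict key order, pd.DataFrame(ordered, index=[0]) would then have its columns in the training order.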

Any suggestions on how to solve this pythonically?

There is 1 answer

**secretive** · Answer 1 · 2021-06-07T17:24:19+00:00

Your column order shouldn't matter, but if it does, why not sort the columns in your pipeline code and then sort them the same way in your other file? That way you won't have to store anything locally.

df = df.reindex(sorted(df.columns), axis=1)
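
As a minimal sketch of how that one-liner might be applied on both sides, the same helper can be called before fitting and before predicting. The helper name sort_columns and the toy frames are hypothetical stand-ins for X_train and the incoming record.

```python
import pandas as pd


def sort_columns(df):
    """Return df with its columns reordered alphabetically."""
    return df.reindex(sorted(df.columns), axis=1)


# Hypothetical stand-ins for X_train and for a single incoming record.
X_train = pd.DataFrame({"b": [1, 2], "a": [3, 4], "c": [5, 6]})
input_df = pd.DataFrame({"c": 7, "a": 8, "b": 9}, index=[0])

X_train_sorted = sort_columns(X_train)  # pass this to pipeline.fit(...)
input_sorted = sort_columns(input_df)   # pass this to pipeline.predict(...)

print(list(X_train_sorted.columns), list(input_sorted.columns))
```

This only lines the two frames up when they contain exactly the same set of column names. Sorting does not detect a missing or extra column.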
