Feature mismatch: Prediction through scikit-learn Pipeline


I implemented the following scikit-learn pipeline in a file called build.py and then pickled it.

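# Imports assumed from the names used below; the encoder arguments match category_encoders
import xgboost as xgb
from category_encoders import OneHotEncoder, OrdinalEncoder, TargetEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

preprocessor = ColumnTransformer(transformers=[
    ('target', TargetEncoder(), COL_TO_TARGET),
    ('one_hot', OneHotEncoder(drop_invariant=False, handle_missing='value',
                              handle_unknown='value', return_df=True,
                              use_cat_names=True, verbose=0), COL_TO_DUM),
    ('construction', OrdinalEncoder(mapping=mapping), ['ConstructionPeriod'])
], remainder='passthrough')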

test_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('std_scale', StandardScaler()),
    ('XGB_model', xgb.XGBRegressor(
        booster='gbtree', colsample_bylevel=0.75, colsample_bytree=0.75,
        max_depth=20, grow_policy='depthwise', learning_rate=0.1
    ))
])
test_pipeline.fit(X_train, y_train)

import pickle

# pickle.dump takes the object first, then the open file handle
with open('final_pipeline.pkl', 'wb') as f:
    pickle.dump(test_pipeline, f)

The pickled pipeline is then loaded in a different file, app.py, which accepts user data and makes predictions with the unpickled pipeline.

import pickle
import pandas as pd

with open('final_pipeline.pkl', 'rb') as f:
    pipeline = pickle.load(f)

# data comes from the user via the frontend
input_df = pd.DataFrame(data.dict(), index=[0])

# use the pipeline to predict
prediction = pipeline.predict(input_df)

The problem is that the unpickled pipeline expects the incoming data to have the same column structure as the data used to train it (X_train); otherwise prediction fails with a feature mismatch error.

To solve this, I need to order the columns of the incoming data to match those of X_train.

  • Dirty solution: export the X_train column names to a file and read it back inside app.py to rearrange the columns of the incoming data.
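For illustration, that workaround could look roughly like the sketch below. The column names, the sample values and the train_columns.json file name are made up here, since the real X_train is not shown; only pandas and the standard library are assumed.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for X_train; the real columns are not shown in the question.
X_train = pd.DataFrame({
    "Area": [80, 120],
    "ConstructionPeriod": ["1990s", "2000s"],
    "City": ["A", "B"],
})

# build.py side: save the training column order next to the pickled pipeline.
# A temporary directory is used here only to keep the sketch self-contained.
columns_path = Path(tempfile.mkdtemp()) / "train_columns.json"
columns_path.write_text(json.dumps(list(X_train.columns)))

# app.py side: load the saved order and rearrange the incoming row to match it.
train_columns = json.loads(columns_path.read_text())
input_df = pd.DataFrame({"City": "B", "Area": 95, "ConstructionPeriod": "1990s"}, index=[0])
input_df = input_df[train_columns]  # raises KeyError if a training column is missing
```

Selecting with the saved list also drops any extra columns the frontend might send, and fails loudly when a training column is absent.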

Any suggestions on how to pythonically solve this?


There is 1 answer

secretive

Column order shouldn't matter, but if it does, why not sort the columns before fitting the pipeline and sort them the same way in your other file? That way you won't have to store anything locally.

df = df.reindex(sorted(df.columns), axis=1)
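A minimal sketch of applying that same sort on both sides might look like this. The sort_columns helper and the two small frames are hypothetical, standing in for X_train in build.py and the user input in app.py.

```python
import pandas as pd


def sort_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return df with its columns reordered alphabetically by name."""
    return df.reindex(sorted(df.columns), axis=1)


# Hypothetical training frame and incoming row with different column orders.
train = pd.DataFrame({"b": [1, 2], "a": [3, 4], "c": [5, 6]})
incoming = pd.DataFrame({"c": 7, "a": 8, "b": 9}, index=[0])

train_sorted = sort_columns(train)        # would run before pipeline.fit in build.py
incoming_sorted = sort_columns(incoming)  # would run before pipeline.predict in app.py
```

Sorting only aligns the two frames when they contain exactly the same set of column names; a missing or extra column still has to be handled separately. If the installed scikit-learn is 1.0 or later and the pipeline was fitted on a DataFrame, the fitted pipeline should also expose the training column names as pipeline.feature_names_in_, which may be worth checking as an alternative to sorting.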