I implemented the following scikit-learn pipeline in a file called build.py and later pickled it successfully.
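import xgboost as xgb
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# The encoders appear to be the category_encoders versions (inferred from
# parameters such as use_cat_names and return_df).
from category_encoders import OneHotEncoder, OrdinalEncoder, TargetEncoder

preprocessor = ColumnTransformer(transformers=[
    ('target', TargetEncoder(), COL_TO_TARGET),
    ('one_hot', OneHotEncoder(drop_invariant=False, handle_missing='value',
                              handle_unknown='value', return_df=True,
                              use_cat_names=True, verbose=0), COL_TO_DUM),
    ('construction', OrdinalEncoder(mapping=mapping), ['ConstructionPeriod'])
], remainder='passthrough')
test_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('std_scale', StandardScaler()),
    ('XGB_model',
     xgb.XGBRegressor(
         booster='gbtree', colsample_bylevel=0.75, colsample_bytree=0.75,
         max_depth=20, grow_policy='depthwise', learning_rate=0.1
     )
    )
])
test_pipeline.fit(X_train, y_train)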
import pickle
# pickle.dump takes the object first, then the open file
with open('final_pipeline.pkl', 'wb') as f:
    pickle.dump(test_pipeline, f)
The pickled pipeline is then loaded in a different file, app.py, which accepts user data and makes predictions with the unpickled pipeline.
with open('final_pipeline.pkl', 'rb') as f:
    pipeline = pickle.load(f)
# data comes from the user via the frontend
input_df = pd.DataFrame(data.dict(), index=[0])
# use the pipeline to predict
prediction = pipeline.predict(input_df)
The problem I am running into is that the unpickled pipeline expects the incoming data to have the same column structure as the data used to train it (X_train).
To solve this, I need to reorder the columns of the incoming data to match those of X_train.
- Dirty solution: export the X_train column names to a file and read that file in app.py to rearrange the columns of the incoming data.
Any suggestions on how to solve this in a more Pythonic way?
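One way to avoid a separate column-names file might be to pickle the column order together with the pipeline in a single object. The sketch below uses only the standard library. The helper names (dump_bundle, load_bundle, order_record), the plain dict standing in for the fitted pipeline, and the column names other than ConstructionPeriod are all made up for illustration.

```python
import io
import pickle


def dump_bundle(model, columns, fileobj):
    """Pickle the model together with the training column order."""
    pickle.dump({"pipeline": model, "columns": list(columns)}, fileobj)


def load_bundle(fileobj):
    """Return (model, columns) from a bundle written by dump_bundle."""
    bundle = pickle.load(fileobj)
    return bundle["pipeline"], bundle["columns"]


def order_record(record, columns):
    """Return a dict whose keys follow the training column order."""
    missing = [c for c in columns if c not in record]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    return {c: record[c] for c in columns}


# build.py side. The dict is a hypothetical stand-in for test_pipeline,
# and the list stands in for list(X_train.columns).
stand_in_model = {"note": "placeholder for the fitted pipeline"}
buffer = io.BytesIO()  # stands in for open('final_pipeline.pkl', 'wb')
dump_bundle(stand_in_model, ["Area", "Rooms", "ConstructionPeriod"], buffer)

# app.py side. The incoming keys arrive in a different order.
buffer.seek(0)
model, columns = load_bundle(buffer)
incoming = {"ConstructionPeriod": "1990-2000", "Area": 80, "Rooms": 3}
ordered = order_record(incoming, columns)
```

In app.py the reordered dict could then be passed to pd.DataFrame(ordered, index=[0]), since pandas keeps the dict's key order for the columns. Depending on the installed scikit-learn version (1.0 or newer, as far as I know), a pipeline fitted on a DataFrame may also expose the training column names as feature_names_in_, which could make the extra list unnecessary.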
Your column order shouldn't matter, but if it does, why not sort the columns of X_train before fitting the pipeline and sort the incoming columns the same way in your other file? That way you won't have to store anything locally.
df = df.reindex(sorted(df.columns), axis=1)
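A minimal sketch of applying that sort on both sides, assuming pandas DataFrames. The sort_columns helper and the toy column names (apart from ConstructionPeriod) are hypothetical, and the fit and predict calls are left out.

```python
import pandas as pd


def sort_columns(df):
    """Return df with its columns in sorted (case-sensitive) order."""
    return df.reindex(sorted(df.columns), axis=1)


# build.py side: sort before calling fit (toy frame, made-up values).
X_train = pd.DataFrame({
    "Rooms": [3, 4],
    "Area": [80, 120],
    "ConstructionPeriod": ["1990-2000", "2000-2010"],
})
X_train_sorted = sort_columns(X_train)

# app.py side: the frontend may send the keys in any order.
input_df = pd.DataFrame(
    {"ConstructionPeriod": "1990-2000", "Rooms": 2, "Area": 95},
    index=[0],
)
input_sorted = sort_columns(input_df)
```

Sorting only aligns the order. It does not detect missing or extra columns in the incoming data, so a separate check may still be worthwhile.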