How to set important features as an attribute on XGBRegressor and save it as part of the JSON when saving the model


I have trained an XGBRegressor model. Now I am trying to store the important features as an attribute on the model, and I want that attribute to be saved and restored along with the model.

I have two issues here:

1.

regressor.fit(X=X_train, y=y_train, eval_set=[(X_train, y_train), (X_validation, y_validation)], verbose=False)

feature_importance: List[Tuple[str, float]] = sorted(
    regressor.get_booster().get_score(importance_type="gain").items(),
    key=lambda x: x[1],
)
selected_features: List[str] = [x[0] for x in feature_importance if x[1] > 0]
setattr(regressor, "selected_features", selected_features)

The setattr call and the corresponding getattr call give me lint warnings (B010 and B009). Is there a better way to do this that avoids those warnings?

The getattr usage looks something like this:

def get_model_features(model: XGBRegressor) -> List[str] | None:
    return getattr(model, "selected_features") if (model is not None and isinstance(model, XGBRegressor)) else None
 
2. The attribute does not get saved in the JSON file. I am using the following call to save:

regressor.save_model(fname="model.json")

How can I accomplish this? I want to avoid pickle save/restore.


There are 2 answers

user1808924 (BEST ANSWER)

The attribute does not get saved in the json file

This is the expected behaviour.

The XGBRegressor.save_model(fname) method call simply "redirects" to the Booster.save_model(fname) method call. Any attributes that were defined in the top-most Scikit-Learn layer (such as custom feature importance attributes) will not be propagated along.

The underlying XGBoost model saver/loader (via JSON/UBJSON) does not contain any logic for maintaining custom model metadata. It stores only real model data, that is, data that XGBoost itself actually uses.
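To make the delegation concrete, here is a toy sketch of the pattern described above. ToyBooster and ToyRegressor are hypothetical stand-ins written only for illustration; they are not XGBoost classes, and the real implementation is more involved. The point is only that an attribute set on the outer wrapper never reaches a saver that lives on the inner object.

```python
import json
import os
import tempfile


class ToyBooster:
    """Hypothetical stand-in for the core model object (not xgboost.Booster)."""

    def __init__(self, trees):
        self.trees = trees

    def save_model(self, fname):
        # Only the data the core model itself uses is written out.
        with open(fname, "w") as f:
            json.dump({"trees": self.trees}, f)


class ToyRegressor:
    """Hypothetical stand-in for the Scikit-Learn style wrapper."""

    def __init__(self, trees):
        self._booster = ToyBooster(trees)

    def save_model(self, fname):
        # The wrapper forwards the call; its own attributes are never consulted.
        self._booster.save_model(fname)


def saved_keys(model):
    """Save the model to a temporary file and return the top-level JSON keys."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "model.json")
        model.save_model(path)
        with open(path) as f:
            return sorted(json.load(f))


model = ToyRegressor(trees=[1, 2, 3])
model.selected_features = ["f0", "f2"]  # attribute on the wrapper only
print(saved_keys(model))  # ['trees']
```

The wrapper-level selected_features attribute is still available in memory, but it is absent from the saved file because the inner saver never sees it.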

If you want to save Scikit-Learn wrappers together with custom attributes, you must keep using the pickle data format. There is no way around it.
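As for the lint warnings in the question: flake8-bugbear's B010 flags setattr with a constant attribute name, and B009 flags the two-argument getattr with a constant name. A minimal sketch of one way around both is shown here, assuming plain attribute assignment is acceptable for the wrapper object. The Model class is a hypothetical stand-in for a fitted XGBRegressor so that the snippet runs without XGBoost installed.

```python
from typing import List, Optional


class Model:
    """Hypothetical stand-in for a fitted XGBRegressor."""


def get_model_features(model: Optional[Model]) -> Optional[List[str]]:
    # A three-argument getattr returns None when the attribute is missing;
    # B009 only targets the two-argument form with a constant name.
    if isinstance(model, Model):  # isinstance already rejects None
        return getattr(model, "selected_features", None)
    return None


regressor = Model()
regressor.selected_features = ["f0", "f2"]  # plain assignment instead of setattr

print(get_model_features(regressor))  # ['f0', 'f2']
print(get_model_features(Model()))    # None, attribute was never set
print(get_model_features(None))       # None
```

The default argument also covers a model that was restored without the attribute, which would otherwise raise AttributeError.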

soumeng78

Based on input from @user1808924 and from a colleague, I ended up doing something similar to the following:

import base64
import json
import pickle
from typing import Any, Dict, List, Tuple

regressor.fit(X=X_train, y=y_train, eval_set=[(X_train, y_train), (X_validation, y_validation)], verbose=False)

# Sort features by gain and keep only those with a positive gain
feature_importance: List[Tuple[str, float]] = sorted(
    regressor.get_booster().get_score(importance_type="gain").items(),
    key=lambda x: x[1],
)
selected_features: List[str] = [x[0] for x in feature_importance if x[1] > 0]

# Store the pickled model as base64 text next to the feature list
model_data: Dict[str, Any] = {
    "model": base64.b64encode(pickle.dumps(regressor)).decode("utf-8"),
    "features": selected_features,
}

with open("XGBModel.json", "w") as json_file:
    json.dump(model_data, json_file)
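For completeness, restoring would be the reverse of the steps above. This is a rough sketch rather than the exact code I use: save_bundle and load_bundle are hypothetical helper names, and a plain dict stands in for the fitted regressor so that the snippet runs without XGBoost. Note that the model itself is still pickled inside the JSON, so the usual pickle caveats apply.

```python
import base64
import json
import os
import pickle
import tempfile
from typing import Any, List, Tuple


def save_bundle(model: Any, features: List[str], path: str) -> None:
    """Write the pickled model (as base64 text) and the feature list to JSON."""
    model_data = {
        "model": base64.b64encode(pickle.dumps(model)).decode("utf-8"),
        "features": features,
    }
    with open(path, "w") as json_file:
        json.dump(model_data, json_file)


def load_bundle(path: str) -> Tuple[Any, List[str]]:
    """Read the JSON file back and rebuild the model and the feature list."""
    with open(path) as json_file:
        model_data = json.load(json_file)
    # pickle.loads can execute arbitrary code, so only load files you trust.
    model = pickle.loads(base64.b64decode(model_data["model"]))
    return model, model_data["features"]


# A plain dict stands in for the fitted regressor in this demonstration.
stand_in_model = {"weights": [0.1, 0.2]}

with tempfile.TemporaryDirectory() as tmp:
    bundle_path = os.path.join(tmp, "XGBModel.json")
    save_bundle(stand_in_model, ["f0", "f2"], bundle_path)
    restored_model, restored_features = load_bundle(bundle_path)

print(restored_features)  # ['f0', 'f2']
```

With a real regressor, the unpickling side needs a compatible XGBoost version installed, the same as with any other pickle file.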