How to extract feature names used by the first tree in GradientBoostingRegressor in scikit-learn


I want to print the names of the features used by the first estimator of my GradientBoostingRegressor, but I get the error below. scikit-learn version: 1.2.2

model.estimators_[0]._final_estimator.feature_names_in_

output:
AttributeError                            Traceback (most recent call last)
Cell In[115], line 1
----> 1 model.estimators_[0]._final_estimator.feature_names_in_

AttributeError: 'GradientBoostingRegressor' object has no attribute 'feature_names_in_'

Answer by DataJanitor (best answer)

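You say you specifically want the feature names for the first estimator of the ensemble. Unfortunately, feature names are not stored on the individual trees, and an estimator only gets a feature_names_in_ attribute when it is fit on input with string column names (for example a pandas DataFrame). That is why you get the error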

AttributeError: 'GradientBoostingRegressor' object has no attribute 'feature_names_in_'

However, every tree is trained on the same set of input features as the whole model, so the feature names stored on the main GradientBoostingRegressor apply to each of its decision trees. If the model was fit on a DataFrame, you can get the feature names of the ensemble (and thus the ones available to the first tree) like this:

model.feature_names_in_

If you are interested in the names of the features that the first tree actually uses for its splits, you can do it like this:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import fetch_california_housing

# Load the dataset
data = fetch_california_housing()
X, y = data.data, data.target
feature_names = data.feature_names

# Create and fit the GradientBoostingRegressor
model = GradientBoostingRegressor(max_features=0.5, random_state=0)
model.fit(X, y)  # Directly fit on X, y without converting to DataFrame

# Access the tree fitted in the first boosting stage
first_tree = model.estimators_[0, 0]

# Get the feature indices used for splits in the first tree; leaf nodes carry a negative placeholder
used_feature_indices = sorted({i for i in first_tree.tree_.feature if i >= 0})

# Map indices to feature names
used_feature_names = [feature_names[i] for i in used_feature_indices]

print("All feature names:", feature_names)
print("Names of features used in the first tree:", used_feature_names)
print("Names of features not used in the first tree:", set(feature_names) - set(used_feature_names))