Feature Importance using Imbalanced-learn library


The imblearn library is used for imbalanced classification problems. It lets you use scikit-learn estimators while balancing the classes with a variety of methods, from undersampling to oversampling to ensembles.

My question, however, is: how can I get the feature importance of the estimator after using BalancedBaggingClassifier or any other sampling method from imblearn?

from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Imbalanced toy dataset: roughly a 10%/90% class split over 20 features
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0, n_features=20,
                           n_clusters_per_class=1, n_samples=1000, random_state=10)
print('Original dataset shape {}'.format(Counter(y)))
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging of decision trees, each fitted on a balanced resample of the data
bbc = BalancedBaggingClassifier(random_state=42, n_estimators=2000,
                                base_estimator=DecisionTreeClassifier(
                                    criterion='gini',  # criteria_ was undefined; 'gini' assumed
                                    max_features='sqrt', random_state=1))
bbc.fit(X_train, y_train)

There are 3 answers

Jeremy McGibbon (accepted answer)

Not all estimators in sklearn allow you to get feature importances (for example, BaggingClassifier doesn't). If the estimator does, it looks like it should just be stored as estimator.feature_importances_, since the imblearn package subclasses from sklearn classes. I don't know what estimators imblearn has implemented, so I don't know if there are any that provide feature_importances_, but in general you should look at the sklearn documentation for the corresponding object to see if it does.
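
A quick way to check, as a minimal sketch using the bbc object from the question (any fitted estimator works the same way), is to probe for the attribute before relying on it:

# Probe the fitted ensemble for impurity-based importances; bagging
# classifiers generally do not expose this attribute themselves.
if hasattr(bbc, 'feature_importances_'):
    print(bbc.feature_importances_)
else:
    print('No feature_importances_ here; inspect the sub-estimators instead')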

You can, in this case, look at the feature importances for each of the estimators within the BalancedBaggingClassifier, like this:

# Each fitted estimator is a Pipeline of (sampler, classifier);
# steps[1][1] is the underlying DecisionTreeClassifier.
for estimator in bbc.estimators_:
    print(estimator.steps[1][1].feature_importances_)

And you can print the mean importance across the estimators like this:

import numpy as np

print(np.mean([est.steps[1][1].feature_importances_
               for est in bbc.estimators_], axis=0))
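
As a small follow-up sketch (reusing bbc and the numpy import above; the variable name mean_importances is mine), you can store that mean and rank the features by it:

# Rank features by importance averaged over the bagged trees (top 5 shown)
mean_importances = np.mean([est.steps[1][1].feature_importances_
                            for est in bbc.estimators_], axis=0)
for idx in np.argsort(mean_importances)[::-1][:5]:
    print('feature {}: {:.4f}'.format(idx, mean_importances[idx]))
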
mamafoku

There is a shortcut around this, though it is not very efficient. BalancedBaggingClassifier applies RandomUnderSampler successively and fits the estimator on each resampled set. A for-loop with RandomUnderSampler is one way to bypass the pipeline and call the scikit-learn estimator directly. This also lets you look at feature_importances_:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from imblearn.under_sampling import RandomUnderSampler

my_list = []
for i in range(0, 10):  # random under-sampling 10 times
    # A different seed each iteration, otherwise every draw is identical
    rus = RandomUnderSampler(random_state=i)
    X_pl, y_pl = rus.fit_resample(X_train, y_train)
    my_list.append((X_pl, y_pl))  # forming tuples from samples

X_pl = []
Y_pl = []
for num in range(0, len(my_list)):  # creating the DataFrames for input/output
    X_pl.append(pd.DataFrame(my_list[num][0]))
    Y_pl.append(pd.DataFrame(my_list[num][1]))

X_pl_ = pd.concat(X_pl)  # concatenating the DataFrames
Y_pl_ = pd.concat(Y_pl)

RF = RandomForestClassifier(n_estimators=2000, criterion='gini',
                            max_features='sqrt',  # max_features=25 exceeds the 20 features here; 'sqrt' assumed
                            random_state=1)
RF.fit(X_pl_, Y_pl_.values.ravel())
RF.feature_importances_
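
As a quick sanity check (a sketch reusing the Counter import from the question and the variables above, not part of the original answer), you can confirm that the concatenated resampled target is balanced before trusting the forest's importances:

# The pooled under-sampled targets should contain roughly equal class counts
print(Counter(Y_pl_.values.ravel()))
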
Raymond Reddington

According to the scikit-learn documentation, you can compute impurity-based feature importances with a forest classifier (such as ExtraTreesClassifier) for classifiers that don't provide their own. Here my classifier doesn't have feature_importances_, so I'm adding it directly:

from sklearn.ensemble import ExtraTreesClassifier

classifier.fit(x_train, y_train)

...
...

# Train a forest with the same settings and borrow its impurity-based importances
forest = ExtraTreesClassifier(n_estimators=classifier.n_estimators,
                              random_state=classifier.random_state)

forest.fit(x_train, y_train)
classifier.feature_importances_ = forest.feature_importances_
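
As a brief follow-up sketch (assuming x_train is a NumPy array with one column per feature, which is not stated in the original answer), the borrowed attribute can be sanity-checked the same way a native one would be:

# One importance per feature, and impurity-based importances sum to ~1
print(len(classifier.feature_importances_) == x_train.shape[1])
print(classifier.feature_importances_.sum())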