I will try to keep this as specific as possible, although it is partly a general question. I have a heavily skewed dataset with class proportions on the order of { 'Class 0': 0.987, 'Class 1': 0.012 }.
I would like to find a set of classifiers that work well on such datasets and then build an ensemble learner from those models. I do not think I want to oversample or undersample, and I definitely don't want to use SMOTE, because it doesn't scale well to high-dimensional data or results in a very large number of data points. I want to take a cost-sensitive approach instead, which is how I came across the `class_weight='balanced'` parameter in the scikit-learn library. However, it doesn't seem to help much: my F1 scores are still terrible (around 0.02). I have also tried using `sklearn.utils.class_weight.compute_class_weight` to calculate the weights manually, storing them in a dictionary and passing it as the `class_weight` parameter, but I see no improvement in the F1 score. My false positives are still very high (around 5k), while everything else is quite low (less than 50). I don't understand what I am missing. Am I implementing something wrong? What else can I do to tackle my problem? When I change my evaluation metric from `f1_score(average='binary')` to `f1_score(average='weighted')`, the F1 score increases from ~0.02 to ~98.66, which I think is probably wrong. Any help, including references on how to tackle this problem, would be much appreciated.
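To illustrate why the two averaging modes can disagree so sharply, here is a minimal sketch on made-up labels (not the data from the question). The label counts and the hypothetical predictions are assumptions chosen only to mimic a roughly 99:1 split. `average='binary'` scores the positive class alone, while `average='weighted'` averages the per-class F1 scores weighted by support, so the majority class dominates the result.

```python
import numpy as np
from sklearn.metrics import f1_score

# Synthetic labels (an assumption for illustration): 988 negatives, 12 positives.
y_true = np.array([0] * 988 + [1] * 12)

# Hypothetical predictions: 5 false positives, 1 true positive, 11 missed positives.
y_pred = np.zeros_like(y_true)
y_pred[:5] = 1    # five negatives wrongly flagged as positive
y_pred[988] = 1   # the only positive that is detected

# F1 of the positive class (label 1) only.
f1_binary = f1_score(y_true, y_pred, average="binary")
# Support-weighted mean of the per-class F1 scores.
f1_weighted = f1_score(y_true, y_pred, average="weighted")
# One F1 score per class, in label order [0, 1].
f1_per_class = f1_score(y_true, y_pred, average=None)

print("binary  :", f1_binary)
print("weighted:", f1_weighted)
print("per class:", f1_per_class)
```

In this sketch the weighted score is high even though the minority class is detected poorly, so a large jump between the two averages does not by itself indicate a bug. If the minority class is the one that matters, the binary (or per-class) score is probably the more informative number.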
I am trying to use the XGBoost, CatBoost, LightGBM, Logistic Regression, `SVC(kernel='linear')`, and Random Forest classifiers.
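For reference, the cost-sensitive setup described above might look like the following sketch for the scikit-learn estimators in that list. The synthetic dataset from `make_classification`, the sample size, and the hyperparameters are all assumptions for illustration. The boosting libraries expose their own weighting parameters and are not shown here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced data (an assumption; stands in for the real dataset).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.987, 0.013], random_state=0
)

# 'balanced' weights are n_samples / (n_classes * count_of_each_class).
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}

# Pass the same dictionary to every estimator that accepts class_weight.
models = {
    "logreg": LogisticRegression(class_weight=class_weight, max_iter=1000),
    "linear_svc": SVC(kernel="linear", class_weight=class_weight),
    "random_forest": RandomForestClassifier(
        n_estimators=50, class_weight=class_weight, random_state=0
    ),
}
for name, model in models.items():
    model.fit(X, y)

print(class_weight)
```

Passing this dictionary should be equivalent to `class_weight='balanced'` for these estimators, which would explain why computing the weights manually made no difference to the scores.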
I realized that this question arose from pure naivete. I resolved my problem by using the imbalanced-learn Python library. Algorithms like `imblearn.ensemble.EasyEnsembleClassifier` are a godsend for heavily imbalanced classification where the minority class is more important than the majority class. For anyone having similar trouble, I suggest looking beyond your usual favorite algorithms for one that is better suited to the problem.