SMOTE Algorithm and Classification: overrated prediction success


I'm facing a problem I can't find an answer to. I have a binary classification problem (output Y=0 or Y=1) where Y=1 is the minority class (Y=1 indicates default of a company, with proportion 0.02 in the original dataframe). I therefore oversampled with the SMOTE algorithm on my training set only (after splitting my dataframe into training and testing sets). I train a logistic regression on this training set (where the proportion of the "default" class is 0.3) and then look at the ROC curve and MSE to test whether my algorithm predicts default well. I get very good results in terms of both AUC (AUC=0.89) and MSE (MSE=0.06). However, when I look more precisely at individual predictions, I find that 20% of defaults are not predicted correctly.

Do you have a method to properly evaluate the quality of my predictions ("quality" meaning, for me, how well defaults are predicted)? I thought AUC was a good criterion... Do you also have a method to improve my regression? Thanks

There is 1 answer

RLave (best answer)

For every classification problem you can build a confusion matrix.

This is a two-way contingency table: it shows not only the true positives and true negatives (TP/TN), which are your correct predictions, but also the false positives (FP) and false negatives (FN), and those errors are usually your real interest.

FP and FN are the errors your model makes. You can track how well your model detects the positives with sensitivity (TP / (TP + FN), i.e. 1 - the false negative rate) and the negatives with specificity (TN / (TN + FP), i.e. 1 - the false positive rate) (link).
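As a concrete illustration, here is a minimal sketch in R; all counts below are made up, with default taken as the positive class (the 20 missed defaults mirror the 20% from the question):

```r
# Hypothetical confusion-matrix counts; default = positive class
TP <- 80   # defaults correctly predicted
FN <- 20   # defaults missed
TN <- 850  # non-defaults correctly predicted
FP <- 50   # non-defaults wrongly flagged as default

sensitivity <- TP / (TP + FN)  # true positive rate
specificity <- TN / (TN + FP)  # true negative rate
sensitivity  # 0.8, i.e. 20% of defaults missed
```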

Note that improving one usually lowers the other, so sometimes you have to decide which error matters more.

A good compromise is the F1-score, the harmonic mean of precision and recall (recall being the same thing as sensitivity).

So if you're more interested in defaults (let's say defaults = positive class), you'll prefer a model with higher sensitivity. But remember not to neglect specificity completely either.
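Continuing the sketch above (same hypothetical counts, default = positive class), precision, recall, and F1 can be computed directly:

```r
# Hypothetical counts; default = positive class
TP <- 80; FP <- 50; FN <- 20

precision <- TP / (TP + FP)  # of predicted defaults, how many are real
recall    <- TP / (TP + FN)  # sensitivity: share of true defaults caught
f1 <- 2 * precision * recall / (precision + recall)  # harmonic mean
round(f1, 3)  # 0.696
```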

Here is an example in R (note that `confusionMatrix(data, reference)` expects the predictions first and the true labels second):

# to get the confusion matrix and some metrics
# (predicted classes first, true classes second)
set.seed(1)
caret::confusionMatrix(sample(iris$Species), iris$Species)
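Since the question is a binary default problem, here is a sketch closer to that setting using only base R; the labels, counts, and the 20% miss rate are all made up for illustration:

```r
set.seed(42)
# 100 defaults ("1") and 900 non-defaults ("0") as ground truth
truth <- factor(c(rep(1, 100), rep(0, 900)), levels = c("1", "0"))

# A fake prediction vector that misses 20 defaults and raises 50 false alarms
pred <- truth
pred[sample(which(truth == "1"), 20)] <- "0"
pred[sample(which(truth == "0"), 50)] <- "1"

cm <- table(Predicted = pred, Actual = truth)
sensitivity <- cm["1", "1"] / sum(cm[, "1"])  # 0.8: 20% of defaults missed
specificity <- cm["0", "0"] / sum(cm[, "0"])
```

Despite a sensitivity of only 0.8, overall accuracy here would still look excellent because non-defaults dominate, which is exactly why the confusion matrix is more informative than a single aggregate score on imbalanced data.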