Optimize metrics for Fraud Detection Imbalanced Data


I need help improving my model's performance. As is typical for fraud detection, I have an imbalanced dataset (0.1/0.9). I would like to optimize recall for both target 1 and target 0: on one hand I want to avoid missing fraud, and on the other I want to limit the cost of flagging non-fraudulent clients as fraudulent, because 5% of those incorrectly classified clients would each decrease my revenue by €3K (while correctly flagging a fraudulent client would save me €1K of loss per customer detected).

My first question is: which metrics would you consider for this problem? I am mostly focused on recall, but I would like to hear your opinions.

Second question: How can I improve my model performance?

So far, the best results I have obtained without lowering the threshold are:

Accuracy: 0.89
Confusion Matrix:
[[3153  279]
 [ 145  297]]

Classification Report:

              precision    recall  f1-score   support

           0       0.96      0.92      0.94      3432
           1       0.52      0.67      0.58       442

    accuracy                           0.89      3874

If I lower the threshold to increase the recall of target 1, I get:

Accuracy: 0.61
Confusion Matrix:
[[1959 1473]
 [  42  400]]

Classification Report:

              precision    recall  f1-score   support

           0       0.98      0.57      0.72      3432
           1       0.21      0.90      0.35       442

    accuracy                           0.61      3874
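One way to compare the two thresholds in business terms, rather than on recall alone, is to price each confusion matrix with the figures stated above. The following is only a sketch: the constant names and the `net_value` helper are hypothetical, and it assumes that 5% of false positives each cost €3K, that each true positive saves €1K, and that the "1k" is in euros. None of these assumptions is confirmed beyond the wording of the question.

```python
# Cost figures taken from the question; how they combine is an assumption.
FP_COST_EUR = 3_000       # revenue lost per affected non-fraudulent client
FP_AFFECTED_RATE = 0.05   # share of false positives that actually incur the cost
TP_SAVING_EUR = 1_000     # loss avoided per correctly detected fraud


def net_value(confusion):
    """Return the net value in EUR for a 2x2 matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = confusion
    savings = tp * TP_SAVING_EUR
    losses = fp * FP_AFFECTED_RATE * FP_COST_EUR
    return savings - losses


# The two confusion matrices reported above.
default_threshold = [[3153, 279], [145, 297]]
lowered_threshold = [[1959, 1473], [42, 400]]

print("default threshold:", net_value(default_threshold))
print("lowered threshold:", net_value(lowered_threshold))
```

Under these assumptions the default threshold comes out at roughly €255K against roughly €179K for the lowered one, which suggests that a cost-based score may be a more direct target than recall for either class.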

I tried several models: Linear Regression, XGBoost, Random Forest, and SVM.

I also tried over/undersampling techniques (applied only to the training set): RandomOverSampling, RandomUnderSampling, and SMOTE.

Do you have any other advice?


There is 1 answer

Answer by Muhammed Yunus

LogisticRegression would be more suited to this classification problem than LinearRegression, so it's worth a try if you haven't already.
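As a minimal sketch of what trying LogisticRegression could look like: the data here is synthetic (generated with `make_classification` at roughly 10% positives) as a stand-in for the real dataset, and `class_weight="balanced"` is an optional setting added for illustration rather than something the question or this answer prescribes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: roughly 10% positives (assumption).
X, y = make_classification(
    n_samples=4000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# class_weight="balanced" reweights errors inversely to class frequency,
# an alternative to resampling the training set.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

# Report recall separately for each class, as the question asks about both.
y_pred = model.predict(X_test)
recall_fraud = recall_score(y_test, y_pred, pos_label=1)
recall_legit = recall_score(y_test, y_pred, pos_label=0)
print(f"recall (class 1): {recall_fraud:.2f}, recall (class 0): {recall_legit:.2f}")
```

The numbers printed for synthetic data say nothing about the real problem; the point is only the shape of the workflow.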

The ROC AUC metric summarises the trade-off between recall (the true positive rate) and the false positive rate across all thresholds. An ideal ROC AUC of 1.0 would correspond to a scenario where you achieve both perfect recall and no false positives. sklearn has some weighted variants of this metric; for example, `roc_auc_score` accepts a `sample_weight` argument. This provides a way of scoring a model after it has been trained.
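To illustrate scoring with this metric, here is a small sketch using `roc_auc_score` from `sklearn.metrics`. The labels, scores, and weights are made up for the example; in practice `y_score` would come from something like `model.predict_proba(X_test)[:, 1]`.

```python
from sklearn.metrics import roc_auc_score

# Toy labels and predicted fraud scores (illustrative values only).
y_true = [0, 0, 0, 0, 1, 1]
y_score = [0.1, 0.3, 0.2, 0.7, 0.6, 0.9]

# Plain ROC AUC: the fraction of (fraud, non-fraud) pairs ranked correctly.
auc = roc_auc_score(y_true, y_score)

# Weighted variant: the fourth sample, a non-fraud case with a high score,
# counts three times as much, so misranking it is penalised more.
weights = [1, 1, 1, 3, 1, 1]
weighted_auc = roc_auc_score(y_true, y_score, sample_weight=weights)

print(f"ROC AUC: {auc:.3f}, weighted ROC AUC: {weighted_auc:.3f}")
```

Weights like these are one way to reflect that some mistakes are more expensive than others, though how to set them from the business costs is a judgment call.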

Note that you can't use this type of metric to directly optimise the model in sklearn; you'd need to switch to PyTorch or similar and use a custom loss function.