I need your help to improve my model's performance. As is common in fraud detection, I have an imbalanced dataset (0.1/0.9). I would like to optimise recall for both target 1 and target 0: on one hand I want to catch fraud, and on the other I want to limit the cost of flagging non-fraudulent clients as fraudulent, because 5% of the incorrectly classified clients would each decrease my revenue by €3K (while each correctly detected fraudulent customer saves me €1K of loss).
First question: which metrics would you consider for this problem? I am mostly focused on recall, but I would like to read your opinions.
Second question: How can I improve my model performance?
So far, the best results I got without lowering the threshold are:
Accuracy: 0.89
Confusion Matrix:
[[3153  279]
 [ 145  297]]
Classification Report:
              precision    recall  f1-score   support
           0       0.96      0.92      0.94      3432
           1       0.52      0.67      0.58       442
    accuracy                           0.89      3874
while if I lower the threshold to increase the recall of target 1:
Accuracy: 0.61
Confusion Matrix:
[[1959 1473]
 [  42  400]]
Classification Report:
              precision    recall  f1-score   support
           0       0.98      0.57      0.72      3432
           1       0.21      0.90      0.35       442
    accuracy                           0.61      3874
I tried several models: Linear Regression, XGBoost, Random Forest, and SVM.
I also tried over/undersampling techniques (only on the train set): RandomOverSampling, RandomUnderSampling, SMOTE.
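For reference, a minimal sketch of random oversampling restricted to the training split, written with plain sklearn/numpy (imbalanced-learn's RandomOverSampler and SMOTE do the same job more conveniently). All names and parameters here are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic 90/10 imbalanced dataset standing in for the real one
X, y = make_classification(n_samples=4000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Duplicate minority-class rows in the TRAIN set until the classes balance;
# the test set is left untouched so the evaluation stays honest.
rng = np.random.default_rng(0)
minority = np.where(y_tr == 1)[0]
majority = np.where(y_tr == 0)[0]
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
idx = np.concatenate([majority, minority, extra])

clf = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])
print(recall_score(y_te, clf.predict(X_te)))
```

Resampling only the train set, as you did, is the right call: oversampling before the split leaks duplicated minority rows into the test set and inflates the scores.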
Do you have any other advice?
LogisticRegression would be better suited to this classification problem than LinearRegression, so it's worth a try if you haven't already. The ROC AUC metric summarises both recall and false positives: an ideal score of 1.0 corresponds to a scenario where you achieve both perfect recall and no false positives.
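A short sketch of scoring a fitted model with ROC AUC; the dataset and model are illustrative stand-ins:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 90/10 imbalanced dataset standing in for the real one
X, y = make_classification(n_samples=4000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# ROC AUC is threshold-independent: it ranks the probabilities rather than
# judging hard predictions at a single cut-off, so it complements the
# per-threshold confusion matrices above.
print(roc_auc_score(y_te, proba))
```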
sklearn has weighted variants of the ROC AUC metric (e.g. via the sample_weight argument of roc_auc_score). This gives you a way of scoring a model after it has been trained. Note that you can't use this type of metric to directly optimise the model in sklearn; for that you'd need to switch to PyTorch or similar and use a custom loss function.
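Since you have explicit money values, you could also pick the threshold by expected net value rather than raw recall. The sketch below reads the question's figures as: roughly 5% of wrongly flagged clients cost €3K each (so about €150 expected cost per false positive), and each caught fraud saves €1K. That interpretation is an assumption on my part, and the data/model are illustrative; plug in your real costs and probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic 90/10 imbalanced dataset standing in for the real one
X, y = make_classification(n_samples=4000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

def net_value(threshold):
    pred = proba >= threshold
    fp = np.sum(pred & (y_te == 0))  # non-fraud flagged as fraud
    tp = np.sum(pred & (y_te == 1))  # fraud caught
    # ASSUMED costs: 0.05 * 3000 = EUR 150 expected loss per false positive,
    # EUR 1000 saved per true positive -- replace with your real figures.
    return 1000 * tp - 150 * fp

thresholds = np.arange(5, 100, 5) / 100  # 0.05, 0.10, ..., 0.95
best = max(thresholds, key=net_value)
print(best, net_value(best))
```

This turns the threshold choice into a direct business decision instead of an argument between recall and precision.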