I have an anomaly detection problem with a severe class imbalance between healthy and anomalous data (more than 20,000 healthy data points against fewer than 30 anomalies).

Currently, I use only precision, recall and F1 score to measure the performance of my model. I have no good method for setting the threshold parameter, but that is not the problem at the moment.

I want to measure whether the model is able to distinguish between the two classes independently of the threshold. I have read that the ROC-AUC measure can be used if the data is unbalanced (https://medium.com/usf-msds/choosing-the-right-metric-for-evaluating-machine-learning-models-part-2-86d5649a5428). But with my data I get very high ROC-AUC scores (>0.97), even when the model assigns low scores to actual anomalies.
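To make concrete what a threshold-independent measure computes here: ROC-AUC equals the fraction of (anomaly, healthy) pairs in which the anomaly gets the higher score, with ties counted as half. Below is a minimal sketch of that pairwise view. The function name `roc_auc_pairwise` and the toy scores are made up for illustration and are not my real data.

```python
import numpy as np

def roc_auc_pairwise(healthy_scores, anomaly_scores):
    """ROC-AUC as the fraction of (anomaly, healthy) pairs ranked correctly.

    Hypothetical helper for illustration; ties count as half a correct pair.
    """
    healthy = np.asarray(healthy_scores, dtype=float)
    anomalies = np.asarray(anomaly_scores, dtype=float)
    # Score difference for every (anomaly, healthy) pair via broadcasting.
    diff = anomalies[:, None] - healthy[None, :]
    wins = np.count_nonzero(diff > 0)
    ties = np.count_nonzero(diff == 0)
    return (wins + 0.5 * ties) / diff.size

# Made-up toy scores, only to show the mechanics.
toy_healthy = [0.1, 0.4, 0.35, 0.8]
toy_anomalies = [0.9, 0.5]
toy_auc = roc_auc_pairwise(toy_healthy, toy_anomalies)
print(toy_auc)  # 7 of 8 pairs are ranked correctly -> 0.875
```

Because every anomaly is compared against every healthy point, a very large healthy class can dominate the result: an anomaly only needs to outscore most healthy points to contribute a value close to 1.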

Does someone know a better performance measure for this task, or should I stick with the ROC-AUC score?

Here is an example of my problem:

Consider a case with 20448 data points, 26 of which are anomalies. With my model I get the following anomaly scores for these anomalies:

[1.26146367, 1.90735495, 3.08136725, 1.35184909, 2.45533306,
   2.27591039, 2.5894709 , 1.8333928 , 2.19098432, 1.64351134,
   1.38457746, 1.87627623, 3.06143893, 2.95044859, 1.35565042,
   2.26926566, 1.59751463, 3.1462369 , 1.6684134 , 3.02167491,
   3.14508974, 1.0376038 , 1.86455995, 1.61870919, 1.35576177,

If I now count how many data points have a higher anomaly score than, for example, 1.38457746, I get 281 data points. That looks like bad performance from my perspective. But in the end the ROC-AUC score is still 0.976038.

len(np.where(scores > 1.38457746)[0]) # 281
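The arithmetic behind these numbers can be sketched as follows, using only the counts given above (20448 points, 26 anomalies, 281 points above the threshold). The variable names are just for illustration. It computes two bounds: the false positive rate if every flagged point were healthy, and the precision if all 26 anomalies were among the flagged points.

```python
n_total = 20448      # all data points in the example
n_anomalies = 26     # labeled anomalies
n_flagged = 281      # points with score > 1.38457746

n_healthy = n_total - n_anomalies  # 20422 healthy points

# Upper bound on the false positive rate at this threshold:
# assume every flagged point is healthy.
max_fpr = n_flagged / n_healthy

# Upper bound on the precision at this threshold:
# assume all 26 anomalies are among the flagged points.
max_precision = n_anomalies / n_flagged

print(f"FPR at most {max_fpr:.4f}")              # about 0.0138
print(f"precision at most {max_precision:.4f}")  # about 0.0925
```

So at this threshold at most about 1.4% of the healthy points are flagged, which is the quantity ROC-AUC looks at, while at most about 9% of the flagged points can be real anomalies. This does not explain the exact value 0.976038, but it suggests that the high ROC-AUC and the poor-looking count of 281 are not contradictory.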
