I have an imbalanced dataset that has two classes (+1, -1). The positives are only 7% of the dataset.
I want to classify using decision trees. I have tried downsampling the negatives to:
- The same size as the positives
- Double or triple the size of the positives
For all of them I got almost the same precision, but the recall of the positives was much better for the first sample (negatives the same size as the positives). I feel I'm missing something here, so what is bad about this sampling?
It is fairly common to downsample a dominant class.
But you need to make sure to solve your actual problem.
If you downsample your classes to a 1:1 ratio, certain evaluation metrics may look good, but does this still reflect reality? Your classifier is trained to predict positive in 50% of cases, but only 7% of your data are positive. If false positives cost you a lot of money, this can be a problem.
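
To make that concrete, here is a rough sketch of how precision shifts when the same classifier is scored on a balanced sample versus data with the real 7% positive rate. The true positive rate (0.8) and false positive rate (0.2) are made-up numbers for illustration, not results from your data, and the helper name `precision_at_prevalence` is my own. It also assumes the classifier's per-class error rates stay the same when the class ratio changes.

```python
def precision_at_prevalence(tpr, fpr, prevalence):
    """Expected precision given per-class rates and the share of positives.

    tpr: fraction of positives predicted positive (recall)
    fpr: fraction of negatives predicted positive
    prevalence: fraction of positives in the data being scored
    """
    true_pos = tpr * prevalence           # share of all cases that are true positives
    false_pos = fpr * (1.0 - prevalence)  # share of all cases that are false positives
    return true_pos / (true_pos + false_pos)


# Hypothetical classifier: catches 80% of positives, flags 20% of negatives.
tpr, fpr = 0.8, 0.2

balanced = precision_at_prevalence(tpr, fpr, 0.50)  # 1:1 downsampled evaluation set
real = precision_at_prevalence(tpr, fpr, 0.07)      # original 7% positive rate

print(f"precision on balanced sample: {balanced:.2f}")  # 0.80
print(f"precision at 7% positives:    {real:.2f}")      # 0.23
```

Recall does not depend on the class ratio, but precision does, so it is safer to measure precision on a held-out set that keeps the original distribution and to downsample only the training data.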