Imbalanced data, classification tree and SMOTE oversampling


I am trying to build a binary classification tree with the rpart package in R, but the overall accuracy of the model is suspiciously high (99.8%) and the tree is huge, with many splits.

Is this an indication of an overfitted model? Minimal cost-complexity pruning barely changed the tree: the pruned tree is not much different from the fully grown tree at cp = 0.
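To make the overfitting check concrete, here is a minimal sketch of comparing training accuracy with held-out accuracy and pruning at the cp value with the lowest cross-validated error. The data are simulated stand-ins (about 15% minority class), not my actual dataset, and the names `d`, `acc` and `n_leaves` are made up for illustration.

```r
library(rpart)

set.seed(1)
# Simulated stand-in data (an assumption, NOT the dataset from the question)
n <- 2000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- factor(ifelse(d$x1 + rnorm(n) > 1.5, "yes", "no"))

# Hold out 30% of the rows so accuracy can be checked on unseen data
idx <- sample(n, size = 0.7 * n)
train <- d[idx, ]
test <- d[-idx, ]

# Fully grown tree: cp = 0 switches off the complexity penalty
full <- rpart(y ~ ., data = train, method = "class",
              control = rpart.control(cp = 0, minsplit = 2))

# Hypothetical helper: share of correctly classified rows
acc <- function(fit, newdata) {
  mean(predict(fit, newdata, type = "class") == newdata$y)
}
train_acc <- acc(full, train)
test_acc <- acc(full, test)

# Prune at the cp with the lowest cross-validated error in the cp table
cp_tab <- full$cptable
best_cp <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]
pruned <- prune(full, cp = best_cp)

# Hypothetical helper: number of terminal nodes
n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")

print(c(train = train_acc, test = test_acc))
print(c(full = n_leaves(full), pruned = n_leaves(pruned)))
```

A large gap between training and held-out accuracy would point to overfitting. If the 99.8% was measured on the training data only, it may say little about how the tree generalises.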

If so, does this suggest that the dataset is imbalanced, and that I should oversample the minority class (~15%) using SMOTE?
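For illustration, here is a rough sketch of balancing the classes in base R. Note that this is plain random oversampling (duplicating minority rows), swapped in for SMOTE, which instead synthesises new minority points by interpolating between neighbours. The data and the names `d` and `balanced` are assumptions.

```r
set.seed(3)
# Simulated stand-in data (an assumption): roughly 15% "yes"
n <- 1000
d <- data.frame(x1 = rnorm(n))
d$y <- factor(ifelse(d$x1 + rnorm(n) > 1.5, "yes", "no"),
              levels = c("no", "yes"))

counts <- table(d$y)
minority <- names(counts)[which.min(counts)]

# Number of extra minority rows needed to match the majority class
n_extra <- max(counts) - min(counts)

# Random oversampling, not SMOTE: resample minority rows with replacement
minority_rows <- which(d$y == minority)
extra_rows <- sample(minority_rows, size = n_extra, replace = TRUE)
balanced <- d[c(seq_len(nrow(d)), extra_rows), ]

print(table(balanced$y))
```

Oversampling of this kind makes the dataset larger, not smaller. It would normally be applied to the training split only, so that duplicated rows do not leak into the evaluation data.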

Then again, how can one determine from the results of a CART model whether the dataset is imbalanced?
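As a sketch of what could be inspected here: the class proportions come directly from the response variable rather than from the tree, and a confusion matrix shows whether a high overall accuracy hides poor recall on the minority class. The data below are simulated stand-ins, and `class_share`, `conf_mat` and `recall_minority` are illustrative names.

```r
library(rpart)

set.seed(2)
# Simulated stand-in data (an assumption): roughly 15% "yes"
n <- 1000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- factor(ifelse(d$x1 + rnorm(n) > 1.5, "yes", "no"),
              levels = c("no", "yes"))

# Class proportions, read straight from the response
class_share <- prop.table(table(d$y))

fit <- rpart(y ~ ., data = d, method = "class")
pred <- predict(fit, d, type = "class")

# Rows are actual classes, columns are predicted classes
conf_mat <- table(actual = d$y, predicted = pred)

accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
# Share of actual "yes" rows that the tree labels "yes"
recall_minority <- conf_mat["yes", "yes"] / sum(conf_mat["yes", ])
# Accuracy of always predicting the majority class
baseline <- max(class_share)

print(class_share)
print(conf_mat)
print(c(accuracy = accuracy, baseline = baseline, recall = recall_minority))
```

Comparing `accuracy` with `baseline` is the useful part: with a 15% minority class, always predicting the majority class already gives about 85% accuracy.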

Finally, is it safe to say that a reduction in the size of the dataset is a reasonable sacrifice when using SMOTE to balance an imbalanced dataset?

Sorry for the many questions and thank you so much for your assistance.

