How to prune a classification decision tree based on classification thresholds


I'm using sklearn to try to train a binary classification decision tree to classify spam vs not spam. My classification threshold is 50% (i.e., I'll flag a message as spam if I think there's a 50%+ chance that it is). Assume the classes aren't imbalanced.

Imagine one branch of my tree has 5000 non-spam samples and 100 spam. The tree continues to split this down further: for example, leaf A has 1000 non-spam and 70 spam, and leaf B has 4000 non-spam and 30 spam. This split doesn't get pruned because it significantly reduces the Gini impurity, but based on my 50% classification threshold it doesn't actually change any predictions - everything will still be predicted as non-spam.

It feels like there should logically be some way of automatically pruning a classification tree based on a classification threshold, but other than manually inspecting the tree I can't think of how to do this, and I've been unable to turn up any solutions through Google. I could decrease max_depth or increase min_impurity_decrease, but both of those would penalise other branches by removing useful splits. For concreteness, the kind of post hoc collapse I have in mind is sketched below.
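Here is a rough sketch of what I mean (the function name is mine, and mutating the undocumented tree_.children_left / tree_.children_right arrays is a hack that may break between sklearn versions, not a supported API):

    import numpy as np

    def collapse_unanimous_subtrees(clf):
        # Collapse any subtree whose leaves all predict the same class at
        # the 50% (majority) threshold: such splits can never change a
        # prediction, however much they reduce the Gini impurity.
        # WARNING: mutates sklearn's internal Tree arrays (undocumented).
        tree = clf.tree_
        left, right = tree.children_left, tree.children_right

        def walk(node):
            # Return the set of classes predicted by this subtree's leaves.
            if left[node] == -1:  # -1 marks a leaf in sklearn's Tree
                return {int(np.argmax(tree.value[node]))}
            classes = walk(left[node]) | walk(right[node])
            if len(classes) == 1:
                # Every leaf below agrees: turn this node into a leaf.
                left[node] = -1
                right[node] = -1
            return classes

        walk(0)
        return clf

By construction this never changes what a fitted tree predicts at the 50% threshold; it only removes the prediction-irrelevant structure. But it's a manual post hoc hack, which is why I'm asking whether something built-in exists.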


1 Answer

Answered by Learning is a mess:

> This split doesn't get pruned because it significantly reduces the Gini impurity, but based on my 50% classification threshold it doesn't actually change any predictions - everything will still be predicted as non-spam.

This is incorrect. Imagine that a further split divides your 4000 non-spam / 30 spam node (= 4000/30) into two branches, one with 4000/0 and one with 0/30. The latter will then be predicted as spam at your 50% threshold. This example is deliberately cherry-picked (a minimal synthetic illustration is below), but you cannot rule such cases out; hence there is typically no stopping criterion based on class ratio (for very imbalanced datasets it would work poorly), and maximum depth or minimum Gini gain are the more common thresholds.
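For instance (the data and the single feature are made up purely to mirror the 4000/30 node):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Made-up data mirroring the 4000/30 node: a single synthetic feature
    # where x == 1 perfectly isolates the 30 spam samples.
    X = np.vstack([np.zeros((4000, 1)), np.ones((30, 1))])
    y = np.array([0] * 4000 + [1] * 30)  # 0 = non-spam, 1 = spam

    # On its own the 4000/30 node predicts non-spam, yet one more
    # split flips the minority branch to spam at the 50% threshold.
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y)
    print(stump.predict([[0], [1]]))  # [0 1] - the extra split changes a prediction

So a stopping rule of the form "terminate once one class holds a large enough majority" could lock in mistakes that a deeper split would have corrected.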