How to prune a classification decision tree based on classification thresholds


I'm using sklearn to try to train a binary classification decision tree to classify spam vs not spam. My classification threshold is 50% (i.e., I'll flag a message as spam if I think there's a 50%+ chance that it is). Assume the classes aren't imbalanced.

Imagine one branch of my tree has 5000 non-spam samples and 100 spam. The tree continues to split this down further: for example, leaf A has 1000 non-spam and 70 spam, and leaf B has 4000 non-spam and 30 spam. This split doesn't get pruned because it significantly reduces the Gini impurity, but based on my 50% classification threshold it doesn't actually change any predictions - everything will still be predicted as non-spam.

It feels like there should logically be some way of automatically pruning a classification tree based on a classification threshold, but other than manually inspecting the tree I can't think of how to do this, and I've been unable to turn up any solutions through Google. I could decrease max_depth or increase min_impurity_decrease, but both of those would penalise other branches by removing useful splits. For concreteness, the kind of post hoc collapse I have in mind is sketched below.
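Here is a rough sketch of what I mean (the function name is mine, and mutating the undocumented tree_.children_left / tree_.children_right arrays is a hack that may break between sklearn versions, not a supported API):

    import numpy as np

    def collapse_unanimous_subtrees(clf):
        # Collapse any subtree whose leaves all predict the same class at
        # the 50% (majority) threshold: such splits can never change a
        # prediction, however much they reduce the Gini impurity.
        # WARNING: mutates sklearn's internal Tree arrays (undocumented).
        tree = clf.tree_
        left, right = tree.children_left, tree.children_right

        def walk(node):
            # Return the set of classes predicted by this subtree's leaves.
            if left[node] == -1:  # -1 marks a leaf in sklearn's Tree
                return {int(np.argmax(tree.value[node]))}
            classes = walk(left[node]) | walk(right[node])
            if len(classes) == 1:
                # Every leaf below agrees: turn this node into a leaf.
                left[node] = -1
                right[node] = -1
            return classes

        walk(0)
        return clf

By construction this never changes what a fitted tree predicts at the 50% threshold; it only removes the prediction-irrelevant structure. But it's a manual post hoc hack, which is why I'm asking whether something built-in exists.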


1 Answer

Answered by Learning is a mess:

> This split doesn't get pruned because it significantly reduces the Gini impurity, but based on my 50% classification threshold it doesn't actually change any predictions - everything will still be predicted as non-spam.

This is incorrect. Imagine that a further split divides your 4000 non-spam / 30 spam node (= 4000/30) into two branches, one with 4000/0 and one with 0/30. The latter will then be predicted as spam at your 50% threshold. This example is deliberately cherry-picked (a minimal synthetic illustration is below), but you cannot rule such cases out; hence there is typically no stopping criterion based on class ratio (for very imbalanced datasets it would work poorly), and maximum depth or minimum Gini gain are the more common thresholds.
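For instance (the data and the single feature are made up purely to mirror the 4000/30 node):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Made-up data mirroring the 4000/30 node: a single synthetic feature
    # where x == 1 perfectly isolates the 30 spam samples.
    X = np.vstack([np.zeros((4000, 1)), np.ones((30, 1))])
    y = np.array([0] * 4000 + [1] * 30)  # 0 = non-spam, 1 = spam

    # On its own the 4000/30 node predicts non-spam, yet one more
    # split flips the minority branch to spam at the 50% threshold.
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y)
    print(stump.predict([[0], [1]]))  # [0 1] - the extra split changes a prediction

So a stopping rule of the form "terminate once one class holds a large enough majority" could lock in mistakes that a deeper split would have corrected.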