Are decision trees trying to maximize information gain or entropy?

I understand that decision trees try to place attributes with high entropy near the top of the tree. However, how does information gain play into this?

Information gain is defined as:

InformationGain = EntropyBefore - EntropyAfter

Does a decision tree try to place attributes with low information gain at the top of the tree? In other words, is entropy always maximized and information gain always minimized?

Sorry, I am just a bit confused. Thank you!

There are 2 answers

bogatron

It is the opposite. For a decision tree that uses Information Gain, the algorithm chooses the attribute that provides the greatest Information Gain (this is also the attribute that causes the greatest reduction in entropy).

Consider a simple two-class problem where you have an equal number of training observations from classes C_1 and C_2. In this case, you are starting with an entropy of 1.0 (because the probability of randomly drawing either class from the sample is 0.5). Now consider attribute A, which has values A_1 and A_2. Suppose also that A_1 and A_2 both correspond to equal probabilities (0.5) for the two classes:

P(C_1|A_1) = 0.5
P(C_2|A_1) = 0.5
P(C_1|A_2) = 0.5
P(C_2|A_2) = 0.5

There is no change in overall entropy for this attribute and therefore, the Information Gain is 0. Now consider attribute B, which has values B_1 and B_2 and suppose B will perfectly separate the classes:

P(C_1|B_1) = 0
P(C_2|B_1) = 1
P(C_1|B_2) = 1
P(C_2|B_2) = 0

Since B perfectly separates the classes, the entropy after splitting on B is 0 (i.e., the Information Gain is 1). So for this example, you would select attribute B as the root node (and there would be no need to select additional attributes since the data are already perfectly classified by B).
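
To make the arithmetic concrete, here is a small Python sketch that reproduces these numbers. The entropy and information_gain helpers are illustrative only, not from any particular library:

import math

def entropy(probs):
    # Shannon entropy (base 2) of a class-probability distribution
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(entropy_before, branches):
    # branches is a list of (weight, class_probabilities) pairs, where the
    # weight is the fraction of observations that fall into that branch
    entropy_after = sum(w * entropy(p) for w, p in branches)
    return entropy_before - entropy_after

entropy_before = entropy([0.5, 0.5])  # 1.0 for the balanced two-class sample

# Attribute A: both values leave the classes mixed 50/50
gain_A = information_gain(entropy_before, [(0.5, [0.5, 0.5]), (0.5, [0.5, 0.5])])

# Attribute B: each value contains only one class
gain_B = information_gain(entropy_before, [(0.5, [0.0, 1.0]), (0.5, [1.0, 0.0])])

print(gain_A, gain_B)  # 0.0 1.0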

Decision Tree algorithms are "greedy" in the sense that they always select the attribute that yields the greatest Information Gain for the current node (branch) being considered, without reconsidering that choice after subsequent child branches are added. So to answer your second question: the decision tree algorithm tries to place attributes with the greatest Information Gain near the root of the tree. Note that because of this greedy behavior, the algorithm will not necessarily yield the tree that provides the maximum possible overall reduction in entropy.

Novak

No, you are always placing nodes with high information gain at the top of the tree. But remember, this is a recursive algorithm.

If you have a table with (say) five attributes, then you first calculate the information gain of each of those five attributes and choose the one with the highest gain. At this point, you should think of your developing decision tree as having a root that tests that attribute, with children holding the sub-tables taken from the values of that attribute. E.g., if it is a binary attribute, you will have two children; each will have the four remaining attributes; one child will have all the elements for which the selected attribute is true, the other all the elements for which it is false.

Now for each of those child nodes, you again select the attribute with the highest information gain, recursing until you cannot continue (e.g., the node is pure or no attributes remain).
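
As a rough Python sketch of that recursion (the helper names and the row/label representation are my own assumptions: rows are dicts of attribute values and labels is a parallel list of classes):

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy (base 2) of a list of class labels
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    # entropy before the split minus the weighted entropy of the sub-tables
    before = entropy(labels)
    after = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        after += (len(subset) / len(labels)) * entropy(subset)
    return before - after

def build_tree(rows, labels, attrs):
    # ID3-style recursion: pick the attribute with the highest information
    # gain, split the table on its values, and recurse on each sub-table
    if len(set(labels)) == 1:          # pure node: nothing left to decide
        return labels[0]
    if not attrs:                      # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    tree = {best: {}}
    for value in set(row[best] for row in rows):
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = build_tree([rows[i] for i in keep],
                                       [labels[i] for i in keep],
                                       [a for a in attrs if a != best])
    return tree

Each recursive call only sees the rows that reached that branch, which is what makes the attribute choice greedy and local.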

In this way, you have a tree which is always telling you to make a decision based on a variable that is expected to give you the most information while taking into account decisions you have already made.