I understand that decision trees try to put classifiers with high entropy high on the decision tree. However, how does information gain play into this?
Information gain is defined as:
InformationGain = EntropyBefore - EntropyAfter
Does a decision tree try to place classifiers with low information gain at the top of the tree? So entropy is always maximized and information gain is always minimized?
Sorry, I am just a bit confused. Thank you!
It is the opposite. For a decision tree that uses Information Gain, the algorithm chooses the attribute that provides the greatest Information Gain (this is also the attribute that causes the greatest reduction in entropy).
Consider a simple two-class problem where you have an equal number of training observations from classes C_1 and C_2. In this case, you are starting with an entropy of 1.0 (because the probability of randomly drawing either class from the sample is 0.5). Now consider attribute A, which has values A_1 and A_2. Suppose also that A_1 and A_2 both correspond to equal probabilities (0.5) for the two classes:
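A quick sketch of that computation (this is my own illustrative example: a hypothetical sample of 8 observations, 4 from each class, split evenly by A):

```python
import math

def entropy(class_counts):
    """Shannon entropy (in bits) of a class distribution given as counts."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts if c > 0)

# Before any split: 4 observations of C_1 and 4 of C_2.
print(entropy([4, 4]))  # 1.0

# Splitting on A: each branch still holds a 50/50 class mix.
# A_1 branch: 2 of C_1, 2 of C_2; A_2 branch: 2 of C_1, 2 of C_2.
entropy_after_A = (4/8) * entropy([2, 2]) + (4/8) * entropy([2, 2])
print(entropy([4, 4]) - entropy_after_A)  # Information Gain for A: 0.0
```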
There is no change in overall entropy for this attribute and therefore, the Information Gain is 0. Now consider attribute B, which has values B_1 and B_2 and suppose B will perfectly separate the classes:
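Continuing the same hypothetical 8-observation sample, a perfect separator like B sends every C_1 observation down one branch and every C_2 observation down the other:

```python
import math

def entropy(class_counts):
    """Shannon entropy (in bits) of a class distribution given as counts."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts if c > 0)

# Splitting on B: all of C_1 goes to B_1, all of C_2 goes to B_2,
# so each branch is pure and has entropy 0.
entropy_after_B = (4/8) * entropy([4, 0]) + (4/8) * entropy([0, 4])
info_gain_B = entropy([4, 4]) - entropy_after_B
print(info_gain_B)  # 1.0
```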
Since B perfectly separates the classes, the entropy after splitting on B is 0, so the Information Gain is 1 - 0 = 1. For this example, you would therefore select attribute B as the root node (and there would be no need to select additional attributes, since the data are already perfectly classified by B).
Decision Tree algorithms are "greedy" in the sense that they always select the attribute that yields the greatest Information Gain for the current node (branch) being considered, without reconsidering that choice after subsequent child branches are added. So to answer your second question: the decision tree algorithm tries to place attributes with the greatest Information Gain near the root (top) of the tree. Note that, due to this greedy behavior, the algorithm will not necessarily yield the tree that provides the maximum possible overall reduction in entropy.
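That greedy step can be sketched as follows. This is a minimal illustration, not a full tree-building algorithm: `choose_root` and the toy attribute/label data are my own hypothetical names, and the code just picks the single attribute with the highest Information Gain, exactly as the example above describes.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(values, labels):
    """Entropy reduction from splitting `labels` by attribute `values`."""
    branches = {}
    for v, y in zip(values, labels):
        branches.setdefault(v, []).append(y)
    after = sum(len(b) / len(labels) * entropy(b) for b in branches.values())
    return entropy(labels) - after

def choose_root(attributes, labels):
    """Greedy step: pick the attribute with the highest Information Gain."""
    return max(attributes, key=lambda name: information_gain(attributes[name], labels))

# Hypothetical toy data matching the example above.
labels = ["C1", "C1", "C2", "C2"]
attributes = {
    "A": ["A1", "A2", "A1", "A2"],  # uninformative: Information Gain = 0
    "B": ["B1", "B1", "B2", "B2"],  # perfect separator: Information Gain = 1
}
print(choose_root(attributes, labels))  # B
```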