Create a decision tree from data


I'm trying to create a decision tree from data, for a guess-the-animal kind of game: the user answers yes/no questions and the program guesses the animal. This is for homework.

I don't know how to build the tree from the data. I have no way of knowing in advance what the root node should be, and the data will be different every time, so I can't build the tree by hand. My data looks like this:

Animal1: property1, property3, property5
Animal2: property2, property3, property5, property6
Animal3: property1, property6
etc.
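
In Python, for instance, this format maps naturally onto a dictionary from animal names to sets of properties (the names below are just the placeholders from the example):

```python
# Hypothetical encoding of the example data: each animal maps to the set
# of properties it has; a property that is absent counts as a "no" answer.
animals = {
    "Animal1": {"property1", "property3", "property5"},
    "Animal2": {"property2", "property3", "property5", "property6"},
    "Animal3": {"property1", "property6"},
}
```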

I searched Stack Overflow and found the ID3 and C4.5 algorithms, but I don't know if I should use them.

Can someone tell me which algorithm I should use to build a decision tree in this situation?


1 Answer

Answer by amit (accepted):

I searched Stack Overflow and found the ID3 and C4.5 algorithms, but I don't know if I should use them.

Yes, you should. They are very commonly used decision tree algorithms, and there are some nice open source implementations of them (Weka's J48, for example, is an implementation of C4.5).

If you need to implement something from scratch, a simple decision tree is fairly straightforward to build, and it is done recursively (a code sketch follows the steps below):

  1. Let the set of labeled samples be S, with the set of properties P = {p1, p2, ..., pk}.
  2. Choose a property pi.
  3. Split S into two sets, S1 and S2: S1 contains the samples that have pi, and S2 contains those that do not. Create two children for the current node, and move S1 and S2 to them, respectively.
  4. Repeat from step 2 with S' = S1 and S' = S2, for each subset that is not empty.
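
Here is a minimal sketch of those four steps in Python, assuming the dict-of-sets encoding of the data shown in the question. The property choice in step 2 is naive here (just the first unused property); a smarter choice is sketched after the pointers below.

```python
def build_tree(samples, properties):
    """samples: dict of animal name -> set of properties it has.
    properties: list of properties not yet asked about on this path."""
    # Step 1 is the input itself. Stop when one candidate remains
    # or there is nothing left to ask about.
    if len(samples) == 1 or not properties:
        return {"guess": sorted(samples)}            # leaf: remaining candidates
    prop, rest = properties[0], properties[1:]       # step 2: naive choice
    s1 = {a: p for a, p in samples.items() if prop in p}      # step 3: has prop
    s2 = {a: p for a, p in samples.items() if prop not in p}  # step 3: lacks prop
    if not s1 or not s2:                             # useless split: try next property
        return build_tree(samples, rest)
    return {"ask": prop,                             # internal node: a yes/no question
            "yes": build_tree(s1, rest),             # step 4: recurse on each
            "no": build_tree(s2, rest)}              # non-empty subset

def play(node):
    """Walk the tree, asking the user yes/no questions until a leaf."""
    while "ask" in node:
        answer = input("Does it have %s? (yes/no) " % node["ask"])
        node = node["yes"] if answer.strip().lower().startswith("y") else node["no"]
    print("My guess:", " or ".join(node["guess"]))
```

With the `animals` dict from the question, `play(build_tree(animals, sorted({p for ps in animals.values() for p in ps})))` would run one round of the game.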

Some pointers:

  • At each step you basically split the current data into two subsets: the samples that have pi and those that do not. You then create two new nodes, the current node's children, and repeat the process for each of them with its subset of the data.
  • A smart algorithm chooses the property pi (in step 2) so as to keep the tree as shallow as possible (finding the optimal tree is NP-hard, but there are greedy approaches, such as choosing the split that minimizes entropy; see the sketch after this list).
  • After the tree is built, it is usually pruned to avoid overfitting.
  • A simple extension of this idea trains multiple decision trees that work separately and combines their predictions. This is called a Random Forest, and it usually gets quite good results in practice.
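
To make the greedy, entropy-minimizing choice concrete, here is one way it could look. This is only a sketch of the core idea, not ID3/C4.5 itself; since every animal is effectively its own "class" in this game, minimizing entropy boils down to preferring the most even yes/no split:

```python
import math

def split_entropy(samples, prop):
    """Weighted entropy of the two subsets produced by splitting on prop,
    treating each animal as its own class (so the entropy of a subset of
    n distinct animals is log2(n))."""
    n = len(samples)
    n_yes = sum(1 for props in samples.values() if prop in props)
    n_no = n - n_yes

    def h(k):
        return (k / n) * math.log2(k) if k > 0 else 0.0

    return h(n_yes) + h(n_no)

def choose_property(samples, properties):
    """Greedy step 2: the property whose split leaves the least entropy,
    i.e. divides the remaining candidates most evenly."""
    return min(properties, key=lambda p: split_entropy(samples, p))
```

Swapping `choose_property` into `build_tree` in place of the naive `properties[0]` gives an ID3-flavored greedy learner that tends to halve the candidate set with every question.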