Building a classification tree with categorical variables using rpart

I have a data set with 14 features, a few of which are listed below. Sex and marital status are categorical variables.

height,sex,maritalStatus,age,edu,homeType

SEX
         1. Male
         2. Female

MARITAL STATUS
         1. Married
         2. Living together, not married
         3. Divorced or separated
         4. Widowed
         5. Single, never married

I am using the rpart library in R to build a classification tree with the following call:

rfit = rpart(homeType ~., data = trainingData, method = "class", cp = 0.0001)

This gives me a decision tree that does not treat sex and marital status as factors.

I am thinking of using as.factor for this:

sex = as.factor(trainingData$sex)
ms = as.factor(trainingData$maritalStatus)

But I am not sure how to pass this information to rpart. Since the data argument of rpart() takes the trainingData data frame, it will always use the values in that data frame. I am fairly new to R and would appreciate any help with this.

There are 2 answers

Jean V. Adams (best answer)

You could make the changes to the trainingData data frame directly, then run rpart().

trainingData$sex = as.factor(trainingData$sex)
trainingData$maritalStatus = as.factor(trainingData$maritalStatus)
rfit = rpart(homeType ~., data = trainingData, method = "class", cp = 0.0001)
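
To check that the conversion took effect, the same steps can be tried on a small made-up data set. This is only a sketch: the simulated values, the subset of columns, and the choice to convert homeType as well are assumptions for illustration, not part of the original data. The factor labels come from the code book in the question.

```r
library(rpart)

# Simulated stand-in for trainingData (assumption: the real data is not shown)
set.seed(42)
n <- 200
trainingData <- data.frame(
  height        = round(rnorm(n, mean = 170, sd = 10)),
  sex           = sample(1:2, n, replace = TRUE),
  maritalStatus = sample(1:5, n, replace = TRUE),
  age           = sample(18:80, n, replace = TRUE),
  edu           = sample(1:6, n, replace = TRUE),
  homeType      = sample(1:5, n, replace = TRUE)
)

# Convert the coded columns to factors inside the data frame,
# attaching the labels from the code book
trainingData$sex <- factor(trainingData$sex, levels = 1:2,
                           labels = c("Male", "Female"))
trainingData$maritalStatus <- factor(
  trainingData$maritalStatus, levels = 1:5,
  labels = c("Married", "Living together, not married",
             "Divorced or separated", "Widowed", "Single, never married")
)
# Assumption: the response is also a coded category, so make it a factor too
trainingData$homeType <- factor(trainingData$homeType)

# Confirm the column types before fitting
print(sapply(trainingData, class))

rfit <- rpart(homeType ~ ., data = trainingData, method = "class", cp = 0.0001)
print(rfit)
```

Once the columns are factors, splits on them are printed as sets of levels (for example a subset of marital statuses) rather than as numeric thresholds such as maritalStatus < 2.5.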

Jose Carlos Machicao Valencia

In practice you can transform any categorical value into an ordinal value, for instance 'Marital Status' into conditions 1, 2, 3, and so on. In general, though, you shouldn't make the transformation unless the values in between have a conceptual meaning. For example, if you cannot define what a marital status of 1.2 is, you shouldn't make the transformation. Instead, you can sometimes use a representative value, depending on the objective of your research. For instance, if you are trying to predict the type of home, the 'minimum degree of comfort' associated with each marital status is an ordinal value that can still be interpreted when it is, say, 1.2.
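
A representative value of this kind could be attached with a simple lookup. The sketch below is hypothetical throughout: the comfortScore numbers are invented placeholders with no empirical basis, and the variable names are illustrative.

```r
# Marital status codes as they might appear in the data (made-up sample)
maritalStatus <- c(1, 3, 5, 2, 4, 1)

# Hypothetical 'minimum degree of comfort' per status code.
# These numbers are placeholders only; real values would have to come
# from the research question and domain knowledge.
comfortScore <- c("1" = 4.0, "2" = 3.5, "3" = 2.0, "4" = 2.5, "5" = 1.0)

# Look up the score for each observation by its code
comfort <- unname(comfortScore[as.character(maritalStatus)])
print(comfort)
```

Unlike the raw codes 1 to 5, a score like this has a defined order and meaningful in-between values, so treating it as numeric in a tree is defensible.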