Search for corresponding node in a regression tree using rpart


I'm pretty new to R and I'm stuck with a pretty dumb problem.

I'm calibrating a regression tree using the rpart package in order to do some classification and some forecasting.

Thanks to R the calibration part is easy to do and easy to control.

#the package rpart is needed
library(rpart)

# Loading of a big data file used for calibration
my_data <- read.csv("my_file.csv", sep=",", header=TRUE)

# Regression tree calibration
tree <- rpart(Ratio ~ Attribute1 + Attribute2 + Attribute3 + 
                      Attribute4 + Attribute5, 
                      method="anova", data=my_data, 
                      control=rpart.control(minsplit=100, cp=0.0001))

After calibrating a big decision tree, I want to find, for each element of a new data sample, the leaf (cluster) it falls into, and thus its forecasted value.
The predict function seems perfect for this.

# read validation data
validationData <-read.csv("my_sample.csv", sep=",", header=TRUE)

# search for the probability in the tree
predict <- predict(tree, newdata=validationData, class="prob")

# dump them in a file
write.table(predict, file="dump.txt") 

However, with the predict method I only get the forecasted ratio of my new elements, and I can't find a way to get the decision tree leaf each new element belongs to.

I think it should be pretty easy to get since the predict method must have found that leaf in order to return the ratio.

Several values can be passed to the predict method through the class= argument, but for a regression tree they all seem to return the same thing (the value of the target attribute of the decision tree).

Does anyone know how to get the corresponding node in the decision tree?

Analyzing that node with the path.rpart method would help me understand the results.

There are 4 answers

DianaS
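  1. treeClust::rpart.predict.leaves(tree, validationData) returns, for each new observation, the row index of its leaf in tree$frame; pass type = "leaf" to get the actual node number instead.
  2. tree$where holds the same kind of row index for the training set; rownames(tree$frame)[tree$where] converts those indices to node numbers.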
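
The difference between a row index in the tree's frame and a node number can be checked with rpart alone. The following is a minimal sketch, assuming the car.test.frame dataset shipped with rpart as a stand-in for the asker's my_data; the variable names are illustrative only.

```r
library(rpart)

# Example regression tree; car.test.frame is a stand-in for my_data (assumption).
fit <- rpart(Mileage ~ Weight + HP, data = car.test.frame, method = "anova")

# fit$where holds, for each training row, the ROW INDEX of its leaf in fit$frame.
leaf_rows <- fit$where

# Node numbers are stored as the row names of fit$frame,
# so index into them to turn row indices into node numbers.
leaf_nodes <- as.numeric(rownames(fit$frame))[leaf_rows]

# Side-by-side view of both encodings for the first few observations.
head(data.frame(frame_row = leaf_rows, node = leaf_nodes))
```

The node column is what path.rpart expects in its nodes argument; the frame_row column is not.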

Benjamin

I think what you want is type="vector" instead of class="prob" (I don't think class is an accepted parameter of the predict method), as explained in the rpart docs:

If type="vector": vector of predicted responses. For regression trees this is the mean response at the node, for Poisson trees it is the estimated response rate, and for classification trees it is the predicted class (as a number).
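
To see what the quoted passage means in practice, here is a small sketch, again assuming car.test.frame as a stand-in dataset. For an anova tree, type="vector" appears to be what predict already uses by default, and an unknown argument such as class= is passed through `...` without effect, so all three calls below should give the same predicted means.

```r
library(rpart)

# Stand-in regression tree and "new" data (assumptions, not the asker's files).
fit <- rpart(Mileage ~ Weight + HP, data = car.test.frame, method = "anova")
new_cars <- car.test.frame[1:5, ]

p_default <- predict(fit, newdata = new_cars)
p_vector  <- predict(fit, newdata = new_cars, type = "vector")
p_class   <- predict(fit, newdata = new_cars, class = "prob")  # class= is not a predict.rpart argument

# All three are the mean response of the leaf each row falls into.
identical(p_default, p_vector)
identical(p_default, p_class)
```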

Heidi

You can use the partykit package:

library("rpart")
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

library("partykit")
fit.party <- as.party(fit)
predict(fit.party, newdata = kyphosis[1:4, ], type = "node")

For your example, just use

predict(as.party(tree), newdata = validationData, type = "node")

yuji

Benjamin's answer unfortunately doesn't work: type="vector" still returns the predicted values.

My solution is pretty kludgy, but I don't think there's a better way. The trick is to replace the predicted y values in the tree's frame with the corresponding node numbers.

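# Work on a copy so the original tree keeps its predicted values
tree2 <- tree
# Overwrite each node's predicted value with its node number
tree2$frame$yval <- as.numeric(rownames(tree2$frame))
nodes <- predict(tree2, newdata=validationData)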

Now the output of predict will be node numbers as opposed to predicted y values.

(One note: the above worked in my case where tree was a regression tree, not a classification tree. In the case of a classification tree, you probably need to omit as.numeric or replace it with as.factor.)
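
As a follow-up to the question's mention of path.rpart, the node numbers obtained this way can be fed straight into it. This is a sketch under assumptions: car.test.frame stands in for the real data, and fit_nodes, node_id and paths are illustrative names.

```r
library(rpart)

# Stand-in regression tree and "new" data (assumptions, not the asker's files).
fit <- rpart(Mileage ~ Weight + HP, data = car.test.frame, method = "anova")
new_cars <- car.test.frame[1:5, ]

# Copy the tree and overwrite each node's fitted value with its node number.
fit_nodes <- fit
fit_nodes$frame$yval <- as.numeric(rownames(fit_nodes$frame))

node_id <- predict(fit_nodes, newdata = new_cars)  # node number per row
y_hat   <- predict(fit, newdata = new_cars)        # usual predicted value per row

# Sanity check: the value stored for each returned node equals the prediction.
all.equal(unname(fit$frame[as.character(node_id), "yval"]), unname(y_hat))

# Split rules leading from the root to each distinct leaf that was hit.
paths <- path.rpart(fit, nodes = unique(node_id), print.it = FALSE)
print(paths)
```

The list returned by path.rpart is named by node number, so paths[[as.character(node_id[1])]] gives the rules for the first new observation.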