I am trying to use conditional inference trees (from the partykit package) as induction trees, whose purpose is merely to describe the data rather than to predict individual cases. According to Ritschard (here, here and there, for example), a measure of deviance can be estimated by cross-tabulating the observed and predicted distributions of the response variable against the profiles defined by the possible predictors, the so-called T and T̂ tables.
I would like to use the deviance and other derived statistics as goodness-of-fit (GOF) measures for objects produced by the ctree() function. I am just getting into this topic, and I would very much appreciate some input, such as a piece of R code or some orientation about the structure of ctree objects that could be involved in the coding.
I have thought that I could build both the target and predicted tables from scratch and then compute the deviance formula, but I confess I am not at all confident about how to proceed.
Thanks a lot beforehand!
Some background information first: We have discussed adding deviance() or logLik() methods for ctree objects. So far we haven't done so because conditional inference trees are not associated with a particular loss function or even likelihood. Instead, only the associations between response and partitioning variables are assessed by means of conditional inference tests, using certain influence and regressor transformations. However, for the default regression and classification cases, measures of deviance or log-likelihood can be a useful addition in practice, so maybe we will add these methods in future versions.

If you want trees associated with a formal deviance/likelihood, you may consider using the general mob() framework or the lmtree() and glmtree() convenience functions. If only partitioning variables are specified (and no further regressors to be used in every node), these often lead to trees very similar to those from ctree(). But then you can also use AIC() etc.

But to come back to your original question: You can compute the deviance/log-likelihood or other loss functions fairly easily if you look at the model response and the fitted response. Alternatively, you can extract a factor variable that indicates the terminal nodes and refit a linear or multinomial model. This will have the same fitted values but also supply deviance() and logLik(). Below, I illustrate this with the airct and irisct trees that you obtain when running example("ctree", package = "partykit").

Regression: The Gaussian deviance is simply the residual sum of squares:
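A sketch of that computation (the original code listing is not reproduced here, so this recreates the airct tree the same way the partykit example does, i.e. on airquality with missing Ozone values removed):

```r
library("partykit")

## Recreate the example tree, as in example("ctree", package = "partykit")
airq <- subset(airquality, !is.na(Ozone))
airct <- ctree(Ozone ~ ., data = airq)

## For a regression tree, predict() returns the fitted node means,
## so the Gaussian deviance is the residual sum of squares:
rss <- sum((airq$Ozone - predict(airct))^2)
rss
```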
The same can be obtained by re-fitting as a linear regression model:
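For instance, along these lines (again recreating the example tree so the snippet is self-contained):

```r
library("partykit")
airq <- subset(airquality, !is.na(Ozone))
airct <- ctree(Ozone ~ ., data = airq)

## Terminal node IDs as a factor, one level per leaf
node <- factor(predict(airct, type = "node"))

## A linear model with node membership as the only regressor has the
## same fitted values as the tree (the node means), but also supplies
## deviance() and logLik()
airlm <- lm(Ozone ~ node, data = airq)
deviance(airlm)  ## equals the residual sum of squares from the tree
logLik(airlm)
```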
Classification: The log-likelihood is simply the sum of the predicted log-probabilities at the observed classes. And the deviance is -2 times the log-likelihood:
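A sketch for the irisct tree (recreated here as in the package example; the cbind() indexing picks, for each observation, the predicted probability of its observed class, relying on the probability columns being in the order of the factor levels):

```r
library("partykit")
irisct <- ctree(Species ~ ., data = iris)

## Matrix of predicted class probabilities (one row per observation,
## one column per class)
prob <- predict(irisct, type = "prob")

## Log-likelihood: sum of log predicted probabilities at the observed
## classes; deviance is -2 times that
ll <- sum(log(prob[cbind(seq_len(nrow(iris)), iris$Species)]))
dev <- -2 * ll
c(logLik = ll, deviance = dev)
```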
Again, this can also be obtained by re-fitting as a multinomial model:
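Using nnet::multinom for the refit, for example (the multinomial model's fitted probabilities are the node-wise class proportions, so its deviance should agree with the tree-based computation up to convergence tolerance):

```r
library("partykit")
library("nnet")
irisct <- ctree(Species ~ ., data = iris)

## Terminal node membership as the only regressor
d <- data.frame(Species = iris$Species,
                node = factor(predict(irisct, type = "node")))
irismn <- multinom(Species ~ node, data = d, trace = FALSE)

deviance(irismn)  ## -2 times the log-likelihood, as above
logLik(irismn)
```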
See also the discussion in https://stats.stackexchange.com/questions/6581/what-is-deviance-specifically-in-cart-rpart for the connections between regression and classification trees and generalized linear models.