I want to train a random forest to make a categorical prediction. If I want to include a fixed set of independent variables in the prediction model (e.g. x1, x2, and x3 in Y~.+x1+x2+x3
), but exclude them from the set of independent variables (represented by . in the example) that can be used to partition the data/create branches/trees in the forest, is there a simple way to do this using caret
, grf
, or another package in R?
Here's an example: If I wanted to predict which flowers had sepal width over 3.2 in the iris dataset, but I wanted to condition on flower species when deciding whether to create a new branch while excluding flower species as a possible variable to split on. Imagine that I know that flower species is a good predictor of sepal width, but I want to know what other factors predict sepal width, conditional on species:
data(iris)
d <- iris
d$sepal_width_over3point2<-as.factor(d$Sepal.Width>3.2)
d$Type1<-as.numeric(d$Species=='versicolor')
d$Type2<-as.numeric(d$Species=='virginica')
d$Type3<-as.numeric(d$Species=='setosa')
d<-subset(d,select=-c(Species,Sepal.Width))
## Set parameters to train models
# Run algorithms using 10-fold cross validation
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
# Random Forest
set.seed(11)
rf <- train(sepal_width_over3point2~.+Type1+Type2+Type3, data=d, method="rf", metric=metric, trControl=control)
print(rf)
example_varImp_rf<-varImp(rf)
When I look at the variable importance in this model, I'd like to know that the estimates for the other parameters (Sepal.length, Petal.length, and Petal.width) are conditional on flower Type1, Type2, and Type3, but exclude these variables as possible variables to branch on. Is there a way to tell the random forest to ignore these three variables as possible splits?
That would require your node splits to have one threshold for each flower species, which would be more computationally expensive than most tree learners. I don't know of any package that implements this.
One possible workaround is to do some feature engineering. In this case, where your condition on is a smallish categorical, you could standardize each feature relative to their flower species, so that a split would be something like "sepal length is at least 20% higher than species average" or "sepal length is at least one (species) standard deviation higher than species average."