I would like to apply the Random Forest algorithm from the package mlr
to a data set. This is the Zoo
dataset from the package mlbench
.
data(Zoo, package = "mlbench")
zooTib <- as_tibble(Zoo)
zooTib <- mutate_if(zooTib, is.logical, as.factor)
But before that I have introduced random NAs, only the target variable type
I have left complete.
zooTibOrig <- zooTib
zooTib <- apply (zooTib[,1:ncol(zooTib)-1], 2, function(x) {x[sample( c(1:nrow(zooTib)), floor(nrow(zooTib)/10))] <- NA; x} )
zooTib <- cbind(zooTib, zooTibOrig[,ncol(zooTibOrig)])
zooTib
Before testing the random forest algorithm, I ran the zoo dataset with the NAs through a simple RPART decision tree algorithm. This has the possibility to process datasets with NAs due to its parameters "maxsurrogate" or "usesurroagte". So I could pass the dataset without any problems and the code was executed without any problems.
Next I wanted to use the above mentioned random forest algorithm.
forest <- makeLearner("classif.randomForest")
forestParamSpace <- makeParamSet(makeIntegerParam("ntree", lower = 300, upper = 300), makeIntegerParam("mtry", lower = 6, upper = 12), makeIntegerParam("nodesize", lower = 1, upper = 5), makeIntegerParam("maxnodes", lower = 5, upper = 20))
randSearch <- makeTuneControlRandom(maxit = 100)
cvForTuning <- makeResampleDesc("CV", iters = 5)
tunedForestPars <- tuneParams(forest, task = zooTask,
resampling = cvForTuning,
par.set = forestParamSpace,
control = randSearch)
However, as soon as I wanted to run the parameter tuning process I got the error message:
"Error in checkLearnerBeforeTrain(task, learner, weights) : Task 'zooTib' has missing values in 'hair, feathers, eggs, milk, airborne, aquatic, ...', but learner 'classif.randomForest' does not support that!"
This is strange, since a Random Forest is "merely" an ensemble of several Decision Trees - which in turn can handle Missing Values. I acutally tried it before and the RPART algorithm worked perfectly fine.
I wanted to try to set the Surrogate Split parameter, but for a Random Forest this setting does not exist. When I execute the function getParamSet(forest)
unfortunately no surrogate splits appear there.
Is there a possibility to somehow pass records containing NAs to a random forest.