Apply the Random Forest Algorithm to a Dataset containing missing values


I would like to apply the random forest learner from the mlr package to a data set: the Zoo dataset from the mlbench package.

library(tibble)  # as_tibble()
library(dplyr)   # mutate_if()

data(Zoo, package = "mlbench")
zooTib <- as_tibble(Zoo)
zooTib <- mutate_if(zooTib, is.logical, as.factor)  # convert logical columns to factors

Before training, I introduced random NAs into every predictor; only the target variable type was left complete.

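zooTibOrig <- zooTib
nMissing <- floor(nrow(zooTib) / 10)     # number of NAs per column (about 10% of the rows)
predictors <- seq_len(ncol(zooTib) - 1)  # every column except the last one, the target `type`
# lapply() keeps each column's type; apply() would coerce the whole table to a character matrix
zooTib[predictors] <- lapply(zooTib[predictors], function(x) {
  x[sample(length(x), nMissing)] <- NA
  x
})
zooTib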

Before testing the random forest, I ran the Zoo dataset with the NAs through a simple rpart decision tree. rpart can process datasets with NAs by using surrogate splits, which are controlled by its parameters "maxsurrogate" and "usesurrogate". The dataset was accepted and the code ran without any problems.
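To make the rpart step concrete, here is a minimal sketch of what I mean. It rebuilds the data in base R so that it runs on its own; the seed, the fixed count of 10 NAs per column, and the names zoo, zooTaskNA, tree and treeModel are only for illustration. The values maxsurrogate = 5 and usesurrogate = 2 are, as far as I know, the rpart defaults.

```r
library(mlr)

# Rebuild the Zoo data with NAs in the predictors (logical columns become factors).
data(Zoo, package = "mlbench")
zoo <- Zoo
isLogical <- vapply(zoo, is.logical, logical(1))
zoo[isLogical] <- lapply(zoo[isLogical], as.factor)

set.seed(1)  # arbitrary seed, only for reproducibility
predictors <- setdiff(names(zoo), "type")
zoo[predictors] <- lapply(zoo[predictors], function(x) {
  x[sample(length(x), 10)] <- NA  # 10 random NAs per predictor column
  x
})

zooTaskNA <- makeClassifTask(data = zoo, target = "type")

# maxsurrogate: number of surrogate splits retained per node
# usesurrogate: how surrogates are used when the primary split variable is missing
tree <- makeLearner("classif.rpart", maxsurrogate = 5, usesurrogate = 2)

# Training runs through even though the predictors contain NAs.
treeModel <- train(tree, zooTaskNA)
getLearnerModel(treeModel)
```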

Next I wanted to use the above mentioned random forest algorithm.

library(mlr)
zooTask <- makeClassifTask(data = zooTib, target = "type")  # not shown originally; assumed from the task name in the error below
forest <- makeLearner("classif.randomForest")
forestParamSpace <- makeParamSet(
  makeIntegerParam("ntree", lower = 300, upper = 300),
  makeIntegerParam("mtry", lower = 6, upper = 12),
  makeIntegerParam("nodesize", lower = 1, upper = 5),
  makeIntegerParam("maxnodes", lower = 5, upper = 20))
randSearch <- makeTuneControlRandom(maxit = 100)
cvForTuning <- makeResampleDesc("CV", iters = 5)

tunedForestPars <- tuneParams(forest, task = zooTask,
                          resampling = cvForTuning,
                          par.set = forestParamSpace,
                          control = randSearch)

However, as soon as I wanted to run the parameter tuning process I got the error message:

"Error in checkLearnerBeforeTrain(task, learner, weights) : Task 'zooTib' has missing values in 'hair, feathers, eggs, milk, airborne, aquatic, ...', but learner 'classif.randomForest' does not support that!"

This is strange, since a random forest is "merely" an ensemble of several decision trees, which in turn can handle missing values. I actually tried it beforehand, and the rpart algorithm worked perfectly fine.

I tried to set a surrogate split parameter, but no such setting exists for the random forest learner: when I run getParamSet(forest), no surrogate-related parameters appear.
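Besides the parameter set, mlr also records what each learner supports as "properties", and as far as I understand this is what the error message refers to. The following sketch compares the two learners and lists the classification learners that declare support for missing values; the variable names tree and naLearners are only for illustration.

```r
library(mlr)

forest <- makeLearner("classif.randomForest")
tree <- makeLearner("classif.rpart")

# "missings" is the mlr property for learners that accept NAs in the features.
getLearnerProperties(forest)
hasLearnerProperties(forest, "missings")  # FALSE, consistent with the error above
hasLearnerProperties(tree, "missings")    # TRUE for rpart

# Classification learners that declare the "missings" property.
naLearners <- listLearners("classif", properties = "missings",
                           warn.missing.packages = FALSE)
naLearners$class
```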

Is there a way to pass records containing NAs to a random forest?
