Here is my example dataset:
library(caret)
# TEMP DATA
train_predictors <- matrix(data = c(1,2,
1,3,
2,4,
3,5,
4,6,
5,4,
6,5,
6,6,
7,7,
8,8), nrow = 10, ncol = 2, byrow = TRUE)  # values above are listed row by row
train_labels <- c(1,1,1,1,1,0,0,0,0,0)
test_predictors <- matrix(data = c(1,2), nrow = 1, ncol = 2)
# PREPROCESSING OF DATA
train_predictors <- as.data.frame(train_predictors)
test_predictors <- as.data.frame(test_predictors)
train_labels <- as.factor(train_labels)
And this is how I train a simple random forest on train_predictors and train_labels.
# APPLY SIMPLE RANDOM FOREST ON TRAIN DATA
my_train_control <- trainControl(method = "cv",
number = 2,
savePredictions = TRUE,
classProbs = TRUE)
rf_model <- train(x = train_predictors,
y = train_labels,
trControl = my_train_control,
tuneLength = 1)
This produces the following warning:
Warning message:
In train.default(x = train_predictors, y = train_labels, trControl = my_train_control, :
At least one of the class levels are not valid R variables names; This may cause errors if class probabilities are generated because the variables names will be converted to: X0, X1
But this is just because 0 and 1 are used as class labels: when creating columns in the predictions data frame, caret names them X0 and X1 instead of 0 and 1, as explained by Max Kuhn (topepo).
I am able to extract the class prediction for the test data point as follows:
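The renaming described in the warning can be reproduced in base R. This is a minimal sketch, assuming caret applies the same kind of conversion as make.names(); the variable names class_levels and converted are only for illustration.

```r
# Class labels as they appear after as.factor() on a 0/1 vector
class_levels <- c("0", "1")

# make.names() turns arbitrary strings into syntactically valid R names;
# names starting with a digit get an "X" prefix
converted <- make.names(class_levels)
print(converted)

# A level is a valid R name only if make.names() leaves it unchanged
print(class_levels == converted)
```

Neither "0" nor "1" survives the conversion unchanged, which matches the X0 and X1 names mentioned in the warning.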
prediction_class_on_test_data <- predict(rf_model, test_predictors)
prediction_class_on_test_data <- as.numeric(as.character(prediction_class_on_test_data))
But when I try to predict class probabilities for the test data point as follows:
prediction_prob_on_test_data <- predict(rf_model, test_predictors, type = "prob")
prediction_prob_on_test_data <- as.numeric(as.character(prediction_prob_on_test_data))
I get the following error:
Error in `[.data.frame`(out, , obsLevels, drop = FALSE) :
undefined columns selected
I am sure there is a simple mistake somewhere, but what am I doing wrong?
Update:
I am able to get class probabilities and predictions on the test dataset using the extractProb function as follows:
dummy_test_labels <- rep(0, nrow(test_predictors))
predictions_on_complete_data <- extractProb(models = list(rf_model), testX = test_predictors, testY = dummy_test_labels)
predictions_on_test_data <- predictions_on_complete_data[predictions_on_complete_data$dataType == "Test", ]
But I am still not sure why predict() is not working with type="prob".
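One thing that may be worth trying, given the warning above, is to relabel the classes with syntactically valid R names before training. The sketch below only shows the relabeling step in base R; the level names class0 and class1 are hypothetical, and it is an assumption (not confirmed here) that the invalid level names are what breaks predict() with type="prob".

```r
train_labels <- c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)

# Hypothetical level names; any syntactically valid R names should do
train_labels <- factor(train_labels,
                       levels = c(0, 1),
                       labels = c("class0", "class1"))

# These levels are left unchanged by make.names(), so no renaming is needed
stopifnot(identical(levels(train_labels), make.names(levels(train_labels))))

# With such labels, the calls from the question would be repeated unchanged:
# rf_model <- train(x = train_predictors, y = train_labels,
#                   trControl = my_train_control, tuneLength = 1)
# probs <- predict(rf_model, test_predictors, type = "prob")

# Mapping a predicted label back to the original 0/1 coding
predicted <- factor("class1", levels = levels(train_labels))
original <- as.numeric(sub("class", "", as.character(predicted)))
print(original)
```

If the relabeling helps, the probability columns would be expected to carry the new level names rather than X0 and X1.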