How to predict probabilities on a test dataset with R's caret package?

The following is my example dataset:

# TEMP DATA
train_predictors <- matrix(data = c(1,2,
                                    1,3,
                                    2,4,
                                    3,5,
                                    4,6,
                                    5,4,
                                    6,5,
                                    6,6,
                                    7,7,
                                    8,8), nrow = 10, ncol = 2, byrow = TRUE)  # fill row by row, matching the layout above

train_labels <- c(1,1,1,1,1,0,0,0,0,0)
test_predictors <- matrix(data = c(1,2), nrow = 1, ncol = 2)

# PREPROCESSING OF DATA
train_predictors <- as.data.frame(train_predictors)
test_predictors <- as.data.frame(test_predictors)
train_labels <- as.factor(train_labels)

And this is how I train a simple random forest on train_predictors and train_labels:

# APPLY SIMPLE RANDOM FOREST ON TRAIN DATA
library(caret)  # provides trainControl() and train()
my_train_control <- trainControl(method = "cv", 
                                 number = 2, 
                                 savePredictions = TRUE, 
                                 classProbs = TRUE)

rf_model <- train(x = train_predictors, 
                  y = train_labels, 
                  trControl = my_train_control, 
                  tuneLength = 1)

This produces the following warning:

Warning message:
In train.default(x = train_predictors, y = train_labels, trControl = my_train_control,  :
At least one of the class levels are not valid R variables names; This may cause errors if class probabilities are generated because the variables names will be converted to: X0, X1

This warning appears only because 0 and 1 are used as class labels: when caret creates the columns of the predictions data frame, it names them X0 and X1 instead of 0 and 1, as explained by Max Kuhn (topepo).
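
The renaming described in the warning can be reproduced with base R's make.names(), which converts syntactically invalid names into valid ones. This is only a small sketch: whether caret calls make.names() internally in exactly this way is an assumption, but the resulting names match the ones in the warning.

```r
# Class levels as they appear in train_labels
class_levels <- c("0", "1")

# make.names() prepends "X" to names that start with a digit
valid_levels <- make.names(class_levels)
print(valid_levels)  # "X0" "X1"
```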

I am able to extract the class prediction for the test data point as follows:

prediction_class_on_test_data <- predict(rf_model, test_predictors)
prediction_class_on_test_data <- as.numeric(as.character(prediction_class_on_test_data))

But when I try to predict class probabilities for the test data point as follows:

prediction_prob_on_test_data <- predict(rf_model, test_predictors, type = "prob")
prediction_prob_on_test_data <- as.numeric(as.character(prediction_prob_on_test_data))

I get the following error:

Error in `[.data.frame`(out, , obsLevels, drop = FALSE) : 
    undefined columns selected

I am sure there is a simple mistake somewhere, but what am I doing wrong?

Update:

I am able to get class probabilities and predictions on the test dataset using the extractProb function as follows:

dummy_test_labels <- rep(0, nrow(test_predictors))
predictions_on_complete_data <- extractProb(models = list(rf_model), testX = test_predictors, testY = dummy_test_labels)
predictions_on_test_data <- predictions_on_complete_data[predictions_on_complete_data$dataType == "Test", ]

But I am still not sure why predict() does not work with type = "prob".
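
One workaround worth trying, sketched here under the assumption that the error is caused by the invalid level names (0 and 1) flagged in the warning, is to give the factor syntactically valid levels before training. The level names "no" and "yes" are illustrative choices and not part of the original code, and the snippet assumes the caret and randomForest packages are installed.

```r
library(caret)

# Same example data as in the question, filled row by row
train_predictors <- as.data.frame(matrix(c(1,2, 1,3, 2,4, 3,5, 4,6,
                                           5,4, 6,5, 6,6, 7,7, 8,8),
                                         nrow = 10, ncol = 2, byrow = TRUE))
test_predictors <- as.data.frame(matrix(c(1, 2), nrow = 1, ncol = 2))

# Hypothetical labels: "yes" stands for the original 1, "no" for the original 0
train_labels <- factor(c(1,1,1,1,1,0,0,0,0,0),
                       levels = c(0, 1), labels = c("no", "yes"))

my_train_control <- trainControl(method = "cv", number = 2,
                                 savePredictions = TRUE, classProbs = TRUE)

set.seed(1)  # make the resampling and the forest reproducible
rf_model <- train(x = train_predictors, y = train_labels,
                  trControl = my_train_control, tuneLength = 1)

# One row per test point, one probability column per class level
prediction_prob_on_test_data <- predict(rf_model, test_predictors, type = "prob")
print(prediction_prob_on_test_data)
```

If the numeric 0/1 coding is needed afterwards, the "yes" column can be read as the probability of the original class 1.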
