R caretEnsemble CV length incorrect


I am trying to ensemble models using the caretEnsemble package in R. Here is a minimal reproducible example. Please let me know if any extra information is needed.

library(caret)
library(caretEnsemble)
library(xgboost)
library(plyr)


# Load iris data and convert to binary classification problem
data(iris)
data = iris
# With classProbs = TRUE and twoClassSummary, the outcome must be a factor with valid R level names
data$target = factor(ifelse(data$Species == "setosa", "setosa", "other"),
                     levels = c("setosa", "other"))
data = subset(data, select = -c(Species))

# Train control for models. 5 fold CV
set.seed(123)
index = createFolds(data$target, k = 5, returnTrain = FALSE)
myControl = trainControl(method='cv', number=5,
                          returnResamp='none', classProbs=TRUE,
                          returnData=FALSE, savePredictions=TRUE, 
                          verboseIter=FALSE, allowParallel=TRUE,
                          summaryFunction=twoClassSummary,
                          index=index)

# Layer 1 models
model1 = train(target ~ Sepal.Length, data = data, trControl = myControl,
               method = "glm", family = "binomial", metric = "ROC")
model2 = train(target ~ Sepal.Length, data = data, trControl = myControl,
               method = "xgbTree", metric = "ROC",
               tuneGrid = expand.grid(nrounds = 50, max_depth = 1, eta = 0.05,
                                      gamma = 0.5, colsample_bytree = 1,
                                      min_child_weight = 1, subsample = 1))

# Stack models
all.models <- list(model1, model2)
names(all.models) <- c("glm","xgb")
class(all.models) <- "caretList"

stacked <- caretStack(all.models, method = "glm", family = "binomial", metric = "ROC",
                      trControl = trainControl(method = 'cv', number = 5,
                                               returnResamp = 'none', classProbs = TRUE,
                                               returnData = FALSE, savePredictions = TRUE,
                                               verboseIter = FALSE, allowParallel = TRUE,
                                               summaryFunction = twoClassSummary))

stacked

This is the main output that concerns me.

A glm ensemble of 2 base models: glm, xgb

Ensemble results:
Generalized Linear Model 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 480, 480, 480, 480, 480 
Resampling results:

  ROC        Sens  Spec 
  0.9509688  0.92  0.835

My issue is that the base data set has 150 rows, so each fold of the 5-fold CV should contain 30 rows. If you look at "index" you'll see that this is the case. However, the resampling summary for "stacked" reports a sample size of 480 for each of the five folds. That is 480 * 5 = 2400 rows in total, 16 times larger than the original data set, and I have no idea why.
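One way to see where the extra rows come from is to look at the out-of-fold predictions caret saves on each base model (a quick diagnostic sketch; rowIndex and Resample are the columns train adds to $pred when savePredictions = TRUE):

# Number of saved out-of-fold predictions for a base model
nrow(model1$pred)                    # should equal 150 if each row is predicted once

# How often each original row appears across the saved resamples
table(table(model1$pred$rowIndex))   # every row should appear exactly once

# Size of the held-out set in each resample
table(model1$pred$Resample)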

My main questions are:
1) Is this number of observations in each fold correct?
2) If so, why is this happening?

1 Answer

user137698:

Figured out the issue, in case anyone else stumbles on this. The index I created with createFolds(..., returnTrain = FALSE) holds the out-of-sample (held-out) rows, so the trainControl should be:

myControl = trainControl(method='cv', number=5,
                          returnResamp='none', classProbs=TRUE,
                          returnData=FALSE, savePredictions=TRUE, 
                          verboseIter=FALSE, allowParallel=TRUE,
                          summaryFunction=twoClassSummary,
                          indexOut=index)

Instead of index= it should be indexOut=. Before, each fold was training on 20% of the data (the 30 rows in index) and predicting on the other 80% (120 rows), so every row was held out in four of the five folds. That gives 150 * 4 = 600 stacked rows, and 4/5 of 600 = 480 rows per resample of the meta-model, which explains the inflated sample sizes. Now that this option is properly set there is no overlap.
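For completeness, an equivalent fix is to keep index= but generate training-set indices instead, using returnTrain = TRUE in createFolds. A minimal sketch of both versions of the control object (folds_out, folds_in, ctrlA and ctrlB are just illustrative names; only the fold creation and the index/indexOut argument differ):

# Option A: held-out row indices, passed via indexOut (the fix above)
set.seed(123)
folds_out = createFolds(data$target, k = 5, returnTrain = FALSE)
ctrlA = trainControl(method = 'cv', number = 5, classProbs = TRUE,
                     savePredictions = TRUE, summaryFunction = twoClassSummary,
                     indexOut = folds_out)

# Option B: training row indices, passed via index
set.seed(123)
folds_in = createFolds(data$target, k = 5, returnTrain = TRUE)
ctrlB = trainControl(method = 'cv', number = 5, classProbs = TRUE,
                     savePredictions = TRUE, summaryFunction = twoClassSummary,
                     index = folds_in)

Either way, each row of the data is predicted exactly once across the five resamples, so the stacked meta-model is built on 150 rows rather than 600.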