I have been delving into the R package caret recently, and have a question about reproducibility and comparison of models during training that I haven't quite been able to pin down.

My intention is that each train call, and thus each resulting model, uses the same cross validation splits so that the initial stored results from the cross-validation are comparable from the out-of-sample estimations of the model that are calculated during building.

One method I've seen is that you can specify the seed prior to each train call as such:

set.seed(1)
model <- train(..., trControl = trainControl(...))
set.seed(1)
model2 <- train(..., trControl = trainControl(...))
set.seed(1)
model3 <- train(..., trControl = trainControl(...))

However, does sharing a trainControl object between the train calls mean that they are using the same resampling and indexes generally or whether I have to explicitly pass the seeds argument into the function. Does the train control object have random functions when it is used or are they set on declaration?

My current method has been:

set.seed(1)
train_control <- trainControl(method="cv", ...)
model1 <- train(..., trControl = train_control)
model2 <- train(..., trControl = train_control)
model3 <- train(..., trControl = train_control)

Are these train calls going to be using the same splits and be comparable, or do I have to take further steps to ensure that? i.e. specifying seeds when the trainControl object is made, or calling set.seed before each train? Or both?

Hopefully this has made some sense, and isn't a complete load of rubbish. Any help


My code project that I'm querying about can be found here. It might be easier to read it and you'll understand.

1

There are 1 answers

3
missuse On BEST ANSWER

The CV folds are not created during defining trainControl unless explicitly stated using index argument which I recommend. These can be created using one of the specialized caret functions:

createFolds
createMultiFolds
createTimeSlices
groupKFold

That being said, using a specific seed prior to trainControl definition will not result in the same CV folds.

Example:

library(caret)
library(tidyverse)

set.seed(1)
trControl = trainControl(method = "cv",
                         returnResamp = "final",
                         savePredictions = "final")

create two models:

knnFit1 <- train(iris[,1:4], iris[,5],
                 method = "knn",
                 preProcess = c("center", "scale"),
                 tuneLength = 10,
                 trControl = trControl)

ldaFit2 <- train(iris[,1:4], iris[,5],
                 method = "lda",
                 tuneLength = 10,
                 trControl = trControl)

check if the same indexes are in the same folds:

knnFit1$pred %>%
  left_join(ldaFit2$pred, by = "rowIndex") %>%
  mutate(same = Resample.x == Resample.y) %>%
  {all(.$same)}
#FALSE

If you set the same seed prior each train call

set.seed(1)
knnFit1 <- train(iris[,1:4], iris[,5],
                 method = "knn",
                 preProcess = c("center", "scale"),
                 tuneLength = 10,
                 trControl = trControl)

set.seed(1)
ldaFit2 <- train(iris[,1:4], iris[,5],
                 method = "lda",
                 tuneLength = 10,
                 trControl = trControl)


set.seed(1)
rangerFit3 <- train(iris[,1:4], iris[,5],
                 method = "ranger",
                 tuneLength = 10,
                 trControl = trControl)


knnFit1$pred %>%
  left_join(ldaFit2$pred, by = "rowIndex") %>%
  mutate(same = Resample.x == Resample.y) %>%
  {all(.$same)}

knnFit1$pred %>%
  left_join(rangerFit3$pred, by = "rowIndex") %>%
  mutate(same = Resample.x == Resample.y) %>%
  {all(.$same)}

the same indexes will be used in the folds. However I would not rely on this method when using parallel computation. Therefore in order to insure the same data splits are used it is best to define them manually using index/indexOut arguments to trainControl.

When you set the index argument manually the folds will be the same, however this does not ensure that models made by the same method will be the same, since most methods include some sort of stochastic process. So to be fully reproducible it is advisable to set the seed prior to each train call also. When run in parallel to get fully reproducible models the seeds argument to trainControl needs to be set.