I have two data frames, trainFin1 and trainFin2 (available here for reproducibility), both sampled from the same larger dataset.
I'm trying to run cross-validated rpart on them using caret, in parallel across multiple processors with the doSNOW package.
Interestingly, trainFin1 trained nicely across 4 processors (finishing in about 25 seconds). But trainFin2 seems to be stuck on only one processor (as observed in Windows Task Manager), and I never see it finish, even after almost half an hour.
My code is below:
require(caret)
require(rpart)
load("trainFin.RData")  # expected to provide trainFin1 and trainFin2

# 5-fold cross-validation, repeated 5 times
fitControl <- trainControl(method = "repeatedcv", number = 5, repeats = 5)

# set up parallel processing on 4 socket workers
require(doSNOW)
cl <- makeCluster(4, type = "SOCK")
registerDoSNOW(cl)

# train on each data set using the x/y (non-formula) interface
set.seed(12345)
firstSet <- train(x = trainFin1[, names(trainFin1) != "Happiness"],
                  y = trainFin1$Happiness,
                  method = "rpart2", trControl = fitControl)
set.seed(12345)
secondSet <- train(x = trainFin2[, names(trainFin2) != "Happiness"],
                   y = trainFin2$Happiness,
                   method = "rpart2", trControl = fitControl)

# shut down the workers
stopCluster(cl)
Note that I avoided the formula interface in train and instead fed it the raw data, to keep caret from converting my ordinal variables into dummy categorical variables (see the answer to this question). When I used a formula (i.e. train(Happiness ~ ., data = trainFin2, method = "rpart2", trControl = fitControl)), there seemed to be no issue with parallel processing. But I want to avoid the formula interface, as per the other question.
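To illustrate the conversion I mean, here is a minimal base-R sketch on made-up toy data (the data frame toy and its columns income and age are hypothetical, not from my data set). It assumes the formula interface builds its design matrix with something like model.matrix; I have not verified exactly what caret does internally.

```r
# Toy data (hypothetical): one ordered factor and one numeric predictor
toy <- data.frame(
  income = factor(c("low", "mid", "high", "mid"),
                  levels = c("low", "mid", "high"), ordered = TRUE),
  age = c(23, 35, 47, 51)
)

# x/y-style input: the ordered factor is passed through unchanged
x_raw <- toy
str(x_raw)

# Formula-style input (assumption: expansion similar to model.matrix):
# the single ordered column is replaced by numeric contrast columns
mm <- model.matrix(~ ., data = toy)
print(colnames(mm))
```

With the x/y form, rpart still sees income as a single ordered factor, which is the behaviour I want to keep.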
Any suggestions on how I can process this data in parallel without converting the predictors to categorical dummies?