I'm running a Naive Bayes model, and using the klaR
package directly is very fast, less than a second to compute on a standard laptop:
mod <- NaiveBayes(category ~ ., data=training, na.action = na.omit)
However, using the caret
package's train()
interface--which I thought was simply a wrapper for the above function--takes a very long time:
mod <- train(category ~ ., data=training, na.action = na.omit, method="nb")
I'm guessing this is because train
defaults to including some resampling. I tried adding trControl = trainControl(method = "none")
but received the following error:
Error in train.default(x, y, weights = w, ...) :
Only one model should be specified in tuneGrid with no resampling
Any ideas why this might occur or general thoughts on the speed difference between the two functions?
Also, is there any chance the speed difference is related to the formula interface? A few of my predictors are factors with over a hundred levels.
Because when you call
caret::train
without specifying any of trControl, tuneGrid, or tuneLength,
it defaults to running a grid search over the model's tuning parameters, and worse still, it builds that grid from the defaults for that particular model (NaiveBayes in this case).
And the default for
trainControl
is absolutely not what you want: method = "boot" with number = 25
, which means 25 full bootstrap passes over the data, plus saving intermediate results (returnData = TRUE
). So you override one bad default with
trControl = trainControl(method = "none")
, but that still leaves the grid search in place (tuneGrid = NULL, tuneLength = 3
), which is what triggers the error: with no resampling, train can only fit a single candidate model. You need to explicitly set/override those too (as @Khl4v already said in a comment).