Speed difference between caret and klaR packages, for Naive Bayes


I'm running a Naive Bayes model, and using the klaR package directly is very fast, less than a second to compute on a standard laptop:

mod <- NaiveBayes(category ~ ., data=training, na.action = na.omit)

However, using the train() interface from the caret package, which I thought was simply a wrapper for the function above, takes far longer:

mod <- train(category ~ ., data=training, na.action = na.omit, method="nb")

I'm guessing this is because train defaults to including some resampling. I tried adding trControl = trainControl(method = "none") but received the following error:

Error in train.default(x, y, weights = w, ...) : Only one model should be specified in tuneGrid with no resampling

Any ideas why this might occur or general thoughts on the speed difference between the two functions?

Also, is there any chance the speed difference is related to the formula interface? A few of my predictors are factors with over a hundred levels.

1 Answer

Answered by smci:

Because when you call caret::train without specifying any of trControl, tuneGrid, or tuneLength, it defaults to running a grid search over the model's tuning parameters:

trControl = trainControl(), tuneGrid = NULL, tuneLength = 3

... and worse still, it runs that grid search over the default tuning-parameter grid of the chosen model (NaiveBayes in this case).

And the default for trainControl is absolutely not what you want here: method = "boot" with number = 25 (25 full bootstrap resampling passes; the default is 10 for cross-validation methods), while also saving intermediate results (returnData = TRUE).

So overriding one bad default with trControl = trainControl(method = "none") still leaves the grid search active (tuneGrid = NULL, tuneLength = 3), which is why train complains that only one model may be specified when resampling is off. You need to set or override those explicitly as well.
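A minimal sketch of the combined fix, assuming the same training data frame as in the question: disable resampling and pin the grid to a single candidate model. On recent caret versions, the tuning parameters for method = "nb" are fL (Laplace smoothing), usekernel, and adjust.

```r
library(caret)

# With method = "none", caret requires exactly one candidate model,
# so supply a one-row tuneGrid instead of letting it build a grid.
mod <- train(category ~ ., data = training, na.action = na.omit,
             method   = "nb",
             trControl = trainControl(method = "none"),
             tuneGrid  = data.frame(fL = 0, usekernel = FALSE, adjust = 1))
```

With no resampling and a single parameter combination, train fits the klaR model once, which should bring the runtime close to calling NaiveBayes directly.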

(as @Khl4v already said in a comment)
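On the formula-interface question: train(category ~ ., ...) pushes the predictors through model.matrix, which expands every factor into dummy columns, so a few factors with a hundred-plus levels can inflate the design matrix considerably. The non-formula (x, y) interface passes the factors through untouched. A hedged sketch, assuming the outcome column is named category as in the question (note that na.action is a formula-interface argument, so NA rows are dropped by hand here):

```r
library(caret)

# Drop incomplete rows up front, since train(x, y) has no na.action argument.
complete <- training[complete.cases(training), ]

# Non-formula interface: factor predictors are handed to klaR as factors,
# with no model.matrix dummy-variable expansion.
x <- complete[, setdiff(names(complete), "category")]
y <- complete$category

mod <- train(x, y, method = "nb",
             trControl = trainControl(method = "none"),
             tuneGrid  = data.frame(fL = 0, usekernel = FALSE, adjust = 1))
```

If the timings differ noticeably between the two call styles, the formula expansion is a likely contributor on top of the resampling defaults discussed above.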