I wanted to evaluate the performance of several regression models and used the yardstick package to calculate the RMSE. Here is some example data:
model obs pred
1 A 1 1
2 B 1 2
3 C 1 3
When I run the following code
library(yardstick)
library(dplyr)
dat %>%
group_by(model) %>%
summarise(RMSE = yardstick::rmse(truth = obs, estimate = pred))
I get the following error
Error in summarise_impl(.data, dots) : no applicable method for 'rmse' applied to an object of class "c('double', 'numeric')".
However, when I explicitly supply . as the first argument (which should not be necessary, I thought), I get no error, but the results are incorrect.
dat %>%
group_by(model) %>%
summarise(RMSE = yardstick::rmse(., truth = obs, estimate = pred))
# A tibble: 3 x 2
model RMSE
<fctr> <dbl>
1 A 1.29
2 B 1.29
3 C 1.29
I was expecting the following
# A tibble: 3 x 2
model RMSE
<fctr> <dbl>
1 A 0
2 B 1.00
3 C 2.00
I know there are alternatives to this function, but I still don't understand this behavior.
data
dat <- structure(list(model = structure(1:3, .Label = c("A", "B", "C"), class = "factor"), obs = c(1, 1, 1), pred = 1:3), .Names = c("model", "obs", "pred"), row.names = c(NA, -3L), class = "data.frame")
We can use the do function to apply the rmse function to every group.
Or we can split the data frame and apply the rmse function.
Or we can nest the obs and pred columns into a list column and then apply the rmse function.
The outputs of these three methods are a little bit different, but all contain the right RMSE calculation. Here I use the microbenchmark package to conduct a performance evaluation.
The result shows that m2 is the fastest, while m1 is the slowest. I think the implication is that the do operation is usually slower than other methods, so if possible, we should avoid the do operation. Although m2 is the fastest, personally I like the syntax of m3 the best. The nested data frame will allow us to easily summarize information between different models or different groups.
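As a minimal sketch of what the three approaches could look like, assuming a yardstick version that exports the vector interface rmse_vec(truth, estimate) and that dplyr, tidyr and purrr are installed. The mapping of m1, m2 and m3 to do, split and nest is inferred from the text above, and the code is an illustration rather than the exact code that was benchmarked. The first computation also shows that the 1.29 from the question is consistent with the RMSE taken over all rows at once, which is what passing . inside summarise hands to rmse.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(yardstick)

# Same values as the example data in the question
dat <- data.frame(
  model = factor(c("A", "B", "C")),
  obs   = c(1, 1, 1),
  pred  = c(1, 2, 3)
)

# RMSE over the whole data frame, ignoring groups: sqrt((0 + 1 + 4) / 3)
overall <- rmse_vec(dat$obs, dat$pred)

# m1 (assumed): do() evaluates the expression once per group,
# and `.` is bound to the rows of that group only
m1 <- dat %>%
  group_by(model) %>%
  do(data.frame(RMSE = rmse_vec(.$obs, .$pred)))

# m2 (assumed): split into a list of data frames, one per model,
# then compute one number per piece (returns a named numeric vector)
m2 <- dat %>%
  split(.$model) %>%
  map_dbl(~ rmse_vec(.x$obs, .x$pred))

# m3 (assumed): nest obs and pred into a list column called `data`,
# then map over that column
m3 <- dat %>%
  group_by(model) %>%
  nest() %>%
  mutate(RMSE = map_dbl(data, ~ rmse_vec(.x$obs, .x$pred)))
```

In all three results the per-model RMSE is computed from that model's rows only, which is what the question expected. In a plain summarise call, summarise(RMSE = rmse_vec(obs, pred)) after group_by(model) should give the same numbers, because obs and pred are then evaluated per group.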