I wanted to evaluate the performance of several regression models and used the yardstick
package to calculate the RMSE. Here is some example data:
  model obs pred
1     A   1    1
2     B   1    2
3     C   1    3
When I run the following code:
library(yardstick)
library(dplyr)
dat %>%
  group_by(model) %>%
  summarise(RMSE = yardstick::rmse(truth = obs, estimate = pred))
I get the following error
Error in summarise_impl(.data, dots) : no applicable method for 'rmse' applied to an object of class "c('double', 'numeric')".
However, when I explicitly supply . as the first argument (which should not be necessary, I thought), I get no error, but the results are incorrect:
dat %>%
  group_by(model) %>%
  summarise(RMSE = yardstick::rmse(., truth = obs, estimate = pred))
# A tibble: 3 x 2
  model   RMSE
  <fctr> <dbl>
1 A       1.29
2 B       1.29
3 C       1.29
I was expecting the following:
# A tibble: 3 x 2
  model   RMSE
  <fctr> <dbl>
1 A       0
2 B       1.00
3 C       2.00
I know that there are alternatives to this function but still I don't understand this behavior.
data
dat <- structure(list(model = structure(1:3, .Label = c("A", "B", "C"), class = "factor"), obs = c(1, 1, 1), pred = 1:3), .Names = c("model", "obs", "pred"), row.names = c(NA, -3L), class = "data.frame")
We can use the do function to apply the rmse function to every group. Or we can split the data frame and apply the rmse function to each piece. Or we can nest the obs and pred columns into a list column and then apply the rmse function.

The outputs of these three methods are slightly different, but all contain the correct RMSE calculation. Here I use the microbenchmark package to conduct a performance evaluation.

The result shows that m2 (the split approach) is the fastest, while m1 (the do approach) is the slowest. I think the implication is that the do operation is usually slower than other methods, so if possible, we should avoid the do operation. Although m2 is the fastest, personally I like the syntax of m3 (the nest approach) the best. The nested data frame will allow us to easily summarize information between different models or different groups.