While testing the accuracy of some models using fable, I found an interesting behavior with fabletools::skill_score, which is described in the FPP3 book. If you calculate the test accuracy of a set of models that includes a NAIVE/SNAIVE model with skill_score(CRPS) and no transformation of the target variable, the NAIVE/SNAIVE model has a skill_score of 0. This matches the description in the FPP3 book:
the proportion that the ... method improves over the naïve method based on CRPS
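In other words, the skill score appears to be computed as 1 - CRPS_method / CRPS_naive, so the naïve benchmark itself should score exactly 0.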
However, if you transform the target variable somehow (e.g. log(x + 1)), the NAIVE/SNAIVE model does not have a skill_score of 0. This suggests to me that the skill_score function might not be honoring the transformation of the target variable. I looked at the source code and did not see any reference to transformations.
Is this the expected behavior of skill_score? If so, is there a way to carry the transformation over to skill_score? Or is skill_score not appropriate for models with transformed target variables?
This code replicates the expected behavior of skill_score on untransformed data:
library(fpp3)

# Re-index GOOG trading days to a regular daily index
google_stock <- gafa_stock |>
  filter(Symbol == "GOOG", year(Date) >= 2015) |>
  mutate(day = row_number()) |>
  update_tsibble(index = day, regular = TRUE)

google_stock |>
  autoplot(Close)

# Hold out the last 80% of observations as the test set
test <- google_stock |>
  slice_tail(prop = .8)
train <- google_stock |>
  anti_join(test)

fitted_model <- train |>
  model(
    Mean = MEAN(Close),
    `Naïve` = NAIVE(Close),
    Drift = NAIVE(Close ~ drift())
  )

goog_fc <- fitted_model |>
  forecast(h = 12)

fc_acc <- goog_fc |>
  accuracy(
    google_stock,
    measures = list(point_accuracy_measures,
                    distribution_accuracy_measures,
                    crps_skill = skill_score(CRPS))
  ) |>
  select(.model, .type, CRPS, crps_skill, RMSSE)
fc_acc
# A tibble: 3 × 5
  .model .type  CRPS crps_skill RMSSE
  <chr>  <chr> <dbl>      <dbl> <dbl>
1 Drift  Test   38.2     0.0955  5.09
2 Mean   Test  109.     -1.59   12.6
3 Naïve  Test   42.2     0       5.49
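As a rough check (assuming the skill score is the proportion described above), the reported crps_skill values can be reproduced from the CRPS column, up to rounding of the printed values:

1 - 38.2 / 42.2  # Drift: ~0.095
1 - 109 / 42.2   # Mean:  ~-1.58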
This code replicates the unexpected behavior on the same data, transformed with log(x + 1):
fitted_model_transformed <- train |>
  model(
    Mean = MEAN(log(Close + 1)),
    `Naïve` = NAIVE(log(Close + 1)),
    Drift = NAIVE(log(Close + 1) ~ drift())
  )

goog_fc_transformed <- fitted_model_transformed |>
  forecast(h = 12)

fc_acc_transformed <- goog_fc_transformed |>
  accuracy(
    google_stock,
    measures = list(point_accuracy_measures,
                    distribution_accuracy_measures,
                    crps_skill = skill_score(CRPS))
  ) |>
  select(.model, .type, CRPS, crps_skill, RMSSE)
fc_acc_transformed
# A tibble: 3 × 5
  .model .type  CRPS crps_skill RMSSE
  <chr>  <chr> <dbl>      <dbl> <dbl>
1 Drift  Test   36.3     0.140   4.97
2 Mean   Test  110.     -1.61   12.6
3 Naïve  Test   40.8     0.0353  5.42
I would expect the Naïve model crps_skill to be 0, because it cannot improve on itself.
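Backing out the implied benchmark CRPS from the reported skill scores (again assuming skill = 1 - CRPS / CRPS_benchmark) points to the untransformed naïve forecasts from the first fit:

36.3 / (1 - 0.140)   # Drift: ~42.2
40.8 / (1 - 0.0353)  # Naïve: ~42.3

Both are close to the untransformed Naïve CRPS of 42.2 in the first table, which supports the suspicion that the transformation is ignored when the benchmark is fit.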
> sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8 LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C LC_TIME=English_United States.utf8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] fable_0.3.3 feasts_0.3.1 fabletools_0.3.4 tsibbledata_0.4.1 tsibble_1.1.3 ggplot2_3.4.3 lubridate_1.9.2
[8] tidyr_1.3.0 dplyr_1.1.3 tibble_3.2.1 fpp3_0.5
loaded via a namespace (and not attached):
[1] rappdirs_0.3.3 plotly_4.10.2 utf8_1.2.4 generics_0.1.3 anytime_0.3.9 digest_0.6.33
[7] magrittr_2.0.3 grid_4.3.1 timechange_0.2.0 pkgload_1.3.2.1 fastmap_1.1.1 jsonlite_1.8.7
[13] modeldata_1.2.0 httr_1.4.7 purrr_1.0.2 fansi_1.0.5 viridisLite_0.4.2 scales_1.2.1
[19] numDeriv_2016.8-1.1 textshaping_0.3.6 lazyeval_0.2.2 cli_3.6.1 rlang_1.1.1 crayon_1.5.2
[25] ellipsis_0.3.2 munsell_0.5.0 withr_2.5.1 tools_4.3.1 colorspace_2.1-0 vctrs_0.6.4
[31] R6_2.5.1 lifecycle_1.0.3 htmlwidgets_1.6.2 ragg_1.2.5 pkgconfig_2.0.3 progressr_0.14.0
[37] pillar_1.9.0 gtable_0.3.4 rsconnect_1.1.0 data.table_1.14.8 glue_1.6.2 Rcpp_1.0.11
[43] systemfonts_1.0.4 tidyselect_1.2.0 rstudioapi_0.15.0 farver_2.1.1 htmltools_0.5.6 labeling_0.4.3
[49] compiler_4.3.1 distributional_0.3.2
You can use several different transformations in the same model() call, so it makes no sense for skill_score() to use a benchmark model with anything other than no transformation. Otherwise, the scores for different models could use different benchmarks. Consequently, the benchmark Naïve method must use the untransformed variable.
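If you do want a skill score relative to the naïve model on the transformed scale, one option is to compute it yourself from the CRPS column that accuracy() already returns (a minimal sketch using dplyr, not a fabletools feature; crps_skill_manual is just an illustrative name):

fc_acc_transformed |>
  mutate(crps_skill_manual = 1 - CRPS / CRPS[.model == "Naïve"])

By construction, the transformed Naïve model then scores exactly 0.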