This may be a usage misunderstanding, but I expect the following toy example to work. I want a lagged predictor in my recipe, but once I include it and try to predict on the same data using a workflow with that recipe, it doesn't recognize the column foo and cannot compute its lag.
Now, I can get this to work if I:
- Pull the fit out of the workflow that has been fit.
- Independently prep and bake the data I want to predict on.
I do this in the code after the failed workflow predict() call, and it succeeds. According to the documentation, I should be able to pass a fitted workflow directly to predict(): https://www.tidymodels.org/start/recipes/#predict-workflow
I am probably fundamentally misunderstanding how a workflow is supposed to operate. I have what I consider a workaround, but I do not understand why the failing statement doesn't behave the same way. I expected the workflow version to do under the covers what my workaround does explicitly.
In short, if work_df is a data frame, the_rec is a recipe based on work_df, rf_mod is a model, and you create the workflow rf_workflow, then should I expect the predict() function to work identically in the two predict() calls below?
## Workflow
rf_workflow <-
  workflow() %>%
  add_model(rf_mod) %>%
  add_recipe(the_rec)
## fit
rf_workflow_fit <-
  rf_workflow %>%
  fit(data = work_df)
## Predict with workflow. I expect since a workflow has a fit model and
## a recipe as part of it, it should know how to do the following:
predict(rf_workflow_fit, work_df)
#> Error: Problem with `mutate()` input `lag_1_foo`.
#> x object 'foo' not found
#> i Input `lag_1_foo` is `dplyr::lag(x = foo, n = 1L, default = NA)`.
## Predict by explicitly prepping and baking the data, and pulling out the
## fit from the workflow:
predict(
  rf_workflow_fit %>%
    pull_workflow_fit(),
  bake(prep(the_rec), work_df))
#> # A tibble: 995 x 1
#> .pred
#> <dbl>
#> 1 2.24
#> 2 0.595
#> 3 0.262
Full reprex example below.
library(tidymodels)
#> -- Attaching packages -------------------------------------------------------------------------------------- tidymodels 0.1.1 --
#> v broom 0.7.1 v recipes 0.1.13
#> v dials 0.0.9 v rsample 0.0.8
#> v dplyr 1.0.2 v tibble 3.0.3
#> v ggplot2 3.3.2 v tidyr 1.1.2
#> v infer 0.5.3 v tune 0.1.1
#> v modeldata 0.0.2 v workflows 0.2.1
#> v parsnip 0.1.3 v yardstick 0.0.7
#> v purrr 0.3.4
#> -- Conflicts ----------------------------------------------------------------------------------------- tidymodels_conflicts() --
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
#> x recipes::step() masks stats::step()
library(dplyr)
set.seed(123)
### Create autocorrelated time series: https://stats.stackexchange.com/a/29242/17203
work_df <-
  tibble(
    foo = stats::filter(rnorm(1000), filter = rep(1, 5), circular = TRUE) %>%
      as.numeric()
  )
# plot(work_df$foo)
work_df
#> # A tibble: 1,000 x 1
#> foo
#> <dbl>
#> 1 -0.00375
#> 2 0.589
#> 3 0.968
#> 4 3.24
#> 5 3.93
#> 6 1.11
#> 7 0.353
#> 8 -0.222
#> 9 -0.713
#> 10 -0.814
#> # ... with 990 more rows
## Recipe
the_rec <-
  recipe(foo ~ ., data = work_df) %>%
  step_lag(foo, lag = 1:5) %>%
  step_naomit(all_predictors())
the_rec %>% prep() %>% juice()
#> # A tibble: 995 x 6
#> foo lag_1_foo lag_2_foo lag_3_foo lag_4_foo lag_5_foo
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1.11 3.93 3.24 0.968 0.589 -0.00375
#> 2 0.353 1.11 3.93 3.24 0.968 0.589
#> 3 -0.222 0.353 1.11 3.93 3.24 0.968
#> 4 -0.713 -0.222 0.353 1.11 3.93 3.24
#> 5 -0.814 -0.713 -0.222 0.353 1.11 3.93
#> 6 0.852 -0.814 -0.713 -0.222 0.353 1.11
#> 7 1.65 0.852 -0.814 -0.713 -0.222 0.353
#> 8 1.54 1.65 0.852 -0.814 -0.713 -0.222
#> 9 2.10 1.54 1.65 0.852 -0.814 -0.713
#> 10 2.24 2.10 1.54 1.65 0.852 -0.814
#> # ... with 985 more rows
## Model
rf_mod <-
  rand_forest(
    mtry = 4,
    trees = 1000,
    min_n = 13) %>%
  set_mode("regression") %>%
  set_engine("ranger")
## Workflow
rf_workflow <-
  workflow() %>%
  add_model(rf_mod) %>%
  add_recipe(the_rec)
## fit
rf_workflow_fit <-
  rf_workflow %>%
  fit(data = work_df)
## Predict
predict(rf_workflow_fit, work_df)
#> Error: Problem with `mutate()` input `lag_1_foo`.
#> x object 'foo' not found
#> i Input `lag_1_foo` is `dplyr::lag(x = foo, n = 1L, default = NA)`.
## Perhaps I just need to pull off the fit and work with that?... Nope.
predict(
  rf_workflow_fit %>%
    pull_workflow_fit(),
  work_df)
#> Error: Can't subset columns that don't exist.
#> x Columns `lag_1_foo`, `lag_2_foo`, `lag_3_foo`, `lag_4_foo`, and `lag_5_foo` don't exist.
## Maybe I need to bake it first... and that works.
## But doesn't that defeat the purpose of a workflow?
predict(
  rf_workflow_fit %>%
    pull_workflow_fit(),
  bake(prep(the_rec), work_df))
#> # A tibble: 995 x 1
#> .pred
#> <dbl>
#> 1 2.24
#> 2 0.595
#> 3 0.262
#> 4 -0.977
#> 5 -1.24
#> 6 -0.140
#> 7 1.36
#> 8 1.30
#> 9 1.78
#> 10 2.42
#> # ... with 985 more rows
## Session info
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#> setting value
#> version R version 3.6.3 (2020-02-29)
#> os Windows 10 x64
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate English_United States.1252
#> ctype English_United States.1252
#> tz America/Chicago
#> date 2020-10-13
#>
#> - Packages -------------------------------------------------------------------
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.3)
#> backports 1.1.10 2020-09-15 [1] CRAN (R 3.6.3)
#> broom * 0.7.1 2020-10-02 [1] CRAN (R 3.6.3)
#> class 7.3-15 2019-01-01 [1] CRAN (R 3.6.3)
#> cli 2.0.2 2020-02-28 [1] CRAN (R 3.6.3)
#> codetools 0.2-16 2018-12-24 [1] CRAN (R 3.6.3)
#> colorspace 1.4-1 2019-03-18 [1] CRAN (R 3.6.3)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.3)
#> dials * 0.0.9 2020-09-16 [1] CRAN (R 3.6.3)
#> DiceDesign 1.8-1 2019-07-31 [1] CRAN (R 3.6.3)
#> digest 0.6.25 2020-02-23 [1] CRAN (R 3.6.3)
#> dplyr * 1.0.2 2020-08-18 [1] CRAN (R 3.6.3)
#> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 3.6.3)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.3)
#> fansi 0.4.1 2020-01-08 [1] CRAN (R 3.6.3)
#> foreach 1.5.0 2020-03-30 [1] CRAN (R 3.6.3)
#> furrr 0.1.0 2018-05-16 [1] CRAN (R 3.6.3)
#> future 1.19.1 2020-09-22 [1] CRAN (R 3.6.3)
#> generics 0.0.2 2018-11-29 [1] CRAN (R 3.6.3)
#> ggplot2 * 3.3.2 2020-06-19 [1] CRAN (R 3.6.3)
#> globals 0.13.0 2020-09-17 [1] CRAN (R 3.6.3)
#> glue 1.4.2 2020-08-27 [1] CRAN (R 3.6.3)
#> gower 0.2.2 2020-06-23 [1] CRAN (R 3.6.3)
#> GPfit 1.0-8 2019-02-08 [1] CRAN (R 3.6.3)
#> gtable 0.3.0 2019-03-25 [1] CRAN (R 3.6.3)
#> hardhat 0.1.4 2020-07-02 [1] CRAN (R 3.6.3)
#> highr 0.8 2019-03-20 [1] CRAN (R 3.6.3)
#> htmltools 0.5.0 2020-06-16 [1] CRAN (R 3.6.3)
#> infer * 0.5.3 2020-07-14 [1] CRAN (R 3.6.3)
#> ipred 0.9-9 2019-04-28 [1] CRAN (R 3.6.3)
#> iterators 1.0.12 2019-07-26 [1] CRAN (R 3.6.3)
#> knitr 1.30 2020-09-22 [1] CRAN (R 3.6.3)
#> lattice 0.20-38 2018-11-04 [1] CRAN (R 3.6.3)
#> lava 1.6.8 2020-09-26 [1] CRAN (R 3.6.3)
#> lhs 1.1.1 2020-10-05 [1] CRAN (R 3.6.3)
#> lifecycle 0.2.0 2020-03-06 [1] CRAN (R 3.6.3)
#> listenv 0.8.0 2019-12-05 [1] CRAN (R 3.6.3)
#> lubridate 1.7.9 2020-06-08 [1] CRAN (R 3.6.3)
#> magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.3)
#> MASS 7.3-51.5 2019-12-20 [1] CRAN (R 3.6.3)
#> Matrix 1.2-18 2019-11-27 [1] CRAN (R 3.6.3)
#> modeldata * 0.0.2 2020-06-22 [1] CRAN (R 3.6.3)
#> munsell 0.5.0 2018-06-12 [1] CRAN (R 3.6.3)
#> nnet 7.3-12 2016-02-02 [1] CRAN (R 3.6.3)
#> parsnip * 0.1.3 2020-08-04 [1] CRAN (R 3.6.3)
#> pillar 1.4.6 2020-07-10 [1] CRAN (R 3.6.3)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 3.6.3)
#> plyr 1.8.6 2020-03-03 [1] CRAN (R 3.6.3)
#> pROC 1.16.2 2020-03-19 [1] CRAN (R 3.6.3)
#> prodlim 2019.11.13 2019-11-17 [1] CRAN (R 3.6.3)
#> purrr * 0.3.4 2020-04-17 [1] CRAN (R 3.6.3)
#> R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.3)
#> ranger 0.12.1 2020-01-10 [1] CRAN (R 3.6.3)
#> Rcpp 1.0.5 2020-07-06 [1] CRAN (R 3.6.3)
#> recipes * 0.1.13 2020-06-23 [1] CRAN (R 3.6.3)
#> rlang 0.4.7 2020-07-09 [1] CRAN (R 3.6.3)
#> rmarkdown 2.4 2020-09-30 [1] CRAN (R 3.6.3)
#> rpart 4.1-15 2019-04-12 [1] CRAN (R 3.6.3)
#> rsample * 0.0.8 2020-09-23 [1] CRAN (R 3.6.3)
#> rstudioapi 0.11 2020-02-07 [1] CRAN (R 3.6.3)
#> scales * 1.1.1 2020-05-11 [1] CRAN (R 3.6.3)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.3)
#> stringi 1.5.3 2020-09-09 [1] CRAN (R 3.6.3)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 3.6.3)
#> survival 3.1-8 2019-12-03 [1] CRAN (R 3.6.3)
#> tibble * 3.0.3 2020-07-10 [1] CRAN (R 3.6.3)
#> tidymodels * 0.1.1 2020-07-14 [1] CRAN (R 3.6.3)
#> tidyr * 1.1.2 2020-08-27 [1] CRAN (R 3.6.3)
#> tidyselect 1.1.0 2020-05-11 [1] CRAN (R 3.6.3)
#> timeDate 3043.102 2018-02-21 [1] CRAN (R 3.6.3)
#> tune * 0.1.1 2020-07-08 [1] CRAN (R 3.6.3)
#> utf8 1.1.4 2018-05-24 [1] CRAN (R 3.6.3)
#> vctrs 0.3.4 2020-08-29 [1] CRAN (R 3.6.3)
#> withr 2.3.0 2020-09-22 [1] CRAN (R 3.6.3)
#> workflows * 0.2.1 2020-10-08 [1] CRAN (R 3.6.3)
#> xfun 0.18 2020-09-29 [1] CRAN (R 3.6.3)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 3.6.3)
#> yardstick * 0.0.7 2020-07-13 [1] CRAN (R 3.6.3)
#>
#> [1] C:/Users/IRINZN/Documents/R/R-3.6.3/library
Created on 2020-10-13 by the reprex package (v0.3.0)
The reason you are experiencing an error is that you have created a predictor variable from the outcome. When it comes time to predict on new data, the outcome is not available; we are predicting the outcome for new data, not assuming that it is there already.
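This can be checked on the recipe itself. The following is a minimal sketch, assuming the same work_df and the_rec as in the question. The new_data object and the tryCatch() wrapper are only for illustration. It looks at the role assigned to foo and then bakes data with the outcome column removed, which is roughly the situation the recipe is in at prediction time:

```r
library(tidymodels)

set.seed(123)
work_df <- tibble(
  foo = as.numeric(stats::filter(rnorm(1000), filter = rep(1, 5), circular = TRUE))
)

the_rec <- recipe(foo ~ ., data = work_df) %>%
  step_lag(foo, lag = 1:5) %>%
  step_naomit(all_predictors())

# foo is the only original column, and the formula gives it the outcome role
rec_roles <- summary(the_rec)
rec_roles$role

# Illustration only: drop the outcome, as if this were genuinely new data
new_data <- work_df %>% select(-foo)

# step_lag() needs foo, so baking is expected to fail; capture the condition
bake_result <- tryCatch(
  bake(prep(the_rec), new_data = new_data),
  error = function(e) e
)
inherits(bake_result, "error")
```

The exact error message may differ between versions of recipes, so the sketch only checks that an error condition is raised.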
This is a fairly strong assumption of the tidymodels framework, for either modeling or preprocessing, to protect against information leakage. You can read about this a bit more here.
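One way to stay within that assumption, offered as a sketch and not as the recommended tidymodels approach, is to compute the lagged columns before the recipe sees the data, so they are ordinary predictors. The names lagged_df, lag_rec, and lag_wf_fit are hypothetical; the model specification is copied from the question:

```r
library(tidymodels)

set.seed(123)
work_df <- tibble(
  foo = as.numeric(stats::filter(rnorm(1000), filter = rep(1, 5), circular = TRUE))
)

# Build lag columns outside the recipe so they are plain predictor columns
lagged_df <- work_df
for (k in 1:5) {
  lagged_df[[paste0("lag_", k, "_foo")]] <- dplyr::lag(work_df$foo, k)
}
# The first five rows have incomplete lags, so drop them
lagged_df <- tidyr::drop_na(lagged_df)

# The recipe no longer derives anything from the outcome
lag_rec <- recipe(foo ~ ., data = lagged_df)

rf_mod <- rand_forest(mtry = 4, trees = 1000, min_n = 13) %>%
  set_mode("regression") %>%
  set_engine("ranger")

lag_wf_fit <- workflow() %>%
  add_model(rf_mod) %>%
  add_recipe(lag_rec) %>%
  fit(data = lagged_df)

# Predicting through the workflow now works, with or without the outcome column
preds <- predict(lag_wf_fit, new_data = lagged_df)
preds_no_outcome <- predict(lag_wf_fit, new_data = select(lagged_df, -foo))
```

The lag columns are still built from observed values of foo, so when forecasting truly future rows they have to be filled in from whatever history is available at that point.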
You may already know about these, but if you are working with time series models, I'd suggest checking out these resources: