Recipes package: Use grepl to populate the model formula using recipes()

I am using the Ames Housing data and I want to include all the variables with the suffix "SF" in my recipe, so that I can apply step_pca() to the variables that are measured in square feet.

I used reformulate() to no avail:

SF <- reformulate(grep("SF", names(ames), value = TRUE), 
              response = 'Sale_Price')
simple_ames <- 
  recipe(SF + Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type + Latitude, 
                        data = ames_train) %>% 
  step_log(Gr_Liv_Area, base = 10) %>% 
  step_other(Neighborhood, threshold = 0.01) %>%
  step_dummy(all_nominal_predictors()) %>% 
  step_interact(~ Gr_Liv_Area:starts_with('Bldg_Type_')) %>% 
  step_ns(Latitude, deg_free = 20) %>% 
  step_pca(matches('(SF$)|(Gr_Liv'))

I also tried using grep() directly in the formula:

 simple_ames <- 
   recipe(Sale_Price ~ paste(grep("SF"), collapse = '+') + Neighborhood + 
   Gr_Liv_Area + Year_Built + Bldg_Type + Latitude, data = ames_train) %>% 
   step_log(Gr_Liv_Area, base = 10) %>% 
   step_other(Neighborhood, threshold = 0.01) %>%
   step_dummy(all_nominal_predictors()) %>% 
   step_interact(~ Gr_Liv_Area:starts_with('Bldg_Type_')) %>% 
   step_ns(Latitude, deg_free = 20) %>% 
   step_pca(matches('(SF$)|(Gr_Liv'))

I am using the examples from Tidy Modeling with R, https://www.tmwr.org/recipes, chapter 8.4.4 (the authors do not explain an efficient way to insert all those variables into the recipe).

Thanks

There are 2 answers

Answer by ThomasK81:

Inside recipe steps, only the selector functions from recipes and tidyselect are allowed; custom functions will fail. To do what you want, try:

step_pca(ends_with("SF"), contains("Gr_Liv"))

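If you insist on a regex, remember that tidyselect's matches() (which recipes uses) takes a regular expression and also accepts stringr patterns. Note that the pattern in the question, '(SF$)|(Gr_Liv', is missing its closing parenthesis. The following should work: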

step_pca(matches("SF$|Gr_Liv"))

You can always test a tidyselect selector by applying it to the data you are using, e.g. ames |> select(matches("SF$|Gr_Liv")). That helps confirm that you are operating on the predictors you intend.
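
As a rough illustration of that check, here is a minimal sketch on a toy data frame. The data frame `toy` and its values are made up for the example (the column names only mimic the Ames naming style), and the dplyr part runs only if dplyr is installed. The base R grep() call uses ignore.case = TRUE to mirror the default of matches().

```r
# Toy stand-in for the Ames data; names and values are illustrative assumptions.
toy <- data.frame(
  First_Flr_SF = c(900, 800),
  BsmtFin_SF_1 = c(2, 6),        # contains "SF" but does not end with it
  Gr_Liv_Area  = c(900, 1400),
  Lot_Area     = c(8000, 6000)
)

pattern <- "SF$|Gr_Liv"

# Base R view of which column names the regex matches.
matched_base <- grep(pattern, names(toy), value = TRUE, ignore.case = TRUE)
print(matched_base)

# The same check through the tidyselect helper re-exported by dplyr.
if (requireNamespace("dplyr", quietly = TRUE)) {
  matched_tidy <- names(dplyr::select(toy, dplyr::matches(pattern)))
  print(matched_tidy)
}
```

In this toy case BsmtFin_SF_1 is not selected, because "SF$" only matches names that end in "SF". That is the kind of surprise this check is meant to catch before the selector goes into step_pca().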

See also ?recipes::selections for a more thorough explanation.
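
If you would still like to build the formula programmatically, as attempted in the question, one possible sketch is to put every predictor name into a single character vector, call reformulate() once, and pass the resulting formula object itself to recipe() instead of adding it to other terms. The toy data frame and its values below are assumptions for illustration; with the real data you would use names(ames_train) and the predictors from the question. The recipe() call runs only if recipes is installed.

```r
# Toy stand-in for ames_train; names and values are illustrative assumptions.
toy <- data.frame(
  Sale_Price    = c(200000, 150000, 320000),
  First_Flr_SF  = c(900, 800, 1500),
  Second_Flr_SF = c(0, 600, 700),
  Gr_Liv_Area   = c(900, 1400, 2200),
  Neighborhood  = c("A", "B", "A"),
  Lot_Area      = c(8000, 6000, 12000)
)

# Collect the names ending in "SF", then add the other predictors by name.
sf_vars    <- grep("SF$", names(toy), value = TRUE)
predictors <- c(sf_vars, "Neighborhood", "Gr_Liv_Area")

# Build the whole formula in one step (reformulate() is in base R's stats).
f <- reformulate(predictors, response = "Sale_Price")
print(f)

# The formula object is passed as the first argument to recipe().
if (requireNamespace("recipes", quietly = TRUE)) {
  rec <- recipes::recipe(f, data = toy)
  print(summary(rec))
}
```

The attempts in the question fail because the formula interface does not evaluate objects or function calls placed among its terms: SF (a formula object) and paste(grep(...)) are not expanded into column names. Building the complete formula first avoids that.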

Answer by toku_mo:

For anyone who would have a similar problem, this is the solution:

First, pass all the data frame variables to the recipe. Second, use remove_role() to select all the variables you do not want as predictors (if you skip this, every variable in the data frame is treated as a predictor in the model). Third, do the pre-processing as planned.

# Assumes the tidymodels packages are loaded and ames_train exists, as in the question
simple_ames <- 
  recipe(Sale_Price ~ ., data = ames_train) %>% 
  # Remove the predictor role from every column except the ones listed here
  remove_role(
    -ends_with('SF'), 
    -c('Neighborhood', 'Gr_Liv_Area', 'Year_Built', 'Bldg_Type', 'Latitude'), 
    old_role = 'predictor'
  ) %>% 
  step_log(Gr_Liv_Area, base = 10) %>% 
  step_other(Neighborhood, threshold = 0.01) %>%
  step_dummy(all_nominal_predictors()) %>% 
  step_interact(~ Gr_Liv_Area:starts_with('Bldg_Type_')) %>% 
  step_ns(Latitude, deg_free = 20) %>% 
  step_pca(ends_with("SF"), contains("Gr_Liv"))