Why do we need prep, bake, and juice in tidymodels?


I always fit my model and make predictions without calling prep(), bake(), or juice():

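library(tidymodels)

# lr_mod (model spec), rec (recipe), and train_data are defined earlier, as in the tutorial
rec_wflow <- 
  workflow() %>% 
  add_model(lr_mod) %>% 
  add_recipe(rec)

# fit() estimates the recipe and trains the model on train_data
data_fit <- 
  rec_wflow %>% 
  fit(data = train_data)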

Are these functions (prep, bake, juice) only used to visually check the preprocessing results, and not necessary for the fitting/training process?

What is the difference among prep/bake/juice in the R package "recipes"?

The above code is how I learned it in the official tutorial.

I've read on another blog that using train_data causes data leakage. I'd like to hear more about that: are these functions related to data leakage?

1 answer

neilfws (best answer)

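Short answer: you are correct. When a recipe is used in a workflow, as in your example, you do not need to call the pre-processing functions yourself: fit() estimates the recipe from the training data, and predict() applies those same estimates to new data.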
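As a rough check of that claim, here is a minimal sketch comparing the workflow route with doing prep() and bake() by hand. The mtcars split, the choice of predictors, and the names lm_mod, wf_fit, and manual_fit are assumptions made up for illustration, not taken from the question.

```r
library(tidymodels)

# Hypothetical train/test split of the built-in mtcars data (illustration only)
train_data <- mtcars[1:24, ]
test_data  <- mtcars[25:32, ]

# Recipe: center and scale the two predictors
rec <- recipe(mpg ~ wt + hp, data = train_data) %>%
  step_normalize(all_predictors())

lm_mod <- linear_reg() %>% set_engine("lm")

# Workflow route: fit() estimates the recipe on train_data, then fits the model
wf_fit <- workflow() %>%
  add_model(lm_mod) %>%
  add_recipe(rec) %>%
  fit(data = train_data)
wf_pred <- predict(wf_fit, new_data = test_data)

# Manual route: prep() the recipe, bake() the data, then fit and predict
rec_prepped <- prep(rec, training = train_data)
manual_fit <- lm_mod %>%
  fit(mpg ~ ., data = bake(rec_prepped, new_data = NULL))
manual_pred <- predict(
  manual_fit,
  new_data = bake(rec_prepped, new_data = test_data, all_predictors())
)

# The two sets of predictions should agree up to numerical tolerance
all.equal(wf_pred$.pred, manual_pred$.pred)
```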

This is touched on in the tutorial Handle class imbalance in #TidyTuesday climbing expedition data with tidymodels:

We’re going to use this recipe in a workflow() so we don’t need to stress a lot about whether to prep() or not. If you want to explore what the recipe is doing to your data, you can first prep() the recipe to estimate the parameters needed for each step and then bake(new_data = NULL) to pull out the training data with those steps applied.
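To make the quoted advice concrete, here is a small sketch of inspecting a recipe by hand. The mtcars split and the step_normalize() step are assumptions chosen for illustration. Note that juice() is superseded in recent versions of recipes; bake(new_data = NULL) does the same job.

```r
library(recipes)

# Hypothetical train/test split of the built-in mtcars data (illustration only)
train_data <- mtcars[1:24, ]
test_data  <- mtcars[25:32, ]

# A recipe that centers and scales every predictor
rec <- recipe(mpg ~ ., data = train_data) %>%
  step_normalize(all_predictors())

# prep() estimates the means and standard deviations from the training data only
rec_prepped <- prep(rec, training = train_data)

# bake(new_data = NULL) returns the training data with the steps applied
train_baked <- bake(rec_prepped, new_data = NULL)

# bake() on other data reuses the training-set estimates; nothing is re-estimated
test_baked <- bake(rec_prepped, new_data = test_data)

# Inspect the preprocessed data
head(train_baked)
head(test_baked)
```

Because the statistics come only from train_data, baking the test set this way does not leak test information into the preprocessing.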

I recommend all the tutorials on Julia Silge's blog for understanding tidymodels.