How to incorporate a tidymodels PCA step into a model workflow and make predictions


I am trying to incorporate a tidymodels PCA step into a model workflow. I want a predictive model that uses PCA as a preprocessing step, and then I want to make predictions with that model.

I have tried the following approach:

library(tidymodels)

# Keep only the numeric columns
diamonds <- diamonds %>%
  select(-clarity, -cut, -color)

diamonds_split <- initial_split(diamonds, prop = 4/5)

diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)

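# Cross-validation folds built from the training set
# (stored under their own name so the test set is not overwritten)
diamonds_folds <- vfold_cv(diamonds_train)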

diamonds_recipe <- 
  # The basic formula and all the data (outcome ~ predictors)
  recipe(price ~ ., data = diamonds_train) %>%
  step_log(all_outcomes(), skip = TRUE) %>%
  step_normalize(all_predictors(), -all_nominal()) %>% 
  step_pca(all_predictors())

preprocesados <- prep(diamonds_recipe)

linear_model <- 
  linear_reg() %>%
  set_engine("glmnet") %>%
  set_mode("regression")

pca_workflow <- workflow() %>%
  add_recipe(diamonds_recipe) %>%
  add_model(linear_model)

lr_fitted_workflow <-  pca_workflow %>%  #option A workflow full dataset
  last_fit(diamonds_split)

performance <- lr_fitted_workflow %>% collect_metrics()

test_predictions <- lr_fitted_workflow %>% collect_predictions()

But I get this error:

x Resample1: model (predictions): Error: penalty should be a single numeric value. ... Warning message: “All models failed in [fit_resamples()]. See the .notes column.”

Following other tutorials, I tried a different approach, but I don't know how to use the model to make new predictions, because the new data comes in the original (non-PCA) form. So I tried this:

pca_fit <- juice(preprocesados) %>%  #option C no work flow at all
  lm(price ~ ., data = .)

prep_test <- prep(diamonds_recipe, new_data = diamonds_test)

truths <- juice(prep_test) %>%
          select(price)

ans <- predict(pca_fit, new_data = prep_test)

tib <- tibble(row = 1:length(ans),ans, truths)

ggplot(data = tib) +
  geom_smooth(mapping = aes(x = row, y = ans, colour = "predicted")) +
  geom_smooth(mapping = aes(x = row, y = price, colour = "true")) 

And it prints something that seems reasonable, but by this point I have lost confidence, and some guidance would be much appreciated. :D

There is 1 answer

Oliver On

The problem is not in your recipe or the workflow. As described in chapter 7 of Tidy Modeling with R, the function for fitting your model is fit, and for it to work you have to provide the data for the fitting process (here diamonds_train). The upside is that you don't have to prep your recipe yourself, as the workflow takes care of this.

So reducing your code slightly, the example below will work.

library(tidymodels)
data(diamonds)
diamonds <- diamonds %>%
  select(-clarity, -cut, -color)

diamonds_split <- initial_split(diamonds, prop = 4/5)

diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)

diamonds_recipe <- 
  # The basic formula and all the data (outcome ~ predictors)
  recipe(price ~ ., data = diamonds_train) %>%
  step_log(all_outcomes(), skip = TRUE) %>%
  step_normalize(all_predictors(), -all_nominal()) %>% 
  step_pca(all_predictors())

linear_model <- 
  linear_reg() %>%
  set_engine("glmnet") %>%
  set_mode("regression")

pca_workflow <- workflow() %>%
  add_recipe(diamonds_recipe) %>%
  add_model(linear_model)

pca_fit <- fit(pca_workflow, data = diamonds_train)
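To address the "how do I predict on new data" part of the question, here is a minimal sketch, assuming the lm engine rather than glmnet so that no penalty is needed. The names diamonds_num, rec, wf and wf_fit, and the seed, are illustrative and not from the original post. A fitted workflow applies the normalization and PCA to new data itself, so raw (non-PCA) rows can be passed straight to predict. Because step_log uses skip = TRUE, the model is trained on log(price), so the predictions come back on the log scale and are exponentiated here before being compared with price.

```r
library(tidymodels)

data(diamonds, package = "ggplot2")
diamonds_num <- diamonds %>% select(-clarity, -cut, -color)

set.seed(123)  # arbitrary seed, only for reproducibility
split <- initial_split(diamonds_num, prop = 4/5)

rec <- recipe(price ~ ., data = training(split)) %>%
  step_log(all_outcomes(), skip = TRUE) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors())

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(linear_reg() %>% set_engine("lm"))

# fit() preps the recipe on the training data and fits the model
wf_fit <- fit(wf, data = training(split))

# predict() bakes the raw test rows with the trained recipe, then predicts;
# .pred is log(price), so exp() maps it back to the price scale
test_pred <- predict(wf_fit, new_data = testing(split)) %>%
  mutate(.pred_price = exp(.pred)) %>%
  bind_cols(testing(split) %>% select(price))

# The same thing without a workflow: prep once, bake the new data,
# and call base R's predict() with its `newdata` argument
rec_prepped <- prep(rec)
lm_fit <- lm(price ~ ., data = bake(rec_prepped, new_data = NULL))
lm_pred <- predict(lm_fit, newdata = bake(rec_prepped, new_data = testing(split)))
```

In the no-workflow variant, note that prep() takes the training data and bake() takes the new data, and that base R's predict() expects newdata rather than new_data.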

As for cross-validation, one has to use fit_resamples, and the folds should be created from the training set, not the testing set. But here I am currently getting the same error (I will update my answer if I figure out why).

Edit

After a bit of digging, the problem with cross-validation appears to stem from the engine being glmnet. My guess is that, among the many cases the package has to handle, this one has somehow been missed. I've opened an issue about it on the workflows package GitHub site. Answers there are often quick in coming, so one of the developers will likely reply soon.
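One possible explanation, offered as an assumption and not as a confirmed diagnosis: the error text says that penalty should be a single numeric value, and as far as I know parsnip needs one penalty value in order to predict from a glmnet fit. A minimal sketch that supplies one is below. The value 0.01, the seed, the 5 folds and the names glmnet_model and glmnet_workflow are arbitrary choices for illustration, and the glmnet package must be installed. The outcome is logged before splitting, instead of with step_log(skip = TRUE), so that the resampling metrics compare predictions and observed values on the same scale.

```r
library(tidymodels)

data(diamonds, package = "ggplot2")
diamonds_num <- diamonds %>%
  select(-clarity, -cut, -color) %>%
  mutate(price = log(price))  # log the outcome up front (illustrative choice)

set.seed(123)  # arbitrary seed, only for reproducibility
split <- initial_split(diamonds_num, prop = 4/5)
folds <- vfold_cv(training(split), v = 5)

rec <- recipe(price ~ ., data = training(split)) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors())

# A single, fixed penalty; 0.01 is a placeholder, not a tuned value
glmnet_model <- linear_reg(penalty = 0.01, mixture = 1) %>%
  set_engine("glmnet") %>%
  set_mode("regression")

glmnet_workflow <- workflow() %>%
  add_recipe(rec) %>%
  add_model(glmnet_model)

glmnet_folds_fit <- fit_resamples(glmnet_workflow, resamples = folds)
glmnet_metrics <- collect_metrics(glmnet_folds_fit)
```

If the penalty should be chosen from the data, it could instead be marked with tune() and searched with tune_grid, which is outside the scope of this sketch.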

As for cross-validation, if you instead fit using any of the other engines described in ?linear_reg, it can be done like this:

linear_model_base <- 
  linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")
pca_workflow <- update_model(pca_workflow, linear_model_base)
folds <- vfold_cv(diamonds_train, v = 10)
pca_folds_fit <- fit_resamples(pca_workflow, resamples = folds)

If the metrics are of interest, they can indeed be collected with collect_metrics, as you did:

pca_folds_fit %>% collect_metrics()

If you are interested in the predictions, you have to tell fit_resamples to save them during the fitting process and then use collect_predictions:

pca_folds_fit <- fit_resamples(
  pca_workflow,
  resamples = folds,
  control = control_resamples(save_pred = TRUE)
)
collect_predictions(pca_folds_fit)

Note, however, that the output contains the predictions from each fold, as you are literally fitting 10 models.

Usually cross-validation is used to compare multiple models or tuning parameters (e.g. random forest vs. linear model). The model with the best cross-validation performance (from collect_metrics) would then be selected, and the test dataset would be used to evaluate that model's accuracy. This is all described in TMwR chapters 10 and 11.
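The selection procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the two candidates (a linear model with and without the PCA step), num_comp = 3, the seed, the 5 folds and every object name are my own choices, not something from the question. The price is logged before splitting so that the metrics are computed on a single scale.

```r
library(tidymodels)

data(diamonds, package = "ggplot2")
diamonds_num <- diamonds %>%
  select(-clarity, -cut, -color) %>%
  mutate(price = log(price))

set.seed(123)  # arbitrary seed, only for reproducibility
split <- initial_split(diamonds_num, prop = 4/5)
train <- training(split)
folds <- vfold_cv(train, v = 5)

lm_model <- linear_reg() %>% set_engine("lm")

# Two candidate preprocessing recipes: normalized predictors vs. PCA scores
rec_plain <- recipe(price ~ ., data = train) %>%
  step_normalize(all_predictors())
rec_pca <- rec_plain %>%
  step_pca(all_predictors(), num_comp = 3)

wf_plain <- workflow() %>% add_recipe(rec_plain) %>% add_model(lm_model)
wf_pca <- workflow() %>% add_recipe(rec_pca) %>% add_model(lm_model)

# Cross-validate both candidates on the same folds
cv_results <- bind_rows(
  plain = collect_metrics(fit_resamples(wf_plain, resamples = folds)),
  pca = collect_metrics(fit_resamples(wf_pca, resamples = folds)),
  .id = "candidate"
)

# Pick the candidate with the lowest mean RMSE across folds
best <- cv_results %>%
  filter(.metric == "rmse") %>%
  slice_min(mean, n = 1, with_ties = FALSE) %>%
  pull(candidate)

# Refit the winner on the full training set and evaluate once on the test set
final_wf <- if (best == "pca") wf_pca else wf_plain
final_fit <- last_fit(final_wf, split)
test_metrics <- collect_metrics(final_fit)
```

The test set is touched only once, by last_fit, after the choice between candidates has already been made on the folds.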