I am trying to incorporate a tidymodels PCA step into the workflow of a model. I want to have a predictive model that uses PCA as a preprocessing step and then make predictions with that model.
I have tried the following approach:
diamonds <- diamonds %>%
  select(-clarity, -cut, -color)
diamonds_split <- initial_split(diamonds, prop = 4/5)
diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)
diamonds_test <- vfold_cv(diamonds_train)
diamonds_recipe <-
  # The basic formula and all the data (outcome ~ predictors)
  recipe(price ~ ., data = diamonds_train) %>%
  step_log(all_outcomes(), skip = TRUE) %>%
  step_normalize(all_predictors(), -all_nominal()) %>%
  step_pca(all_predictors())
preprocesados <- prep(diamonds_recipe)
linear_model <-
  linear_reg() %>%
  set_engine("glmnet") %>%
  set_mode("regression")
pca_workflow <- workflow() %>%
  add_recipe(diamonds_recipe) %>%
  add_model(linear_model)
lr_fitted_workflow <- pca_workflow %>% # option A: workflow, full dataset
  last_fit(diamonds_split)
performance <- lr_fitted_workflow %>% collect_metrics()
test_predictions <- lr_fitted_workflow %>% collect_predictions()
But I get this error:
x Resample1: model (predictions): Error: penalty should be a single numeric value. ...
Warning message:
“All models failed in [fit_resamples()]. See the .notes column.”
Following other tutorials, I tried another approach, but I don't know how to use the model to make new predictions, because the new data comes in the original (non-PCA) form. So I tried this:
pca_fit <- juice(preprocesados) %>% # option C: no workflow at all
  lm(price ~ ., data = .)
prep_test <- prep(diamonds_recipe, new_data = diamonds_test)
truths <- juice(prep_test) %>%
  select(price)
ans <- predict(pca_fit, new_data = prep_test)
tib <- tibble(row = 1:length(ans), ans, truths)
ggplot(data = tib) +
  geom_smooth(mapping = aes(x = row, y = ans, colour = "predicted")) +
  geom_smooth(mapping = aes(x = row, y = price, colour = "true"))
And it prints something that seems reasonable, but by this point I have lost confidence and some guidance would be much appreciated. :D
The problem is not in your recipe or the workflow. As described in chapter 7 of Tidy Modeling with R (TMwR), the function for fitting your model is fit, and for it to work you'll have to provide the data for the fitting process (here diamonds). In return, you don't have to prep your recipe, as the workflow takes care of this itself. So, reducing your code slightly, it is enough to pass the workflow and the data to fit.
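A minimal sketch of fitting the workflow and then predicting on data in its original (non-PCA) form. Two things here are my own assumptions and not taken from the question: the engine is swapped to "lm" to sidestep the penalty error, and price is logged up front with mutate instead of step_log(skip = TRUE), so that predictions and observed values are on the same scale. The object names diamonds_num and log_price_pred are just illustrative.

```r
library(tidymodels)

set.seed(123)

# Assumption: log the outcome before splitting instead of inside the recipe
diamonds_num <- diamonds %>%
  select(-clarity, -cut, -color) %>%
  mutate(price = log(price))

diamonds_split <- initial_split(diamonds_num, prop = 4/5)
diamonds_train <- training(diamonds_split)
diamonds_test  <- testing(diamonds_split)

# Unprepped recipe: the workflow preps it when fit() is called
diamonds_recipe <- recipe(price ~ ., data = diamonds_train) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors())

# Assumption: "lm" engine instead of "glmnet"
linear_model <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

pca_workflow <- workflow() %>%
  add_recipe(diamonds_recipe) %>%
  add_model(linear_model)

# Fit recipe + model on the training data in one go
pca_fit <- fit(pca_workflow, data = diamonds_train)

# New data goes in untransformed; the fitted workflow applies the PCA itself
log_price_pred <- predict(pca_fit, new_data = diamonds_test)
```

The fitted workflow carries the trained normalization and PCA loadings with it, which is why predict can take the raw columns directly.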
As for cross-validation, one has to use fit_resamples and should split the training set, not the testing set. But here I am currently getting the same error (my answer will be updated if I figure out why).
Edit
Now I've done a bit of digging, and the problem with cross-validation stems from the engine being glmnet. I am guessing that, of the many different aspects, this one has somehow been missed. I've added a possible issue to the workflows package GitHub site. Often the answers are quick in coming, so likely one of the developers will reply soon.
As for cross-validation, assume you instead fit using any of the other engines described in ?linear_reg. Then the workflow can be passed to fit_resamples, and where metrics are of interest these can indeed be collected as you did, using collect_metrics.
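A sketch of what the cross-validation step could look like, under the same assumptions of mine as before ("lm" engine, price logged up front so the metrics compare like with like). The names diamonds_folds, cv_results and cv_metrics are illustrative.

```r
library(tidymodels)

set.seed(123)

# Assumption: log the outcome before splitting instead of inside the recipe
diamonds_num <- diamonds %>%
  select(-clarity, -cut, -color) %>%
  mutate(price = log(price))

diamonds_split <- initial_split(diamonds_num, prop = 4/5)
diamonds_train <- training(diamonds_split)

# Folds are built from the training set; the test set stays untouched
diamonds_folds <- vfold_cv(diamonds_train, v = 10)

diamonds_recipe <- recipe(price ~ ., data = diamonds_train) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors())

# Assumption: "lm" engine instead of "glmnet"
linear_model <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

pca_workflow <- workflow() %>%
  add_recipe(diamonds_recipe) %>%
  add_model(linear_model)

# Recipe and model are re-estimated within every fold
cv_results <- fit_resamples(pca_workflow, resamples = diamonds_folds)

# Metrics averaged over the folds (rmse and rsq by default for regression)
cv_metrics <- collect_metrics(cv_results)
```

If glmnet is a requirement, the error message hints that a single penalty value is expected, so something like linear_reg(penalty = 0.01) may be the way to keep that engine; I have not shown that here.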
If we are interested in the predictions, you'll have to tell the model that you want to save these during the fitting process and then use collect_predictions.
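Saving the predictions could be sketched like this, using control_resamples(save_pred = TRUE). As before, the "lm" engine and the up-front log of price are my assumptions, and the object names are illustrative.

```r
library(tidymodels)

set.seed(123)

# Assumption: log the outcome before splitting instead of inside the recipe
diamonds_num <- diamonds %>%
  select(-clarity, -cut, -color) %>%
  mutate(price = log(price))

diamonds_split <- initial_split(diamonds_num, prop = 4/5)
diamonds_train <- training(diamonds_split)
diamonds_folds <- vfold_cv(diamonds_train, v = 10)

diamonds_recipe <- recipe(price ~ ., data = diamonds_train) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors())

# Assumption: "lm" engine instead of "glmnet"
linear_model <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

pca_workflow <- workflow() %>%
  add_recipe(diamonds_recipe) %>%
  add_model(linear_model)

# Ask fit_resamples to keep the held-out predictions of every fold
cv_results <- fit_resamples(
  pca_workflow,
  resamples = diamonds_folds,
  control = control_resamples(save_pred = TRUE)
)

# One row per held-out observation, with an id column naming the fold
cv_predictions <- collect_predictions(cv_results)
```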
Note, however, that the output from this is the predictions from each fold, as you are literally fitting 10 models.
Usually cross-validation is used to compare multiple models or tuning parameters (e.g. random forest vs. linear model). The model with the best cross-validation performance (collect_metrics) would then be selected for use, and the test dataset would be used to evaluate this model's accuracy. This is all described in TMwR chapters 10 & 11.
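Once a model has been chosen, the final evaluation on the test set could be sketched with last_fit, which fits on the training portion of the split and evaluates on the testing portion. Again, the "lm" engine and the up-front log of price are my assumptions, not part of the question, and final_fit, test_metrics and test_predictions are illustrative names.

```r
library(tidymodels)

set.seed(123)

# Assumption: log the outcome before splitting instead of inside the recipe
diamonds_num <- diamonds %>%
  select(-clarity, -cut, -color) %>%
  mutate(price = log(price))

diamonds_split <- initial_split(diamonds_num, prop = 4/5)
diamonds_train <- training(diamonds_split)
diamonds_test  <- testing(diamonds_split)

diamonds_recipe <- recipe(price ~ ., data = diamonds_train) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors())

# Assumption: "lm" engine instead of "glmnet"
linear_model <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

pca_workflow <- workflow() %>%
  add_recipe(diamonds_recipe) %>%
  add_model(linear_model)

# Fit on the training set, evaluate once on the test set
final_fit <- last_fit(pca_workflow, split = diamonds_split)

test_metrics     <- collect_metrics(final_fit)
test_predictions <- collect_predictions(final_fit)
```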