k fold cross validation in purrr and model

836 views Asked by At

I came across this example

library(mtcars)
set.seed(17)
cv.error.10 = rep(0,10)
for (i in 1:10){
    glm.fit = glm(mpg∼poly(horsepower ,i),data=Auto)
    cv.error.10[i] = cv.glm(Auto,glm.fit,K=10)$delta[1]
}

cv.error.10
[1] 24.21 19.19 19.31 19.34 18.88 19.02 18.90 19.71 18.95 19.50

I have been trying to pick up purrr and modelr. This seemed like a good example to try to replicate as it includes both a loop and cross validation. How would I convert this code to something more tidy verse like?

Update

With the below suggestions, this is where the code is at

data(mtcars)
cv_mtcars = mtcars %>%
  crossv_kfold(k = 5)
cv_models = cv_mtcars %>%
  mutate(model = map(train, ~lm(mpg ~ hp, data = .)),
         rmse_all_models = map2_dbl(model, test, ~rmse(.x, .y)))
print(cv_models)

What I would like to do is repeat this for increasing polynomials of hp such as hp^2, hp^3 etc. I am guessing there is a purr way to do this.

Update 2

Here is an example of the un-iterated code

data(mtcars)
cv_mtcars = mtcars %>%
  crossv_kfold(k = 5)
cv_models = cv_mtcars %>%
  mutate(model1 = map(train, ~lm(mpg ~ hp, data = .)),
         model2 = map(train, ~lm(mpg ~I(hp^2), data = .)),
         model3 = map(train, ~lm(mpg ~I(hp^3), data = .)),         
         model4 = map(train, ~lm(mpg ~I(hp^4), data = .)),
         model5 = map(train, ~lm(mpg ~I(hp^5), data = .)),
         model6 = map(train, ~lm(mpg ~I(hp^6), data = .)),
         rmse_all_models1 = map2_dbl(model1, test, ~rmse(.x, .y)),
         rmse_all_models2 = map2_dbl(model2, test, ~rmse(.x, .y)),
         rmse_all_models3 = map2_dbl(model3, test, ~rmse(.x, .y)),
         rmse_all_models4 = map2_dbl(model4, test, ~rmse(.x, .y)),
         rmse_all_models5 = map2_dbl(model5, test, ~rmse(.x, .y)),
         rmse_all_models6 = map2_dbl(model6, test, ~rmse(.x, .y)))
print(cv_models)
1

There are 1 answers

2
brodrigues On BEST ANSWER

I don't know the mtcars library but if you need access to the mtcars data, you can use the following:

data(mtcars)
library(tidyverse)
library(modelr)

Then you can create a list of resamples with cross_mc()

cv_mtcars = mtcars %>%
  crossv_mc(n = 50)

print(cv_mtcars)

Now you can train your model on the resamples. train is the column holding the data frames for training. I use mutate() to a column called model were I mapped the lm() function (or any other model) to the data.

cv_models = cv_mtcars %>%
  mutate(model = map(train, ~lm(mpg ~ horsepower, data = .)))

print(cv_models)

You can add the root mean square error with the rmse() function from modelr:

rmse_cv = cv_models %>%
  mutate(rmse_all_models = map2_dbl(model, test, ~rmse(.x, .y))) %>%
  pull(rmse_all_models)

print(rmse_cv)

You can then compute any statistic of the rmse() you need. If you're not familiar with the concept of list columns, this code can bea bit overwhelming. You can read more about list columns here: https://campus.datacamp.com/courses/exploratory-data-analysis-in-r-case-study/tidy-modeling-with-broom?ex=10&_escaped_fragment_=#skiponboarding

I'm on a public computer so I could not try the code, but it should work.

Update

So I misunderstood the question a little, here is some more pointers:

powers = seq(1:6)

create_form = function(power){
  rhs = substitute(I(hp^pow), list(pow=power))
  rlang::new_formula(quote(mpg), rhs)
}

This function creates formulae, and then you can map a sequence of powers to this function:

list_forms = map(seq(1,6), create_form)

And then map the resulting list to lm:

map(list_forms, lm, data=mtcars)

To integrate this in a pipe workflow you need to create a new function:

train_model = function(cv_data, form){
  cv_data %>%
  mutate(model = map(train, ~lm(form, data = .)))
}

Test it on one model:

test = train_model(cv_mtcars, list_forms[[1]])

And now run it on everything:

all_models = map(list_forms, train_model, cv_data=cv_mtcars)

Hope this helps.