modelr::bootstrap or broom::bootstrap and grouping problems

944 views Asked by At

I have one long dataset that is composed of several datasets resulting from multiple imputations (let's say 10 imputations). They have an id variable identifying the imputation. On each of these imputed datasets I would like to bootstrap 10 datasets. After the bootstrap, I want to run models on each (100, imputation bootstrap combinations).

In this example I am not sure whether to use the broom::bootstrap() function or the modelr::bootstrap() function. Furthermore, the grouping seems to be lost in my pipeline.

Here is a reproducible example using the mtcars dataset:

library(tidyverse)
library(broom)

cars <- mtcars %>%
  mutate(am = as.factor(am)) %>% # This is standing in for my imputation id variable
  group_by(am) 

Source: local data frame [32 x 11]
Groups: am [2]

# A tibble: 32 x 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs     am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fctr> <dbl> <dbl>
 1  21.0     6 160.0   110  3.90 2.620 16.46     0      1     4     4
 2  21.0     6 160.0   110  3.90 2.875 17.02     0      1     4     4
 3  22.8     4 108.0    93  3.85 2.320 18.61     1      1     4     1
 4  21.4     6 258.0   110  3.08 3.215 19.44     1      0     3     1
 5  18.7     8 360.0   175  3.15 3.440 17.02     0      0     3     2

As you can see the output is currently showing that there are two groups, as it should. In my dataset it would show there are 10, for each imputed dataset. Now:

cars2 <- cars %>%
  broom::bootstrap(10, by_group = TRUE)

cars2

Source: local data frame [32 x 11]
Groups: replicate [10]

# A tibble: 32 x 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs     am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fctr> <dbl> <dbl>
 1  21.0     6 160.0   110  3.90 2.620 16.46     0      1     4     4
 2  21.0     6 160.0   110  3.90 2.875 17.02     0      1     4     4
 3  22.8     4 108.0    93  3.85 2.320 18.61     1      1     4     1
 4  21.4     6 258.0   110  3.08 3.215 19.44     1      0     3     1

Now it looks as though there are only 10 groups representing each replicate. It didn't seem to preserve the prior grouping. At this point I would expect 20 total groups (2 x 10).

If I now do this:

cars3 <- cars2 %>%
  group_by(am)

cars3

Source: local data frame [32 x 11]
Groups: am [2]

# A tibble: 32 x 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs     am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fctr> <dbl> <dbl>
 1  21.0     6 160.0   110  3.90 2.620 16.46     0      1     4     4
 2  21.0     6 160.0   110  3.90 2.875 17.02     0      1     4     4
 3  22.8     4 108.0    93  3.85 2.320 18.61     1      1     4     1
 4  21.4     6 258.0   110  3.08 3.215 19.44     1      0     3     1

Now it seems like there are no replicates only groups for am.

Is there anyway to do the bootstrapping after i've grouped my original dataset. Also, ideally, after I bootstrap there should be an id that indicates which bootstrapped dataset i'm looking at.

In my ideal world my code should be able to do something like this:

cars <- mtcars %>%
  mutate(am = as.factor(am)) %>%
  group_by(am) %>%
  bootstrap(10, by_group = TRUE) %>%
  nest() %>% # create a condensed tidy dataset that has one row per imputation, bootstrap combo
  mutate(model = map(data, ~lm(mpg~, data = .)) # Create a model for each row
1

There are 1 answers

0
Brian On

I'm in the midst of trying to learn both modelr and purrr and they're really giving me a headache. I think I finally figured this one out though.

library(modelr)
library(dplyr)
library(tidyr)
library(broom)

Group the dataframe, then within each group, create 10 nested bootstrap replicates

mtcars %>% group_by(am) %>% 
    do(rs = modelr::bootstrap(., 10)) 
Source: local data frame [2 x 2]
Groups: <by row>

# A tibble: 2 x 2
     am                rs
* <dbl>            <list>
1     0 <tibble [10 x 2]>
2     1 <tibble [10 x 2]>

Regroup and unnest to expand into 2 columns for the bootstraps and an id

mtcars %>% group_by(am) %>% 
    do(rs = modelr::bootstrap(., 10)) %>% 
  group_by(am) %>% 
  unnest
# A tibble: 20 x 3
# Groups:   am [2]
      am          strap   .id
   <dbl>         <list> <chr>
 1     0 <S3: resample>    01
 2     0 <S3: resample>    02
 3     0 <S3: resample>    03
 4     0 <S3: resample>    04
 5     0 <S3: resample>    05
 6     0 <S3: resample>    06
 7     0 <S3: resample>    07
 8     0 <S3: resample>    08
 9     0 <S3: resample>    09
10     0 <S3: resample>    10
11     1 <S3: resample>    01
12     1 <S3: resample>    02
13     1 <S3: resample>    03
14     1 <S3: resample>    04
15     1 <S3: resample>    05
16     1 <S3: resample>    06
17     1 <S3: resample>    07
18     1 <S3: resample>    08
19     1 <S3: resample>    09
20     1 <S3: resample>    10

Group down to the lowest level of replication and create models

You have to use as.data.frame on the strap column to re-expand it to usable data. See ?resample. This one took me forever to figure out. It should just work like tidyr::unnest.

mtcars %>% group_by(am) %>% 
    do(rs = modelr::bootstrap(., 10)) %>% 
  group_by(am) %>% 
  unnest %>% 
  group_by(am, .id) %>% 
  do(model = lm(mpg~wt, data = as.data.frame(.$strap)))
Source: local data frame [20 x 3]
Groups: <by row>

# A tibble: 20 x 3
      am   .id    model
 * <dbl> <chr>   <list>
 1     0    01 <S3: lm>
 2     0    02 <S3: lm>
 3     0    03 <S3: lm>
 4     0    04 <S3: lm>
 5     0    05 <S3: lm>
 6     0    06 <S3: lm>
 7     0    07 <S3: lm>
 8     0    08 <S3: lm>
 9     0    09 <S3: lm>
10     0    10 <S3: lm>
11     1    01 <S3: lm>
12     1    02 <S3: lm>
13     1    03 <S3: lm>
14     1    04 <S3: lm>
15     1    05 <S3: lm>
16     1    06 <S3: lm>
17     1    07 <S3: lm>
18     1    08 <S3: lm>
19     1    09 <S3: lm>
20     1    10 <S3: lm>

Call your function/summary on each model

mtcars %>% group_by(am) %>% 
    do(rs = modelr::bootstrap(., 10)) %>% 
  group_by(am) %>% 
  unnest %>% 
  group_by(am, .id) %>% 
  do(model = lm(mpg~wt, data = as.data.frame(.$strap))) %>% 
  tidy(model)
# A tibble: 40 x 7
# Groups:   am, .id [20]
      am   .id        term  estimate std.error statistic      p.value
   <dbl> <chr>       <chr>     <dbl>     <dbl>     <dbl>        <dbl>
 1     0    01 (Intercept) 25.800592 2.1055145 12.253818 7.300379e-10
 2     0    01          wt -2.608827 0.5377694 -4.851201 1.497729e-04
 3     0    02 (Intercept) 37.012664 4.7369213  7.813654 5.023424e-07
 4     0    02          wt -5.272094 1.2884870 -4.091693 7.602571e-04
 5     0    03 (Intercept) 26.145563 2.2114832 11.822637 1.263234e-09
 6     0    03          wt -2.428845 0.5541412 -4.383080 4.056524e-04
 7     0    04 (Intercept) 31.502481 4.0753463  7.730013 5.806324e-07
 8     0    04          wt -3.584863 1.1510368 -3.114464 6.305972e-03
 9     0    05 (Intercept) 31.739921 2.2216473 14.286661 6.690920e-11
10     0    05          wt -3.716515 0.5627808 -6.603841 4.471168e-06
# ... with 30 more rows

Visualize

Note that I upped the number of bootstraps to 1000, takes about 10s.

mtcars %>% group_by(am) %>% 
    do(rs = modelr::bootstrap(., 1000)) %>% 
  group_by(am) %>% 
  unnest %>% 
  group_by(am, .id) %>% 
  do(model = lm(mpg~wt, data = as.data.frame(.$strap))) %>% 
  tidy(model) %>% 
  ggplot(aes(estimate)) + geom_histogram() +
  facet_grid(am ~ term, scales = "free_x", labeller = label_both)

enter image description here