Fit model using each predictor columns indiviually store results in dataframe

Question

Fit model using each predictor columns indiviually store results in dataframe

2.2k views Asked by nofunsally At 03 January 2025 at 06:15

I have a dataframe with one column of a response variable, and several columns of predictor variables. I want to fit models for the response variable using each of the predictor variables separately, finally creating a dataframe that contains the coefficients of the model. Previously, I would have done this:

data(iris)

iris_vars <- c("Sepal.Width", "Petal.Length", "Petal.Width")
fits.iris <- lapply(iris_vars, function(x) {lm(substitute(Sepal.Length ~ i, list(i = as.name(x))), data = iris)})

# extract model coeffs, so forth and so on, eventually combining into a result dataframe
iris.p <- as.data.frame(lapply(fits.iris, function(f) summary(f)$coefficients[,4]))
iris.r <- as.data.frame(lapply(fits.iris, function(f) summary(f)$r.squared))

However, this seems a little cumbersome now that I have begun to use dplyr, broom, etc. Using purrr::map I can more or less recreate this list of models:

# using purrr, still uses the Response variable "Sepal.Length" as a predictor of itself
iris %>%
select(1:4) %>%
# names(select(., 2:4)) %>% this does not work 
names() %>%
paste('Sepal.Length ~', .) %>% 
map(~lm(as.formula(.x), data = iris))

However, I am unsure how to get this list into an appropriate form to use with broom::tidy. If I was doing using grouped rows, and not columns, I would store the model fits and use broom::tidy to do something like this:

iris.fits <- group_by(Species) %>% do(modfit1 = lm(Sepal.Length~Sepal.Width,data=.))
tidy(iris.fits, modfit1)

Of course this isn't what I am doing, but I was hoping there was similar procedure when using columns of data. Is there way, perhaps to use purrr::nest or something similar to create the desired output?

Original Q&A

There are 4 answers

Julia Silge On 29 December 2016 at 04:14

If you are up for a little work in setting up a quasi-nested data frame with list-columns to get started, the map/model/unnest/tidy step falls out quite nicely.

First, set up your data frame:

> library(dplyr)
> 
> nested_df <- data_frame(data = list(iris %>% 
                                          select(response = Sepal.Length, 
                                                 predictor = Sepal.Width), 
                                      iris %>% 
                                          select(response = Sepal.Length, 
                                                 predictor = Petal.Length),
                                      iris %>% 
                                          select(response = Sepal.Length, 
                                                 predictor = Petal.Width)))
> 
> nested_df
# A tibble: 3 × 1
                    data
                  <list>
1 <data.frame [150 × 2]>
2 <data.frame [150 × 2]>
3 <data.frame [150 × 2]>

Now use purrr, tidyr, and broom to get out the results of the modeling.

> library(tidyr)
> library(purrr)
> library(broom)
> 
> nested_df %>%
      mutate(models = map(data, ~ lm(response ~ predictor, .))) %>%
      unnest(map(models, tidy))
# A tibble: 6 × 5
         term   estimate  std.error statistic       p.value
        <chr>      <dbl>      <dbl>     <dbl>         <dbl>
1 (Intercept)  6.5262226 0.47889634 13.627631  6.469702e-28
2   predictor -0.2233611 0.15508093 -1.440287  1.518983e-01
3 (Intercept)  4.3066034 0.07838896 54.938900 2.426713e-100
4   predictor  0.4089223 0.01889134 21.646019  1.038667e-47
5 (Intercept)  4.7776294 0.07293476 65.505517 3.340431e-111
6   predictor  0.8885803 0.05137355 17.296454  2.325498e-37

You can use filter to extract out just the slopes (term == "predictor") or you could use glance instead of tidy in the last line of code to get those results.

LmW. On 14 May 2017 at 01:33

My answer is similar in spirit to Julia Silge and wysiwyg's, but I want to avoid manually typing variable names and keep the names for response and predictor in the formula of the model object:

require(tibble)
require(dplyr)
require(tidyr)
require(purrr)
require(broom)

df <- iris
response_var <- "Sepal.Length"

vars <- tibble(response=response_var,
               predictor=setdiff(names(df), response_var))

compose_formula <- function(x, y)
  as.formula(paste0("~lm(", y, "~", x, ", data=.)"))

models <- tibble(data=list(df)) %>%
           crossing(vars) %>%
           mutate(fmla = map2(predictor, response, compose_formula),
                  model = map2(data, fmla, ~at_depth(.x, 0, .y)))

models %>% unnest(map(model, tidy))
models %>% unnest(map(model, glance), .drop=T)

Output:

# A tibble: 9 x 7
      response    predictor              term   estimate  std.error statistic
         <chr>        <chr>             <chr>      <dbl>      <dbl>     <dbl>
1 Sepal.Length  Sepal.Width       (Intercept)  6.5262226 0.47889634 13.627631
2 Sepal.Length  Sepal.Width       Sepal.Width -0.2233611 0.15508093 -1.440287
3 Sepal.Length Petal.Length       (Intercept)  4.3066034 0.07838896 54.938900
4 Sepal.Length Petal.Length      Petal.Length  0.4089223 0.01889134 21.646019
5 Sepal.Length  Petal.Width       (Intercept)  4.7776294 0.07293476 65.505517
6 Sepal.Length  Petal.Width       Petal.Width  0.8885803 0.05137355 17.296454
7 Sepal.Length      Species       (Intercept)  5.0060000 0.07280222 68.761639
8 Sepal.Length      Species Speciesversicolor  0.9300000 0.10295789  9.032819
9 Sepal.Length      Species  Speciesvirginica  1.5820000 0.10295789 15.365506
# ... with 1 more variables: p.value <dbl>

and

# A tibble: 4 x 13
      response    predictor  r.squared adj.r.squared     sigma  statistic
         <chr>        <chr>      <dbl>         <dbl>     <dbl>      <dbl>
1 Sepal.Length  Sepal.Width 0.01382265   0.007159294 0.8250966   2.074427
2 Sepal.Length Petal.Length 0.75995465   0.758332718 0.4070745 468.550154
3 Sepal.Length  Petal.Width 0.66902769   0.666791387 0.4779948 299.167312
4 Sepal.Length      Species 0.61870573   0.613518054 0.5147894 119.264502
# ... with 7 more variables: p.value <dbl>, df <int>, logLik <dbl>, AIC <dbl>,
#   BIC <dbl>, deviance <dbl>, df.residual <int>

Matt Mulvahill On 06 February 2017 at 23:31

Another option for the quasi-nested dataframe that would make quick work of a larger number of predictor variables would be to use purrr::map_df. This can also give you the predictor as a column in the enclosing dataframe. Then follow Julia's example for fitting the models.

> library(dplyr)
> library(purrr)
>
> def_nested_df <- function(x) {
    data_frame("covariate" = x,
               "data" = list(iris %>% tbl_df %>%
                               select_("response" = "Sepal.Length", 
                                       "predictor" = x)))
  }    
> 
> nested_df <- 
    c("Sepal.Width", "Petal.Length", "Petal.Width") %>%
    map_df(def_nested_df)
>
> nested_df
# A tibble: 3 × 2
     covariate               data
         <chr>             <list>
1  Sepal.Width <tibble [150 × 2]>
2 Petal.Length <tibble [150 × 2]>
3  Petal.Width <tibble [150 × 2]>
>
> nested_df[[1, "data"]]
# A tibble: 150 × 2
   response predictor
      <dbl>     <dbl>
1       5.1       3.5
2       4.9       3.0
3       4.7       3.2
4       4.6       3.1
5       5.0       3.6
6       5.4       3.9
7       4.6       3.4
8       5.0       3.4
9       4.4       2.9
10      4.9       3.1
# ... with 140 more rows

**G. Grothendieck** · Accepted Answer · 2016-12-20T17:11:23+00:00

1) This gives the glance and tidy output for the model fits:

library(broom)

make_model <- function(nm) lm(iris[c("Sepal.Length", nm)])
fits <- Map(make_model, iris_vars)

glance_tidy <- function(x) c(unlist(glance(x)), unlist(tidy(x)[, -1]))
out <- sapply(fits, glance_tidy)

1a) or as a magrittr pipeline:

library(magrittr)
out <- iris_vars %>% Map(f = make_model) %>% sapply(glance_tidy)

Either one gives the following matrix:

> out
                Sepal.Width   Petal.Length    Petal.Width
r.squared      1.382265e-02   7.599546e-01   6.690277e-01
adj.r.squared  7.159294e-03   7.583327e-01   6.667914e-01
sigma          8.250966e-01   4.070745e-01   4.779948e-01
statistic      2.074427e+00   4.685502e+02   2.991673e+02
p.value        1.518983e-01   1.038667e-47   2.325498e-37
df             2.000000e+00   2.000000e+00   2.000000e+00
logLik        -1.829958e+02  -7.702021e+01  -1.011107e+02
AIC            3.719917e+02   1.600404e+02   2.082215e+02
BIC            3.810236e+02   1.690723e+02   2.172534e+02
deviance       1.007561e+02   2.452503e+01   3.381489e+01
df.residual    1.480000e+02   1.480000e+02   1.480000e+02
estimate1      6.526223e+00   4.306603e+00   4.777629e+00
estimate2     -2.233611e-01   4.089223e-01   8.885803e-01
std.error1     4.788963e-01   7.838896e-02   7.293476e-02
std.error2     1.550809e-01   1.889134e-02   5.137355e-02
statistic1     1.362763e+01   5.493890e+01   6.550552e+01
statistic2    -1.440287e+00   2.164602e+01   1.729645e+01
p.value1       6.469702e-28  2.426713e-100  3.340431e-111
p.value2       1.518983e-01   1.038667e-47   2.325498e-37

or transposed:

> t(out)
              r.squared adj.r.squared     sigma  statistic      p.value df
Sepal.Width  0.01382265   0.007159294 0.8250966   2.074427 1.518983e-01  2
Petal.Length 0.75995465   0.758332718 0.4070745 468.550154 1.038667e-47  2
Petal.Width  0.66902769   0.666791387 0.4779948 299.167312 2.325498e-37  2
                 logLik      AIC      BIC  deviance df.residual estimate1
Sepal.Width  -182.99584 371.9917 381.0236 100.75610         148  6.526223
Petal.Length  -77.02021 160.0404 169.0723  24.52503         148  4.306603
Petal.Width  -101.11073 208.2215 217.2534  33.81489         148  4.777629
              estimate2 std.error1 std.error2 statistic1 statistic2
Sepal.Width  -0.2233611 0.47889634 0.15508093   13.62763  -1.440287
Petal.Length  0.4089223 0.07838896 0.01889134   54.93890  21.646019
Petal.Width   0.8885803 0.07293476 0.05137355   65.50552  17.296454
                  p.value1     p.value2
Sepal.Width   6.469702e-28 1.518983e-01
Petal.Length 2.426713e-100 1.038667e-47
Petal.Width  3.340431e-111 2.325498e-37

2) If we remove the first unlist from the glance_tidy function definition then we get a 2d list (rather than a 2d numeric matrix):

glance_tidy_l <- function(x) c(glance(x), unlist(tidy(x)[, -1]))
iris_vars %>% Map(f = make_model) %>% sapply(glance_tidy_l)

              Sepal.Width  Petal.Length  Petal.Width  
r.squared     0.01382265   0.7599546     0.6690277    
adj.r.squared 0.007159294  0.7583327     0.6667914    
sigma         0.8250966    0.4070745     0.4779948    
statistic     2.074427     468.5502      299.1673     
p.value       0.1518983    1.038667e-47  2.325498e-37 
df            2            2             2            
logLik        -182.9958    -77.02021     -101.1107    
AIC           371.9917     160.0404      208.2215     
BIC           381.0236     169.0723      217.2534     
deviance      100.7561     24.52503      33.81489     
df.residual   148          148           148          
estimate1     6.526223     4.306603      4.777629     
estimate2     -0.2233611   0.4089223     0.8885803    
std.error1    0.4788963    0.07838896    0.07293476   
std.error2    0.1550809    0.01889134    0.05137355   
statistic1    13.62763     54.9389       65.50552     
statistic2    -1.440287    21.64602      17.29645     
p.value1      6.469702e-28 2.426713e-100 3.340431e-111
p.value2      0.1518983    1.038667e-47  2.325498e-37

TechQA.

Fit model using each predictor columns indiviually store results in dataframe

There are 4 answers

Related Questions in R

Related Questions in DPLYR

Related Questions in PURRR

Related Questions in BROOM

Popular Questions

Popular Tags

Trending Questions