Writing a function in R to group by variable column from a data frame


I am trying to write a function that will allow me to produce descriptive statistics by grouping across multiple factors in a data frame. I have spent way too many hours trying to get my function to recognize the by variables I am selecting.

Here is the fake data:

grouping1 <- c("red", "blue", "blue", "green", "red", "blue", "red", "green")                 
grouping2 <- c("high", "high", "low", "medium", "low", "high", "medium", "high")                  
value <- c(22,40,72,41,36,16,88,99)

fake_df <- data.frame(grouping1, grouping2, value)

Fake code example:

library(dplyr)

by_group_fun <- function(fun.data.in, fun.grouping.factor){
  fake_df2 <- fun.data.in %>%
    group_by(fun.grouping.factor) %>%
    summarize(mean = mean(value), median = median(value))
  fake_df2
}
by_group_fun(fake_df, grouping1) 
by_group_fun(fake_df, grouping2) 

This gives me:

 Error in grouped_df_impl(data, unname(vars), drop) : 
  Column `fun.grouping.factor` is unknown

Second try

I tried to assign the by variable selected in the function to a new variable and carry that forward.

Fake code example (second try):

by_group_fun2 <- function(fun.data.in, fun.grouping.factor){
  fun.data.in$by_var <- fun.data.in$fun.grouping.factor

  fake_df2 <- fun.data.in %>%
    group_by(by_var) %>%
    summarize(mean = mean(value), median = median(value))
  fake_df2
}

by_group_fun2(fake_df, grouping1) 
by_group_fun2(fake_df, grouping2) 

The second try gives me:

 Error in grouped_df_impl(data, unname(vars), drop) : 
  Column `by_var` is unknown

There are 2 answers

CPak (best answer)

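Both attempts fail for the same reason: inside the function, group_by() and $ treat fun.grouping.factor as a literal column name instead of looking up the column you passed in. In the second try, fun.data.in$fun.grouping.factor is therefore NULL, so by_var is never created. To pass a bare column name through a function, capture it with enquo() and unquote it with !!. The example below shows only the grouping step; pipe its result into summarize() as in your original code.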

library(dplyr)

myfun <- function(df, thesecols) {
  thesecols <- enquo(thesecols)       # capture the bare column name (quote)
  df %>%
    group_by_at(vars(!!thesecols))    # !! unquotes it inside vars()
}

myfun(fake_df, grouping1)

Output

# A tibble: 8 x 3
# Groups:   grouping1 [3]
  grouping1 grouping2 value
     <fctr>    <fctr> <dbl>
1       red      high    22
2      blue      high    40
3      blue       low    72
4     green    medium    41
5       red       low    36
6      blue      high    16
7       red    medium    88
8     green      high    99
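To get the summary table the question asks for, the grouping step above can be combined with summarise(). The following is a minimal sketch, not the answerer's code: it assumes dplyr 1.0 or later (for `{{ }}`, across() and the .groups argument), and the names by_group_fun and by_group_chr are only illustrative. The first variant takes a bare column name, the second takes the column name as a string.

```r
library(dplyr)

# Same fake data as in the question.
fake_df <- data.frame(
  grouping1 = c("red", "blue", "blue", "green", "red", "blue", "red", "green"),
  grouping2 = c("high", "high", "low", "medium", "low", "high", "medium", "high"),
  value     = c(22, 40, 72, 41, 36, 16, 88, 99)
)

# Illustrative variant 1: the grouping column is passed as a bare name.
# {{ group_col }} is shorthand for enquo() followed by !!.
by_group_fun <- function(data, group_col) {
  data %>%
    group_by({{ group_col }}) %>%
    summarise(mean = mean(value), median = median(value), .groups = "drop")
}

# Illustrative variant 2: the grouping column is passed as a string
# (or a character vector of several column names).
by_group_chr <- function(data, group_col) {
  data %>%
    group_by(across(all_of(group_col))) %>%
    summarise(mean = mean(value), median = median(value), .groups = "drop")
}

res1 <- by_group_fun(fake_df, grouping1)
res2 <- by_group_chr(fake_df, "grouping2")
print(res1)
print(res2)
```

Both functions still assume the data has a column called value, as in the question. If that should vary as well, it could be passed and embraced the same way.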
alistaire

A really simple way to get the same output without resorting to programming with dplyr is to gather the grouping columns to long form. Grouping by both the resulting key and value columns will get all the combinations you're asking for without moving beyond a single data.frame:

library(tidyverse)

# tibble() replaces data_frame(), which is deprecated in current tibble releases
fake_df <- tibble(grouping1 = c("red", "blue", "blue", "green", "red", "blue", "red", "green"),
                  grouping2 = c("high", "high", "low", "medium", "low", "high", "medium", "high"),
                  value = c(22,40,72,41,36,16,88,99))

fake_df %>% 
    gather(group_var, group_val, -value) %>% 
    group_by(group_var, group_val) %>% 
    summarise(mean = mean(value), 
              median = median(value))
#> # A tibble: 6 x 4
#> # Groups:   group_var [?]
#>   group_var group_val     mean median
#>       <chr>     <chr>    <dbl>  <dbl>
#> 1 grouping1      blue 42.66667   40.0
#> 2 grouping1     green 70.00000   70.0
#> 3 grouping1       red 48.66667   36.0
#> 4 grouping2      high 44.25000   31.0
#> 5 grouping2       low 54.00000   54.0
#> 6 grouping2    medium 64.50000   64.5
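In current tidyr, gather() is superseded by pivot_longer(). A rough equivalent of the pipeline above, assuming tidyr 1.0 or later and dplyr 1.0 or later (the name long_summary is just for illustration), might look like this:

```r
library(dplyr)
library(tidyr)

# Same fake data as above.
fake_df <- tibble(
  grouping1 = c("red", "blue", "blue", "green", "red", "blue", "red", "green"),
  grouping2 = c("high", "high", "low", "medium", "low", "high", "medium", "high"),
  value     = c(22, 40, 72, 41, 36, 16, 88, 99)
)

long_summary <- fake_df %>%
  # Stack every column except `value` into name/value pairs.
  pivot_longer(-value, names_to = "group_var", values_to = "group_val") %>%
  # One group per (grouping column, level) combination.
  group_by(group_var, group_val) %>%
  summarise(mean = mean(value), median = median(value), .groups = "drop")

print(long_summary)
```

The .groups = "drop" argument returns an ungrouped tibble; leave it out if you want the result to stay grouped by group_var.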