Spread multiple columns in a function

3.4k views Asked by At

Often I need to spread multiple value columns, as in this question. But I do it often enough that I'd like to be able to write a function that does this.

For example, given the data:

set.seed(42)
dat <- data_frame(id = rep(1:2,each = 2),
                  grp = rep(letters[1:2],times = 2),
                  avg = rnorm(4),
                  sd = runif(4))
> dat
# A tibble: 4 x 4
     id   grp        avg        sd
  <int> <chr>      <dbl>     <dbl>
1     1     a  1.3709584 0.6569923
2     1     b -0.5646982 0.7050648
3     2     a  0.3631284 0.4577418
4     2     b  0.6328626 0.7191123

I'd like to create a function that returns something like:

# A tibble: 2 x 5
     id     a_avg      b_avg      a_sd      b_sd
  <int>     <dbl>      <dbl>     <dbl>     <dbl>
1     1 1.3709584 -0.5646982 0.6569923 0.7050648
2     2 0.3631284  0.6328626 0.4577418 0.7191123

How can I do that?

3

There are 3 answers

2
joran On BEST ANSWER

We'll return to the answer provided in the question linked to, but for the moment let's start with a more naive approach.

One idea would be to spread each value column individually, and then join the results, i.e.

library(dplyr)
library(tidyr)
library(tibble)

dat_avg <- dat %>% 
    select(-sd) %>%
    spread(key = grp,value = avg) %>%
    rename(a_avg = a,
           b_avg = b)

dat_sd <- dat %>% 
    select(-avg) %>%
    spread(key = grp,value = sd) %>%
    rename(a_sd = a,
           b_sd = b)

> full_join(dat_avg,
          dat_sd,
          by = 'id')

# A tibble: 2 x 5
     id     a_avg      b_avg      a_sd      b_sd
  <int>     <dbl>      <dbl>     <dbl>     <dbl>
1     1 1.3709584 -0.5646982 0.6569923 0.7050648
2     2 0.3631284  0.6328626 0.4577418 0.7191123

(I used a full_join just in case we run into situations where not all combinations of the join columns appear in all of them.)

Let's start with a function that works like spread but allows you to pass the key and value columns as characters:

spread_chr <- function(data, key_col, value_cols, fill = NA, 
                       convert = FALSE,drop = TRUE,sep = NULL){
    n_val <- length(value_cols)
    result <- vector(mode = "list", length = n_val)
    id_cols <- setdiff(names(data), c(key_col,value_cols))

    for (i in seq_along(result)){
        result[[i]] <- spread(data = data[,c(id_cols,key_col,value_cols[i]),drop = FALSE],
                              key = !!key_col,
                              value = !!value_cols[i],
                              fill = fill,
                              convert = convert,
                              drop = drop,
                              sep = paste0(sep,value_cols[i],sep))
    }

    result %>%
        purrr::reduce(.f = full_join, by = id_cols)
}

> dat %>%
  spread_chr(key_col = "grp",
             value_cols = c("avg","sd"),
             sep = "_")

# A tibble: 2 x 5
     id grp_avg_a  grp_avg_b  grp_sd_a  grp_sd_b
  <int>     <dbl>      <dbl>     <dbl>     <dbl>
1     1 1.3709584 -0.5646982 0.6569923 0.7050648
2     2 0.3631284  0.6328626 0.4577418 0.7191123

The key ideas here are to unquote the arguments key_col and value_cols[i] using the !! operator, and using the sep argument in spread to control the resulting value column names.

If we wanted to convert this function to accept unquoted arguments for the key and value columns, we could modify it like so:

spread_nq <- function(data, key_col,..., fill = NA, 
                      convert = FALSE, drop = TRUE, sep = NULL){
    val_quos <- rlang::quos(...)
    key_quo <- rlang::enquo(key_col)
    value_cols <- unname(tidyselect::vars_select(names(data),!!!val_quos))
    key_col <- unname(tidyselect::vars_select(names(data),!!key_quo))

    n_val <- length(value_cols)
    result <- vector(mode = "list",length = n_val)
    id_cols <- setdiff(names(data),c(key_col,value_cols))

    for (i in seq_along(result)){
        result[[i]] <- spread(data = data[,c(id_cols,key_col,value_cols[i]),drop = FALSE],
                              key = !!key_col,
                              value = !!value_cols[i],
                              fill = fill,
                              convert = convert,
                              drop = drop,
                              sep = paste0(sep,value_cols[i],sep))
    }

    result %>%
        purrr::reduce(.f = full_join,by = id_cols)
}

> dat %>%
  spread_nq(key_col = grp,avg,sd,sep = "_")

# A tibble: 2 x 5
     id grp_avg_a  grp_avg_b  grp_sd_a  grp_sd_b
  <int>     <dbl>      <dbl>     <dbl>     <dbl>
1     1 1.3709584 -0.5646982 0.6569923 0.7050648
2     2 0.3631284  0.6328626 0.4577418 0.7191123

The change here is that we capture the unquoted arguments with rlang::quos and rlang::enquo and then simply convert them back to characters using tidyselect::vars_select.

Returning to the solution in the linked question that uses a sequence of gather, unite and spread, we can use what we've learned to make a function like this:

spread_nt <- function(data,key_col,...,fill = NA,
                      convert = TRUE,drop = TRUE,sep = "_"){
  key_quo <- rlang::enquo(key_col)
  val_quos <- rlang::quos(...)
  value_cols <- unname(tidyselect::vars_select(names(data),!!!val_quos))
  key_col <- unname(tidyselect::vars_select(names(data),!!key_quo))

  data %>%
    gather(key = ..var..,value = ..val..,!!!val_quos) %>%
    unite(col = ..grp..,c(key_col,"..var.."),sep = sep) %>%
    spread(key = ..grp..,value = ..val..,fill = fill,
           convert = convert,drop = drop,sep = NULL)
}

> dat %>%
  spread_nt(key_col = grp,avg,sd,sep = "_")

# A tibble: 2 x 5
     id     a_avg      a_sd      b_avg      b_sd
* <int>     <dbl>     <dbl>      <dbl>     <dbl>
1     1 1.3709584 0.6569923 -0.5646982 0.7050648
2     2 0.3631284 0.4577418  0.6328626 0.7191123

This relies on the same techniques from rlang from the last example. We're using some unusual names like ..var.. for our intermediate variables in order to reduce the chances of name collisions with existing columns in our data frame.

Also, we're using the sep argument in unite to control the resulting column names, so in this case when we spread we force sep = NULL.

0
guyabel On

Since tidyr version 1.0.0

tidyr::pivot_wider(data = dat, id_cols = id, names_from = grp, values_from = avg:sd) 
# # A tibble: 2 x 5
#      id avg_a  avg_b  sd_a  sd_b
#   <int> <dbl>  <dbl> <dbl> <dbl>
# 1     1 1.37  -0.565 0.657 0.705
# 2     2 0.363  0.633 0.458 0.719
0
moodymudskipper On

Spreading operations can also be done by unnesting a properly reformated table, here's an alternative using tidyverse :

# helper function that returns an horizontal one lined named tibble wrapped into a list
lhframe <- function(x,nms) list(setNames(as_tibble(t(x)),nms))
dat %>% group_by(id) %>%
  summarize(avg = lhframe(avg,grp),
            sd  = lhframe(sd,grp)) %>%
  unnest(.sep="_")

# # A tibble: 2 x 5
#      id      avg_a     avg_b      sd_a      sd_b
#   <int>      <dbl>     <dbl>     <dbl>     <dbl>
# 1     1 -1.7631631 0.4600974 0.7595443 0.5664884
# 2     2 -0.6399949 0.4554501 0.8496897 0.1894739

Unfortunately the following doesn't work:

dat %>% group_by(id) %>%
  summarize_at(vars(avg,sd),lhframe,grp) %>%
  unnest(.sep="_")