Is there a function in dplyr/forcats to display count and percentages from a dataframe of dichotomous variables?


I frequently get stuck when I want to summarise categorical variables in my dataset. My dataset contains dichotomous variables (yes/no) per patient. In the example below, "A" to "C" are risk factors that the person does or does not have.

A <- c("yes", "no", "yes", "no", "yes")
B <- c("no", "no", "yes", "yes", "no")
C <- c("yes", "no", "yes", "no", "yes")

df <- data.frame(A, B, C)

What I am trying to do is summarise all variables as factor-level counts and percentages, with one line of code. I tried using apply, forcats and dplyr but can't get it right. Can anyone help me :)

I am hoping to get:

A : Yes 3 | %

No 2 | %

B: ..

C..

The ultimate goal is to make a big summary table of baseline characteristics of a study population with both continuous and categorical variables. I will probably try to use CBCgrps or tableone.

Thank you!


There are 3 answers

lotus On BEST ANSWER

You can use forcats::fct_count():

library(purrr)
library(forcats)

map_df(df, fct_count, prop = TRUE, .id = "var")

# A tibble: 6 x 4
  var   f         n     p
  <chr> <fct> <int> <dbl>
1 A     no        2   0.4
2 A     yes       3   0.6
3 B     no        3   0.6
4 B     yes       2   0.4
5 C     no        2   0.4
6 C     yes       3   0.6
Edo On

With Base R there is a pretty simple solution:

lapply(df, function(x) {
  # counts per level of the variable
  tb <- table(x)
  # combine counts and proportions into one data frame
  as.data.frame(cbind(n = tb, perc = tb / sum(tb)))
})
#> $A
#>     n perc
#> no  2  0.4
#> yes 3  0.6
#> 
#> $B
#>     n perc
#> no  3  0.6
#> yes 2  0.4
#> 
#> $C
#>     n perc
#> no  2  0.4
#> yes 3  0.6
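
If a single table is easier to work with than a list, the per-variable results can be stacked. This is only a minimal sketch of that idea in base R: the helper name `summarise_var` and the column names `var`, `level`, `n` and `perc` are made up for illustration and are not part of the answer above.

```r
df <- data.frame(
  A = c("yes", "no", "yes", "no", "yes"),
  B = c("no", "no", "yes", "yes", "no"),
  C = c("yes", "no", "yes", "no", "yes")
)

# Hypothetical helper: counts and proportions for one variable
summarise_var <- function(x) {
  tb <- table(x)
  data.frame(
    level = names(tb),
    n     = as.vector(tb),
    perc  = as.vector(prop.table(tb))
  )
}

# Apply the helper to every column, label each piece with the
# variable name, and stack the pieces into one data frame
res <- do.call(
  rbind,
  lapply(names(df), function(v) cbind(var = v, summarise_var(df[[v]])))
)
res
```

`prop.table(tb)` is equivalent to `tb / sum(tb)` here; multiply `perc` by 100 if percentages are preferred over proportions.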
Lsax On

I wonder if this tidyverse solution suits you. Pivot to long format, then group by "group" and "answer". summarise() counts the cases within each combination of "group" and "answer" and then drops the last grouping level ("answer"), so the first percentage is calculated within each of the groups A, B and C. Ungrouping removes the remaining grouping by "group", so the second percentage is calculated over all rows.

library(tidyverse)
A <- c("yes", "no", "yes", "no", "yes")
B <- c("no", "no", "yes", "yes", "no")
C <- c("yes", "no", "yes", "no", "yes")

df <- data.frame(A, B, C)
df %>%
  pivot_longer(cols = everything(), names_to = "group", values_to = "answer") %>%
  group_by(group, answer) %>%
  summarise(n = n()) %>%
  mutate(percent_by_group = scales::percent(n / sum(n))) %>% 
  ungroup() %>% 
  mutate(percent_overall=scales::percent(n / sum(n)))

This is the result:

# A tibble: 6 x 5
  group answer     n percent_by_group percent_overall
  <chr> <chr>  <int> <chr>            <chr>          
1 A     no         2 40%              13.3%          
2 A     yes        3 60%              20.0%          
3 B     no         3 60%              20.0%          
4 B     yes        2 40%              13.3%          
5 C     no         2 40%              13.3%          
6 C     yes        3 60%              20.0%
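
The grouping behaviour described above can be made explicit. The following is a minimal sketch of the same pipeline, assuming dplyr 1.0.0 or later (where `summarise()` accepts the `.groups` argument). It keeps the proportions numeric instead of formatting them with `scales::percent()`, and the column names `prop_by_group` and `prop_overall` are chosen here for illustration.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  A = c("yes", "no", "yes", "no", "yes"),
  B = c("no", "no", "yes", "yes", "no"),
  C = c("yes", "no", "yes", "no", "yes")
)

res <- df %>%
  pivot_longer(cols = everything(), names_to = "group", values_to = "answer") %>%
  group_by(group, answer) %>%
  # "drop_last" removes `answer` from the grouping, so the result
  # is still grouped by `group` after summarise()
  summarise(n = n(), .groups = "drop_last") %>%
  # sum(n) is taken within each group (A, B, C)
  mutate(prop_by_group = n / sum(n)) %>%
  ungroup() %>%
  # no grouping left, so sum(n) is taken over all rows
  mutate(prop_overall = n / sum(n))

res
```

Keeping the proportions numeric makes them easier to reuse in later calculations; they can be formatted for display as a final step.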