Create 2x2 Conditional Table from Counting Instances in Dataframe

181 views Asked by At

I have a dataframe that is broken down by the gender of an author, their role in a project, and an identifier (PMID) (see below).

What I need is to create a 2x2 contingency table so I can calculate an odds ratio for the association between being a female first author and having a female senior author. To get that, I need to calculate the following:

  • A: Number of times first author is female AND senior author is female
  • B: Number of times first author is female AND senior author is male
  • C: Number of times first author is male AND senior author is male
  • D: Number of times first author is male AND senior author is female (and obviously drop cases where there is only a senior or only a first author per PMID)

I have grouped the table (see below) by PMID, so I really just need to figure out how to count each instance of the above. Having a hard time, would greatly appreciate any help!


# A tibble: 178,056 x 3
# Groups:   pmid [101,907]
    gender authorship pmid    
    <chr>  <chr>      <chr>   
  1 male   First      18958667
  2 male   Senior     18958667
  3 male   First      18958651
  4 male   First      18751818
  5 male   Senior     18751818
  6 male   First      18751811
  7 male   Senior     18751811
  8 female First      18751810
  9 female Senior     18751810
 10 male   First      18088800
 11 male   Senior     18088800
 12 male   First      17710072
 13 female First      17977216
 14 male   Senior     17762065
 15 male   First      17611457
 16 male   First      17611433
 17 male   First      17532688
 18 male   Senior     17532688
 19 female First      17405310
 20 male   Senior     17386862
 21 female First      17319096
 22 male   Senior     17319096
 23 female First      17300028
 24 male   First      17282480
 25 female First      17177771
 26 male   First      17124681
 27 female First      17093906
 28 female First      17042011
 29 male   Senior     17042011
 30 female First      17042010
 31 male   Senior     17042010
 32 female First      17042006
 33 male   Senior     17042006
 34 female First      17042003
 35 female First      17042002
 36 male   Senior     17042002
 37 male   First      17042001
 38 female First      17041999
 39 male   Senior     17041997
 40 female First      17041995
 41 female First      17041994
 42 female First      17041993
 43 female Senior     17041993
 44 female First      17041992
 45 female Senior     17041992
 46 female First      17041991
 47 male   First      17041990
 48 male   Senior     17041990
 49 male   First      17041989
 50 male   Senior     17041989

2

There are 2 answers

0
Baraliuh On BEST ANSWER

Tidy solution with golf-coding:

library(tidyr)
contig_table <- mydf %>% 
  spread(authorship, gender) %>% 
  #Only need drop_na() if data is incomplete
  #Given that you have a lot more rows I assume this will not be needed
  drop_na() %$% 
  table(First, Senior)
0
Chuck P On

Here's a tidy solution by pivot_wider

library(dplyr)
library(tidyr)

newdf <- 
   mydf %>% 
   group_by(pmid) %>% 
   pivot_wider(names_from = authorship, 
               values_from = gender)

table(newdf$First, newdf$Senior)
#>         
#>          female male
#>   female      3    5
#>   male        0    7

chisq.test(table(newdf$First, newdf$Senior))
#> Warning in chisq.test(table(newdf$First, newdf$Senior)): Chi-squared
#> approximation may be incorrect
#> 
#>  Pearson's Chi-squared test with Yates' continuity correction
#> 
#> data:  table(newdf$First, newdf$Senior)
#> X-squared = 1.356, df = 1, p-value = 0.2442

newdf %>% 
   filter(!is.na(First) & !is.na(Senior))
#> # A tibble: 15 x 3
#> # Groups:   pmid [15]
#>    pmid     First  Senior
#>    <chr>    <chr>  <chr> 
#>  1 18958667 male   male  
#>  2 18751818 male   male  
#>  3 18751811 male   male  
#>  4 18751810 female female
#>  5 18088800 male   male  
#>  6 17532688 male   male  
#>  7 17319096 female male  
#>  8 17042011 female male  
#>  9 17042010 female male  
#> 10 17042006 female male  
#> 11 17042002 female male  
#> 12 17041993 female female
#> 13 17041992 female female
#> 14 17041990 male   male  
#> 15 17041989 male   male

table(newdf$First, newdf$Senior)
#>         
#>          female male
#>   female      3    5
#>   male        0    7

chisq.test(table(newdf$First, newdf$Senior))
#> Warning in chisq.test(table(newdf$First, newdf$Senior)): Chi-squared
#> approximation may be incorrect
#> 
#>  Pearson's Chi-squared test with Yates' continuity correction
#> 
#> data:  table(newdf$First, newdf$Senior)
#> X-squared = 1.356, df = 1, p-value = 0.2442

Your data

mydf <- tibble(
gender = c("male", "male", "male", "male", "male", 
                          "male", "male", "female", "female", "male", "male", "male", "female", 
                          "male", "male", "male", "male", "male", "female", "male", "female", 
                          "male", "female", "male", "female", "male", "female", "female", 
                          "male", "female", "male", "female", "male", "female", "female", 
                          "male", "male", "female", "male", "female", "female", "female", 
                          "female", "female", "female", "female", "male", "male", "male", 
                          "male"), 
authorship = c("First", "Senior", "First", "First", 
                                                  "Senior", "First", "Senior", "First", "Senior", "First", "Senior", 
                                                  "First", "First", "Senior", "First", "First", "First", "Senior", 
                                                  "First", "Senior", "First", "Senior", "First", "First", "First", 
                                                  "First", "First", "First", "Senior", "First", "Senior", "First", 
                                                  "Senior", "First", "First", "Senior", "First", "First", "Senior", 
                                                  "First", "First", "First", "Senior", "First", "Senior", "First", 
                                                  "First", "Senior", "First", "Senior"), 
pmid = c("18958667", "18958667", 
                                                                                                  "18958651", "18751818", "18751818", "18751811", "18751811", "18751810", 
                                                                                                  "18751810", "18088800", "18088800", "17710072", "17977216", "17762065", 
                                                                                                  "17611457", "17611433", "17532688", "17532688", "17405310", "17386862", 
                                                                                                  "17319096", "17319096", "17300028", "17282480", "17177771", "17124681", 
                                                                                                  "17093906", "17042011", "17042011", "17042010", "17042010", "17042006", 
                                                                                                  "17042006", "17042003", "17042002", "17042002", "17042001", "17041999", 
                                                                                                  "17041997", "17041995", "17041994", "17041993", "17041993", "17041992", 
                                                                                                  "17041992", "17041991", "17041990", "17041990", "17041989", "17041989")
                                                  )