Run multiple Mann Whitney U's using group_by in R

204 views Asked by At

I am trying to conduct several Mann Whitney U's which compare the impacts of population on offspring sex ratio skews. I'm using R studio. The dataset looks like:

data <- data.frame(
  DamID = 1:50,
  FemaleOffspring = sample(1:10, 50, replace = TRUE),
  MaleOffspring = sample(1:10, 50, replace = TRUE),
  SexRatio = runif(50, min = 0, max = 1),
  BirthPop = sample(c('A', 'B'), 50, replace = TRUE),
  Species = sample(c('R','X', 'Y', 'Z'), 50, replace = TRUE)
)

I've written the following line of code

library(dplyr)
sumstats <- data %>% 
  group_by(Species, BirthPop) %>%
  summarize(median=median(SexRatio),
            IQR=IQR(SexRatio),
            Min=min(SexRatio),
            Max=max(SexRatio),
            n=n(),
            wilcox_p = wilcox.test(SexRatio ~ factor(BirthPop), data = ., alternative = "two.sided")$p.value

Which gives me one p value for the entire dataset when I need a different p value for each species. Not sure what to do about this. Thanks in advance!

3

There are 3 answers

0
r2evans On

Two problems:

  1. Use cur_data(). When you use ., the call to wilcox.test() see all of the data, and it does not honor the grouping that group_by has imposed.

  2. When you group by BirthPop, then each call to wilcox.test gets only "A" or only "B", but it needs to see both to be able to perform the test.

I suggest do two levels of stats, first on both Species and BirthPop (to get the majority of your statistics), and then once on just Species for your Wilcox tests.

stats1 <- data %>%
  group_by(Species, BirthPop) %>%
  summarize(
    median = median(SexRatio),
    IQR = IQR(SexRatio),
    Min = min(SexRatio),
    Max = max(SexRatio),
    n = n()
  ) %>%
  ungroup()
stats1
# # A tibble: 8 × 7
#   Species BirthPop median   IQR    Min   Max     n
#   <chr>   <chr>     <dbl> <dbl>  <dbl> <dbl> <int>
# 1 R       A         0.616 0.562 0.0405 0.710     5
# 2 R       B         0.286 0.357 0.0907 0.740     6
# 3 X       A         0.711 0.713 0.0554 0.966     6
# 4 X       B         0.560 0.204 0.0411 0.980    13
# 5 Y       A         0.471 0.295 0.108  0.815     7
# 6 Y       B         0.425 0.129 0.201  0.686     6
# 7 Z       A         0.365 0.482 0.280  0.910     5
# 8 Z       B         0.452 0.378 0.0737 0.830     2

stats2 <- data %>%
  group_by(Species) %>%
  summarize(
    wilcox_p = wilcox.test(SexRatio ~ factor(BirthPop), data = cur_data(), 
                           alternative = "two.sided")$p.value
  ) %>%
  ungroup()
stats2
# # A tibble: 4 × 2
#   Species wilcox_p
#   <chr>      <dbl>
# 1 R          0.931
# 2 X          0.831
# 3 Y          0.534
# 4 Z          0.857

We can easily bring these back together with a merge/join operation:

full_join(stats1, stats2, by = "Species")
# # A tibble: 8 × 8
#   Species BirthPop median   IQR    Min   Max     n wilcox_p
#   <chr>   <chr>     <dbl> <dbl>  <dbl> <dbl> <int>    <dbl>
# 1 R       A         0.616 0.562 0.0405 0.710     5    0.931
# 2 R       B         0.286 0.357 0.0907 0.740     6    0.931
# 3 X       A         0.711 0.713 0.0554 0.966     6    0.831
# 4 X       B         0.560 0.204 0.0411 0.980    13    0.831
# 5 Y       A         0.471 0.295 0.108  0.815     7    0.534
# 6 Y       B         0.425 0.129 0.201  0.686     6    0.534
# 7 Z       A         0.365 0.482 0.280  0.910     5    0.857
# 8 Z       B         0.452 0.378 0.0737 0.830     2    0.857
0
Michael M On

Base R is so much easier here and won't deprecate so soon as cur_data().

wilc_p <- function(X) {
  wilcox.test(SexRatio ~ factor(BirthPop), data = X)$p.value
}

c(by(data, data$Species, FUN = wilc_p))

Output:

        R         X         Y         Z 
0.5555556 0.5737374 0.9497169 0.6484848 
0
Onyambu On

I once wrote a function multiple_tests as this was a common phenomena on my end. You can find the function on github. Though be cautious of the multiplicity effect as the independent testing increases the familywise error-rate

remotes::install_github('oonyambu/SLR')
SLR::multiple_tests(SexRatio~BirthPop|Species, data,FUN = wilcox.test)

  Species response grp statistic   p.value null.value alternative                       method
1       R SexRatio A:B        18 0.4908425          0   two.sided Wilcoxon rank sum exact test
2       Z SexRatio A:B        11 0.1806527          0   two.sided Wilcoxon rank sum exact test
3       Y SexRatio A:B        28 0.3154166          0   two.sided Wilcoxon rank sum exact test
4       X SexRatio A:B         6 0.2000000          0   two.sided Wilcoxon rank sum exact test

data used:

set.seed(5)
data <- data.frame(
  DamID = 1:50,
  FemaleOffspring = sample(1:10, 50, replace = TRUE),
  MaleOffspring = sample(1:10, 50, replace = TRUE),
  SexRatio = runif(50, min = 0, max = 1),
  BirthPop = sample(c('A', 'B'), 50, replace = TRUE),
  Species = sample(c('R','X', 'Y', 'Z'), 50, replace = TRUE)
)