Chi-square tests over multiple groups and variables


I have a huge dataset with several grouping columns (factors with between 2 and 6 levels) and several dichotomous variables (coded 0/1).

Example data:

DF <- data.frame(
  group1 = sample(x = c("A", "B", "C", "D"), size = 100, replace = TRUE),
  group2 = sample(x = c("red", "blue", "green"), size = 100, replace = TRUE),
  group3 = sample(x = c("tiny", "small", "big", "huge"), size = 100, replace = TRUE),
  var1 = sample(x = 0:1, size = 100, replace = TRUE),
  var2 = sample(x = 0:1, size = 100, replace = TRUE),
  var3 = sample(x = 0:1, size = 100, replace = TRUE),
  var4 = sample(x = 0:1, size = 100, replace = TRUE),
  var5 = sample(x = 0:1, size = 100, replace = TRUE)
)

I want to run a chi-square test for every group column against every variable.

library(tidyverse)
library(rstatix)

chisq_test(DF$group1, DF$var1)
chisq_test(DF$group1, DF$var2)
chisq_test(DF$group1, DF$var3)
...
etc

I managed to make it work with two nested for loops, but I'm sure there is a better solution:

groups <- c("group1","group2","group3")
vars <- c("var1","var2","var3","var4","var5")

results <- data.frame()
for(i in groups){
  for(j in vars){
    test <- chisq_test(DF[,i], DF[,j])
    test <- mutate(test, group=i, var=j)
    results <- rbind(results, test)
  }
}
results

I think I need some kind of apply function, but I can't figure it out.


There are 2 answers

Answer by Eric

Here is one way to do it with nested apply() calls; there is probably a more elegant way with dplyr. Note that I extract only the p-value of each test here, but you could extract another element, or keep the whole test result, if you prefer.

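# Outer apply() walks the group columns (1:3), inner apply() the variable columns.
# Columns 4:8 are var1 to var5.
res <- apply(DF[, 1:3], 2, function(x) {
  apply(DF[, 4:8], 2, function(y) chisq.test(x, y)$p.value)
})
res  # matrix of p-values: one row per variable, one column per group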
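If more than the p-value is wanted, one possible variation is to loop over every group/variable pair and collect the statistic, degrees of freedom and p-value into a data frame. The following is only a minimal base-R sketch, not part of the answer above: the set.seed() call and the names pairs, rows and all_tests are assumptions made for illustration.

```r
# Example data with the same shape as the question's DF.
set.seed(1)  # assumption: seed added only to make the run repeatable
DF <- data.frame(
  group1 = sample(c("A", "B", "C", "D"), 100, replace = TRUE),
  group2 = sample(c("red", "blue", "green"), 100, replace = TRUE),
  group3 = sample(c("tiny", "small", "big", "huge"), 100, replace = TRUE),
  var1 = sample(0:1, 100, replace = TRUE),
  var2 = sample(0:1, 100, replace = TRUE),
  var3 = sample(0:1, 100, replace = TRUE),
  var4 = sample(0:1, 100, replace = TRUE),
  var5 = sample(0:1, 100, replace = TRUE)
)

groups <- c("group1", "group2", "group3")
vars <- c("var1", "var2", "var3", "var4", "var5")

# All 3 x 5 = 15 combinations of a group column and a variable column.
pairs <- expand.grid(group = groups, var = vars, stringsAsFactors = FALSE)

# Run one test per pair and keep the main parts of the result as a one-row data frame.
rows <- Map(function(g, v) {
  test <- chisq.test(DF[[g]], DF[[v]])
  data.frame(
    group = g,
    var = v,
    statistic = unname(test$statistic),
    df = unname(test$parameter),
    p.value = test$p.value
  )
}, pairs$group, pairs$var)

# Stack the one-row data frames into a single table.
all_tests <- do.call(rbind, rows)
rownames(all_tests) <- NULL
all_tests
```

Building the list first and binding it once avoids growing a data frame with rbind() inside a loop, which is the main cost of the nested for-loop version in the question.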

Answer by Kevin A

Here's a quick and easy dplyr solution. It reshapes the data into long format, keyed by group and var, and then runs the chi-square test on each combination of group and var.

DF %>%
  pivot_longer(starts_with("group"), names_to = "group", values_to = "group_val") %>%
  pivot_longer(starts_with("var"), names_to = "var", values_to = "var_val") %>%
  group_by(group, var) %>%
  summarise(chisq_test(group_val, var_val)) %>%
  ungroup()
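To see what the double pivot_longer() step produces, the sketch below builds the long table on a smaller version of the example data and checks one combination against a direct test. It is only an illustration under stated assumptions: the reduced DF, the set.seed() call and the names long, sub, p_long and p_direct are not from the answer, and base chisq.test() is used in place of rstatix so that only dplyr and tidyr are required.

```r
library(dplyr)
library(tidyr)

# Smaller example data: two group columns and two variables.
set.seed(1)  # assumption: seed added only to make the run repeatable
DF <- data.frame(
  group1 = sample(c("A", "B", "C", "D"), 100, replace = TRUE),
  group2 = sample(c("red", "blue", "green"), 100, replace = TRUE),
  var1 = sample(0:1, 100, replace = TRUE),
  var2 = sample(0:1, 100, replace = TRUE)
)

# Same reshaping as in the answer: one row per observation, group column and variable.
long <- DF %>%
  pivot_longer(starts_with("group"), names_to = "group", values_to = "group_val") %>%
  pivot_longer(starts_with("var"), names_to = "var", values_to = "var_val")

# Each original row appears once per (group, var) pair: 100 * 2 * 2 = 400 rows.
nrow(long)

# Every (group, var) combination holds all 100 observations.
count(long, group, var)

# The test on one long-format subset matches the direct call on the wide data.
sub <- filter(long, group == "group1", var == "var1")
p_long <- chisq.test(sub$group_val, sub$var_val)$p.value
p_direct <- chisq.test(DF$group1, DF$var1)$p.value
all.equal(p_long, p_direct)
```

The long table grows with the number of group columns times the number of variables, so on a very large dataset the reshaping step can use noticeably more memory than looping over column pairs.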