Chi-square test in R with unequal sample sizes

A version of this question has been asked a few times but never in the simplest way. Basically, the stats::chisq.test function doesn't work when the sample sizes between the two groups are uneven, despite the fact that chi-square tests are supposed to work with unequal sample sizes, from what I understand.

Here is some test data:

df1 <- data.frame("x" = c("Yes","No","Yes","No","Yes","No","Yes","No","Yes","No","Yes","No","Yes","No","Yes","No"))
df2 <- data.frame("x" = c("Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","No","Yes","No","Yes","Yes","Yes","No"))

My goal is to see whether the outcome x ("Yes" or "No") differs between two groups of unequal sample size. But when I run the following code:

chisq.test(table(df1$x,df2$x))

I get the following error:

Error in table(df1$x, df2$x) : all arguments must have the same length

Is there a simple fix for this besides creating a new dataframe that has equal sample sizes by adding NAs to the shorter df? Why does this error even exist if chi-square tests can run with unequal sample sizes in the groups being compared?

There are 2 answers

Bradley Allf On BEST ANSWER

Ok, so this is a pretty elementary statistical issue but it took a lot of effort for me to figure this out and I think other people might get similarly confused about some of this. This is also quite a fraught issue because it can impact how you interpret your data (the p-values are wrong if you set this up incorrectly!). So it's important to wrap your head around.

Imagine you have a dataset like this:

df <- data.frame(group1 = c(rep("hot",9),"cold"),
                 group2 = c(rep("hot",5),rep("cold",5)))
> df
   group1 group2
1     hot    hot
2     hot    hot
3     hot    hot
4     hot    hot
5     hot    hot
6     hot   cold
7     hot   cold
8     hot   cold
9     hot   cold
10   cold   cold

You're interested in whether being in group1 versus group2 is associated with being hot or cold. If you're like me, you might assume you can do a chi-square test comparing the two groups with:

m <- chisq.test(df$group1, df$group2)
m

Resulting in:

    Pearson's Chi-squared test with Yates' continuity correction

data:  df$group1 and df$group2
X-squared = 0, df = 1, p-value = 1

Those statistics are correct for the table R built, but that table does not answer your question. The reason is the structure of your data. Because group1 and group2 are passed as two vectors, R pairs them row by row and cross-tabulates the pairs: it counts rows that are hot in both columns, rows that are hot in group1 and cold in group2, and so on. In other words, it tests whether a row's group1 value is associated with the same row's group2 value, an analysis that doesn't make sense given your question. You can see this by inspecting the observed frequency table that the chi-square test is based on:

m$observed
         df$group2
df$group1 cold hot
     cold    1   0
     hot     4   5
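
The pairing behaviour is easier to see on a tiny made-up example. This is only a sketch: the vectors `a` and `b` and the names `paired` and `msg` are invented for illustration and are not part of the question's data.

```r
# Two-argument table() cross-tabulates element i of the first vector
# with element i of the second, so it needs exactly one pair per position.
a <- c("hot", "hot", "cold")
b <- c("hot", "cold", "cold")
paired <- table(a, b)
paired

# With unequal lengths there is no way to form the pairs, so table()
# stops with an error. tryCatch() captures the message instead of halting.
msg <- tryCatch(
  table(c("Yes", "No", "Yes"), c("Yes", "No")),
  error = function(e) conditionMessage(e)
)
msg
```

So the "same length" error is not really about chi-square tests at all. It comes from table() being asked to pair up observations that were never paired in the first place.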

To answer the question you're actually interested in ("is there an association between group and temperature"), you need to change the structure of the data you are calling in the chi-square function:

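library(dplyr)  # provides %>% and arrange()
library(tidyr)  # provides pivot_longer()

# Stack the two group columns into one "group" column and one "temperature" column
df2 <- df %>% 
  pivot_longer(cols = c("group1","group2"),
              names_to = "group",
              values_to = "temperature") %>% 
  arrange(group)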
df2
# A tibble: 20 × 2
   group  temperature
   <chr>  <chr>      
 1 group1 hot        
 2 group1 hot        
 3 group1 hot        
 4 group1 hot        
 5 group1 hot        
 6 group1 hot        
 7 group1 hot        
 8 group1 hot        
 9 group1 hot        
10 group1 cold       
11 group2 hot        
12 group2 hot        
13 group2 hot        
14 group2 hot        
15 group2 hot        
16 group2 cold       
17 group2 cold       
18 group2 cold       
19 group2 cold       
20 group2 cold      

Now we can call the chi-square function correctly, and we see that the observed frequencies are what we expected:

> p <- chisq.test(df2$temperature, df2$group)
> p

    Pearson's Chi-squared test with Yates' continuity correction

data:  df2$temperature and df2$group
X-squared = 2.1429, df = 1, p-value = 0.1432

> p$observed
               df2$group
df2$temperature group1 group2
           cold      1      5
           hot       9      5

Of course, you don't actually have to reformat your data like this to do the chi-square test. Instead, you can use the helpful code from the other answer to build a frequency table of category counts per group and pass that table to chisq.test. But for me at least it was helpful to write all this out to see what you're actually testing. I think in general, if you're running chi-square tests and R is throwing errors about arguments of different lengths, you have probably set up your chi-square call incorrectly.
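
For the hot/cold example, the frequency-table route might look like the sketch below. It uses base R only. The names `lv`, `counts` and `res` are my own, and fixing the factor levels is an extra precaution I am assuming is wanted, so that a category missing from one group is counted as 0 rather than dropped.

```r
df <- data.frame(group1 = c(rep("hot", 9), "cold"),
                 group2 = c(rep("hot", 5), rep("cold", 5)))

# Count each temperature separately within each column.
# sapply() binds the per-column counts into a 2 x 2 matrix
# (rows: cold, hot; columns: group1, group2).
lv <- c("cold", "hot")
counts <- sapply(df, function(col) table(factor(col, levels = lv)))
counts

# Passing a matrix of counts makes chisq.test() treat it as a contingency table.
# Expected counts for "cold" are below 5 here, so R may warn that the
# chi-squared approximation is questionable.
res <- chisq.test(counts)
res
```

This should reproduce the observed table and the statistic from the long-format version (X-squared = 2.1429), since both calls end up testing the same 2 x 2 table.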

Ric On
df1 <- data.frame("x" = c("Yes","No","Yes","No","Yes","No","Yes","No","Yes","No","Yes","No","Yes","No","Yes","No"))
df2 <- data.frame("x" = c("Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","No","Yes","No","Yes","Yes","Yes","No"))

m <- cbind(table(df1), table(df2))
m
#>     [,1] [,2]
#> No     8    3
#> Yes    8   12
chisq.test(m)
#> 
#>  Pearson's Chi-squared test with Yates' continuity correction
#> 
#> data:  m
#> X-squared = 1.8742, df = 1, p-value = 0.171
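
A variant of the same idea, as a sketch rather than the only way to do it: stack the two samples into one long data frame with a group label, then cross-tabulate outcome by group. The names `long`, `tab` and `res` and the labels "df1"/"df2" are made up for illustration. The data are the same as in the question, written more compactly with rep().

```r
df1 <- data.frame(x = rep(c("Yes", "No"), 8))                # 16 observations
df2 <- data.frame(x = c(rep("Yes", 8), "No", "Yes", "No",
                        "Yes", "Yes", "Yes", "No"))          # 15 observations

# One row per observation, with a column recording which sample it came from.
# The two samples can have any lengths because they are stacked, not paired.
long <- rbind(data.frame(group = "df1", x = df1$x),
              data.frame(group = "df2", x = df2$x))

# Outcome by group: rows No/Yes, columns df1/df2
tab <- table(long$x, long$group)
tab

res <- chisq.test(tab)
res
```

The counts and the test result should match the cbind() version above, with the added benefit that the table columns carry the group names.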