A version of this question has been asked a few times, but never in the simplest way. Basically, the stats::chisq.test function doesn't work when the sample sizes of the two groups are unequal, even though, from what I understand, chi-square tests are supposed to work with unequal sample sizes.
Here is some test data:
df1 <- data.frame("x" = c("Yes","No","Yes","No","Yes","No","Yes","No","Yes","No","Yes","No","Yes","No","Yes","No"))
df2 <- data.frame("x" = c("Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","No","Yes","No","Yes","Yes","Yes","No"))
My goal is to see whether there is a difference in the outcome x (i.e., whether the outcome is "Yes" or "No") between two groups of unequal sample size. But when I run the following code:
chisq.test(table(df1$x,df2$x))
I get the following error:
Error in table(df1$x, df2$x) : all arguments must have the same length
Is there a simple fix for this besides creating a new dataframe that has equal sample sizes by adding NAs to the shorter df? Why does this error even exist if chi-square tests can run with unequal sample sizes in the groups being compared?
OK, so this is a pretty elementary statistical issue, but it took me a lot of effort to figure out, and I think other people might get similarly confused. It's also quite a fraught issue because it affects how you interpret your data (the p-values are wrong if you set this up incorrectly!), so it's important to wrap your head around it.
Imagine you have a dataset like this:
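The original table isn't reproduced here, so as a stand-in, here is a small made-up dataset in the same spirit. The data frame name `df`, the column layout, and all the counts are my own assumptions, purely for illustration:

```r
# Hypothetical stand-in data: one column per group, 20 observations each.
# All counts are invented for illustration.
df <- data.frame(
  group1 = rep(c("hot", "cold"), times = c(16, 4)),
  group2 = rep(c("hot", "cold"), times = c(6, 14))
)

head(df)
```

Note that putting the two groups side by side as columns forces them to have the same number of rows, which is exactly the constraint the question runs into.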
You're interested in whether being in group1 or group2 is associated with being hot or cold. If you're like me, you might assume you can do a chi-square test comparing the two groups with:
Resulting in:
Those statistics are obviously incorrect. The reason is the structure of your data. Rather than comparing the proportions in group1 to the proportions in group2, R cross-tabulates the two columns row by row: it counts how many rows are hot in both group1 and group2, how many are hot in group1 but cold in group2, and so on. That analysis doesn't make sense given your question, because the rows of the two columns aren't paired observations. You can see this by inspecting the observed frequency table that the chi-square test bases its analysis on:
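To make the problem concrete, here is a sketch using a made-up wide dataset (the name `df` and every count are my own assumptions, not the original data). It shows the table that the mistaken call actually hands to the test:

```r
# Hypothetical stand-in data: one column per group, 20 observations each.
df <- data.frame(
  group1 = rep(c("hot", "cold"), times = c(16, 4)),
  group2 = rep(c("hot", "cold"), times = c(6, 14))
)

# Cross-tabulating the two columns pairs them up row by row.
wrong_tab <- table(group1 = df$group1, group2 = df$group2)

# chisq.test() warns here because some expected counts are small;
# the point is only to look at the table it is working from.
suppressWarnings(chisq.test(wrong_tab))$observed
```

The cells of this table add up to 20, not 40: each row of `df` has been counted once as a (group1, group2) pair, instead of each observation being counted once within its own group.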
To answer the question you're actually interested in ("is there an association between group and temperature?"), you need to change the structure of the data you pass to the chi-square function:
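One way to do that restructuring, sketched with made-up data (the names `df`, `long`, `group`, and `temp` are assumptions for illustration), is to stack the columns into a long format with one row per observation:

```r
# Hypothetical stand-in data in the original wide layout.
df <- data.frame(
  group1 = rep(c("hot", "cold"), times = c(16, 4)),
  group2 = rep(c("hot", "cold"), times = c(6, 14))
)

# Long format: one row per observation, with a group label and an outcome.
long <- data.frame(
  group = rep(c("group1", "group2"), each = nrow(df)),
  temp  = c(as.character(df$group1), as.character(df$group2))
)

head(long)
```

In this layout the two groups no longer need the same number of observations, because each group simply contributes as many rows as it has.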
Now we can call the chi-square function correctly, and we see that the observed frequencies are what we expected:
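As a sketch with made-up data (all names and counts here are assumptions, not the original dataset), the corrected call cross-tabulates group against temperature, so the observed table has one row per group:

```r
# Hypothetical stand-in data, already in long format:
# 20 observations per group, counts invented for illustration.
long <- data.frame(
  group = rep(c("group1", "group2"), each = 20),
  temp  = c(rep(c("hot", "cold"), times = c(16, 4)),
            rep(c("hot", "cold"), times = c(6, 14)))
)

# Rows are groups, columns are outcomes.
right_tab <- table(long$group, long$temp)

res <- chisq.test(right_tab)
res$observed  # per-group counts of cold and hot
res$p.value
```

Here every one of the 40 observations lands in exactly one cell, which is what a test of association between group and temperature needs.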
Of course, you don't actually have to reformat your data like this to do the chi-square test. Instead, you can use the helpful code from the other answers above to create a frequency table that has the values you're interested in. But for me at least, it was helpful to write all this out to see what you're actually testing. In general, if you're running chi-square tests and R throws an error such as "all arguments must have the same length", you have probably set up your chi-square call incorrectly.
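Applied to the data in the question, a minimal sketch of that approach could look like this. The names `combined` and `tab` are mine, and this is just one way to build the table, not necessarily what the other answers do:

```r
# Data from the question: 16 observations in df1, 15 in df2.
df1 <- data.frame("x" = c("Yes","No","Yes","No","Yes","No","Yes","No","Yes","No","Yes","No","Yes","No","Yes","No"))
df2 <- data.frame("x" = c("Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","No","Yes","No","Yes","Yes","Yes","No"))

# Stack the two groups into one long data frame with a group label.
combined <- rbind(
  data.frame(group = "df1", x = df1$x),
  data.frame(group = "df2", x = df2$x)
)

# 2 x 2 frequency table: rows are groups, columns are outcomes.
tab <- table(combined$group, combined$x)
tab

chisq.test(tab)
```

The unequal group sizes are no problem here, since `table()` receives two vectors of the same length (31 group labels and 31 outcomes) rather than one vector per group.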