Two-way contingency table in R

839 views Asked by At

I have a dataframe and I want to output a two-way contingency table from two of the columns. They both have values "Too Little", "About Right" or "Too Much".

I'm typing

df %>%
  filter(!is.na(col1)) %>%
  group_by(col1) %>%
  summarise(count = n())

for both of them separately and get something like this:

col1        count
<fctr>      <int>
Too Little  19259           
About Right 9539            
Too Much    2816    

What I would like to achieve is this:

       Too Little   About Right   Too Much   Total
col1   19259        9539          2816       31614
col2   20619        9374          2262       32255
Total  39878       18913          5078       63869

I've been trying to use table function

addmargins(table(df$col1, df$col2))

But the result is not what I want

              Too Little About Right Too Much   Sum
  Too Little       13770        4424      740 18934
  About Right       4901        3706      700  9307
  Too Much          1250         800      679  2729
  Sum              19921        8930     2119 30970
2

There are 2 answers

4
lukeA On

I'd give tabulate a try, which is the foundation for table (see ?tabulate). For example given

set.seed(123)
vals <- LETTERS[1:3]
df <- as.data.frame(replicate(3, sample(vals, 5, T)))
df <- data.frame(lapply(df, "levels<-", vals))

then you could do

m <- t(sapply(df, tabulate, nbins = length(vals)))
colnames(m) <- vals
addmargins(m)
#     A B C Sum
# V1  1 1 3   5
# V2  1 3 1   5
# V3  1 2 2   5
# Sum 3 6 6  15

Or (via @thelatemail) just

addmargins(t(sapply(df, table)))
#     A B C Sum
# V1  1 1 3   5
# V2  1 3 1   5
# V3  1 2 2   5
# Sum 3 6 6  15   
2
zx8754 On

We can use table in a loop then rbind:

# Using dummy data from @lukeA's answer

addmargins(do.call(rbind, lapply(df1, table)))
#     A B C Sum
# V1  1 1 3   5
# V2  1 3 1   5
# V3  1 2 2   5
# Sum 3 6 6  15

Benchmarking

# bigger data
set.seed(123)
vals <- LETTERS[1:20]
df1 <- as.data.frame(replicate(20, sample(vals, 100000, T)))
df1 <- data.frame(lapply(df1, "levels<-", vals))


microbenchmark::microbenchmark(
  lukeA = {
    m1 <- t(sapply(df1, tabulate, nbins = length(vals)))
    colnames(m1) <- vals
    m1 <- addmargins(m1)
  },
  # as vals only used for luke's solution, keep it in.
  lukeA_1 = {
    vals <- LETTERS[1:20]
    m2 <- t(sapply(df1, tabulate, nbins = length(vals)))
    colnames(m2) <- vals
    m2 <- addmargins(m2)
  },
  thelatemail = {m3 <- addmargins(t(sapply(df1, table)))}, 
  zx8754 = {m4 <- addmargins(do.call(rbind, lapply(df1, table)))}
)
# Unit: milliseconds
#        expr       min        lq      mean    median        uq        max neval
#       lukeA  2.349969  2.371922  2.518447  2.473839  2.558653   3.363738   100
#     lukeA_1  2.351680  2.377196  2.523473  2.473839  2.542831   3.459242   100
# thelatemail 38.316506 42.054136 43.785777 42.674912 44.234193  90.287809   100
#      zx8754 38.695101 41.979728 44.933602 42.762006 44.244314 110.834292   100