Consider a dataset where each row is a basket of 3 fruits.
library(data.table)
baskets <- data.table(fruit_1 = c('orange', 'apple', 'apple', 'pear')
,fruit_2 = c('apple', 'pear', 'kiwi', 'kiwi')
,fruit_3 = c('pear', 'kiwi', 'blueberry', 'blueberry'))
What would be an efficient way to calculate correlations between different fruits? In other words, how often do different fruits appear together in the same basket/row? I'm trying to get the pairwise correlation for every pair of fruits (for example, "apples and pears", "apples and kiwis", etc.).
The best approach I can think of now is to make indicator variables/binary columns for each fruit and then do the correlation of those. Is there a better way than that, computationally or otherwise?
EDIT: I updated this part to show a table that looks like my desired result. The right measure is probably an "agreement/disagreement score" or something similar rather than a correlation, but you get the idea.
# one 0/1 indicator column per fruit, set by reference with :=
baskets[, apple := 0]
baskets[fruit_1 == 'apple', apple := 1]
baskets[fruit_2 == 'apple', apple := 1]
baskets[fruit_3 == 'apple', apple := 1]
baskets[, pear := 0]
baskets[fruit_1 == 'pear', pear := 1]
baskets[fruit_2 == 'pear', pear := 1]
baskets[fruit_3 == 'pear', pear := 1]
baskets[, kiwi := 0]
baskets[fruit_1 == 'kiwi', kiwi := 1]
baskets[fruit_2 == 'kiwi', kiwi := 1]
baskets[fruit_3 == 'kiwi', kiwi := 1]
# looking for a table like this, but with every combination of fruit and imagining thousands of rows
desired_result = data.frame(fruit_1 = c('apple', 'pear', 'kiwi'),
                            fruit_2 = c('pear', 'kiwi', 'apple'),
                            similarity = c(cor(baskets$apple, baskets$pear),
                                           cor(baskets$pear, baskets$kiwi),
                                           cor(baskets$kiwi, baskets$apple)
                            )
)
This feels like an okay solution, but not a great one, so I wanted to see what better options there are. data.table is strongly preferred because I'm much more comfortable with it, but I'm open to anything.
You can try cor() along with as.data.frame.table(), which turns the correlation matrix into a long table with one row per pair of fruits.
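One way this could look, as a rough sketch rather than a definitive implementation: build a 0/1 indicator matrix for every fruit in one pass, run cor() on it, and reshape the result with as.data.frame.table(). The names m, fruits, ind, cm, res and co are made up for this illustration, and the sketch assumes the baskets table contains only the three fruit columns.

```r
library(data.table)

baskets <- data.table(
  fruit_1 = c('orange', 'apple', 'apple', 'pear'),
  fruit_2 = c('apple', 'pear', 'kiwi', 'kiwi'),
  fruit_3 = c('pear', 'kiwi', 'blueberry', 'blueberry')
)

# Character matrix of basket contents (one row per basket).
# Assumption: every column of baskets is a fruit column.
m <- as.matrix(baskets)

# All distinct fruits, sorted so the column order is predictable
fruits <- sort(unique(as.vector(m)))

# 0/1 indicator matrix: one row per basket, one named column per fruit
ind <- sapply(fruits, function(f) as.integer(rowSums(m == f) > 0))

# Pairwise correlation matrix between the indicator columns
cm <- cor(ind)

# Long format: one row per ordered pair of fruits
res <- as.data.table(as.data.frame.table(cm, responseName = 'similarity'))
setnames(res, c('Var1', 'Var2'), c('fruit_1', 'fruit_2'))

# Keep each unordered pair once and drop the diagonal
res <- res[as.character(fruit_1) < as.character(fruit_2)]

# If a plain "how often do they appear together" count is more useful
# than a correlation, crossprod() of the indicator matrix gives it
co <- crossprod(ind)
```

Here res has one row per pair of fruits, and co['apple', 'pear'] is the number of baskets containing both. A fruit that appears in every basket (or in none) has zero variance, so cor() would return NA for its pairs; the count matrix does not have that problem.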