I am trying to do some market basket analysis using the arules package, but when I use the summary() function on an itemMatrix object to check which are the most frequent items, the numbers do not add up.
If I do:
library(arules)
x <- read.transactions("Supermarket2014-15.csv")
summary(x)
I get:
transactions as itemMatrix in sparse format with
5001 rows (elements/itemsets/transactions) and
997 columns (items) and a density of 0.003557162
most frequent items:
45 28 42 35 22 (Other)
503 462 444 440 413 15474
But if I check with a for loop, or even in Excel, the count for product 45 is 513 and not 503. The same goes for 28, which should be 499, and so on.
The odd thing is that if I sum up all the totals (15474+413+440+444+462+503), I get the correct number for the total of transacted products.
The data has several NA values, and the products are factors.
And here is the raw data (Day ranges from 1 to 28, Product ranges from 1 to 50):
If you look at the result of your str(x) call, you will see under @iteminfo and $labels that some items have labels like "1;1", etc. This means that the items are not correctly separated after reading the file in. The default separator in read.transactions() is white space, but you seem to have (some) semicolons there. Try sep=";" in read.transactions().
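To illustrate the effect of the separator, here is a minimal sketch using a made-up toy file. The file contents and the variable names (tmp, wrong, right) are assumptions for demonstration only; the real layout of Supermarket2014-15.csv may differ, so the arguments may need adjusting.

```r
library(arules)

# Hypothetical toy file: one basket per line, items separated by semicolons
tmp <- tempfile(fileext = ".csv")
writeLines(c("45;28;42", "45;35", "28;45"), tmp)

# With the default separator (white space), each whole line is read
# as a single item, so labels contain semicolons
wrong <- read.transactions(tmp)
itemLabels(wrong)

# With sep = ";" each line is split into individual items
right <- read.transactions(tmp, sep = ";")
itemLabels(right)

# Absolute counts per item, to compare against a manual count
itemFrequency(right, type = "absolute")
```

If the labels returned by itemLabels() still contain semicolons or other unexpected characters after changing sep, the file is probably not being parsed the way you expect.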