R arulesSequences - Preparing data for cspade mining

I am trying to mine sequences in R using the cspade implementation in the arulesSequences package.

My data frame looks like this:

 items sequenceId eventId size
     A          1       1    1
     B          2       1    1
     C          2       2    1
     A          3       1    1

This data frame was created from an existing data set via the following code (removing unnecessary columns and creating the sequences):

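library(dplyr)            # %>%, select(), group_by(), mutate()
library(arulesSequences)  # read_baskets(), cspade()

data %>%
  select(seqId, sequence, items) %>%
  group_by(seqId) %>%
  # re-index the events within each sequence
  mutate(basketSize = 1, sequence = rank(sequence)) %>%
  ungroup() %>%
  mutate(seqId = ordered(seqId), sequence = ordered(sequence)) %>%
  write.table("data.txt", sep = " ", row.names = FALSE, col.names = FALSE, quote = FALSE)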


data <- read_baskets("data.txt", info = c("sequenceID", "eventID", "size"))

as(data, "data.frame") #shows the data frame above!

So far so good!

However when I try:

 s1 <- cspade(data, parameter = list(support = 0.4), control = list(verbose = TRUE))

I get the following error:

 Error in makebin(data, file) : 'eid' invalid (strict order)

I have read elsewhere that this happens because cspade needs the sequence and event IDs to be ordered. But how do I specify this? Clearly, converting the columns to ordered factors before exporting them to ".txt" does not work.
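A minimal sketch of what "ordered" could mean here, assuming the error refers to the order of the rows in the file rather than to R's ordered factors: sort the rows by sequence ID and then by event ID before writing, and check that the event IDs strictly increase within each sequence. The toy data frame and the helper `is_strictly_ordered` are made up for illustration and are not part of arulesSequences. The column order follows what `read_baskets(..., info = ...)` expects, with the info columns first and the items last.

```r
# Hypothetical toy data with the same columns as in the question,
# deliberately out of order
df <- data.frame(
  sequenceID = c(2L, 1L, 2L, 3L),
  eventID    = c(2L, 1L, 1L, 1L),
  size       = 1L,
  items      = c("C", "A", "B", "A"),
  stringsAsFactors = FALSE
)

# Sort rows by sequence first, then by event within each sequence
df <- df[order(df$sequenceID, df$eventID), ]

# Illustrative helper (not an arulesSequences function): TRUE when sequence
# IDs never decrease and event IDs strictly increase within each sequence
is_strictly_ordered <- function(sid, eid) {
  seq_ok <- !is.unsorted(sid)
  eid_ok <- all(tapply(eid, sid, function(e) all(diff(e) > 0)))
  seq_ok && eid_ok
}

stopifnot(is_strictly_ordered(df$sequenceID, df$eventID))

# Write the sorted rows to a temporary file: info columns first, items last
path <- tempfile(fileext = ".txt")
write.table(df, path, sep = " ", row.names = FALSE,
            col.names = FALSE, quote = FALSE)
```

The resulting file could then be read with `read_baskets(path, info = c("sequenceID", "eventID", "size"))` as in the code above. Whether this resolves the error for the real data is untested here.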

Edit: Some further details. Just to explain the code that creates the cspade input a bit more: originally the sequence variable had some missing steps (e.g. 1, 3, 4 for some sequences) because I had filtered out some events. I therefore ran rank() on it to re-index the events within each sequence. The size column is totally unnecessary (it is constant), but it was included in the sample code in the arulesSequences documentation, which is why I included it too.
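The re-indexing step can be sketched in base R as below. It assumes integer step numbers with gaps; the vectors `seq_id` and `step` are hypothetical. One thing to keep in mind is that `rank()` uses `ties.method = "average"` by default, so repeated step values within a sequence give fractional ranks, whereas a dense re-index via `match()` always gives whole numbers.

```r
# Hypothetical step numbers with gaps left over from filtering
seq_id <- c(1L, 1L, 1L, 2L, 2L)
step   <- c(1L, 3L, 4L, 2L, 5L)

# Dense re-index within each sequence: the smallest step becomes 1,
# the next distinct step becomes 2, and so on
event_id <- ave(step, seq_id, FUN = function(e) match(e, sort(unique(e))))
event_id    # 1 2 3 1 2

# With tied values, rank() averages the tied positions
tied_rank  <- rank(c(1, 3, 3))                              # 1.0 2.5 2.5
tied_dense <- match(c(1, 3, 3), sort(unique(c(1, 3, 3))))   # 1 2 2
```

Even with the dense version, tied steps end up with the same event ID, so a strict-order requirement would presumably still fail for them unless such rows are merged into a single event.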
