I want to cast my data (a data.frame) from long to wide format, with the values of "ITEM" as columns and the values of "ITEM2" as the cell contents (see below):
To do this I use the dcast function from the reshape2 package:
df <- dcast(df, SEQUENCEID + EVENTID ~ ITEM, value.var = "ITEM2")
This works fine.
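As a minimal sketch of the reshaping I mean, here is the same call on a tiny made-up data frame. The data frame `long` and all of its values are hypothetical and only stand in for my real data; only the column names match.

```r
library(reshape2)

# Made-up long-format data; the values are hypothetical, not my real records
long <- data.frame(
  SEQUENCEID = c("s1", "s1", "s2"),
  EVENTID    = c(1L, 1L, 1L),
  ITEM       = c("A", "B", "A"),
  ITEM2      = c("x", "y", "z"),
  stringsAsFactors = FALSE
)

# One row per SEQUENCEID/EVENTID pair, one column per distinct ITEM value;
# combinations that do not occur in the input are filled with NA
wide <- dcast(long, SEQUENCEID + EVENTID ~ ITEM, value.var = "ITEM2")
wide
```

Here "s2" has no item "B", so its "B" cell is NA in the wide result.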
But with 7 million records in my data frame I ran into memory limits. So I decided to convert my data.frame to an ffdf and use the ffdfdply function from the ffbase package to cast it split by split.
To make sure that every split has the same columns in the same order, I extract the values of "ITEM" in advance, add each item column that does not exist in a split filled with NA, and order all columns alphabetically.
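The alignment step on its own can be sketched in base R as follows. The helper name `align_cols` and the two small "splits" are made up for illustration and are not part of my real code.

```r
# Hypothetical item names and two made-up splits that contain different items
item <- c("A", "B", "C")
split1 <- data.frame(SEQUENCEID = "s1", EVENTID = 1L, A = "x", C = "z",
                     stringsAsFactors = FALSE)
split2 <- data.frame(SEQUENCEID = "s2", EVENTID = 1L, B = "y",
                     stringsAsFactors = FALSE)

# Add every missing item column as NA, then sort the columns by name
align_cols <- function(df, item) {
  for (i in setdiff(item, colnames(df))) {
    df[[i]] <- NA
  }
  df[, order(colnames(df)), drop = FALSE]
}

aligned1 <- align_cols(split1, item)
aligned2 <- align_cols(split2, item)

# Compare the column names of both splits after alignment
identical(colnames(aligned1), colnames(aligned2))
```

One thing I am not sure about: a column created as plain NA is of type logical, so the column types can still differ between splits even when the names and order match.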
The whole code is below:
library(reshape2)
library(ff)
library(ffbase)

# Extract items
item <- as.character(unique(lo_raw$ITEM))

# Transform to ffdf
ff_raw <- as.ffdf(lo_raw)
ff_raw$SEQUENCEID <- as.character.ff(ff_raw$SEQUENCEID)

# Function dcast: cast one split, add missing item columns as NA, sort columns
castff <- function(df, item) {
  df <- dcast(df, SEQUENCEID + EVENTID ~ ITEM, value.var = "ITEM2")
  for (i in item) {
    if (!(i %in% colnames(df))) {
      df[, i] <- NA
    }
  }
  df <- df[, order(colnames(df))]
  df
}

# Apply dcast
ff_pivot <- ffdfdply(x = ff_raw, split = ff_raw$SEQUENCEID,
                     FUN = function(df, item) castff(df, item),
                     item = item, BATCHBYTES = 1000000, trace = TRUE)
Unfortunately I get the following error when the result of the second split is appended to the first (trace output included):
2016-12-08 09:25:35, calculating split sizes
2016-12-08 09:25:37, building up split locations
2016-12-08 09:25:51, working on split 1/139, extracting data in RAM of 106 split elements, totalling, 0.00093 GB, while max specified data specified using BATCHBYTES is 0.00093 GB
2016-12-08 09:25:52, ... applying FUN to selected data
2016-12-08 09:25:55, ... appending result to the output ffdf
2016-12-08 09:26:02, working on split 2/139, extracting data in RAM of 172 split elements, totalling, 0.00093 GB, while max specified data specified using BATCHBYTES is 0.00093 GB
2016-12-08 09:26:03, ... applying FUN to selected data
2016-12-08 09:26:05, ... appending result to the output ffdf
Error in ff(vmode = "integer", length = length(x), levels = as.character(levs)) : unable to open
In addition: Warning message:
In is.na(levs) : is.na() applied to non-(list or vector) of type 'NULL'
Processing a single split with fewer records, without any appending, works fine.
Could somebody help?
Thank you.