I want to cast my data (a data.frame) from long to wide format, with the values of "ITEM" as column names and the values of "ITEM2" as cell values (see below):
Long format:
Wide format:
Therefore I use the dcast function from the reshape2 package:
df <- dcast(df, SEQUENCEID + EVENTID ~ ITEM, value.var="ITEM2")
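To illustrate what this call produces, here is a minimal sketch on a made-up three-row data frame. The `toy` data and the name `wide` are hypothetical and only reuse my column names; the sketch assumes reshape2 is installed.

```r
library(reshape2)

# Hypothetical toy data using the same column names as my real data
toy <- data.frame(
  SEQUENCEID = c(1, 1, 1),
  EVENTID    = c(10, 10, 11),
  ITEM       = c("cakes", "coffee", "coffee"),
  ITEM2      = c("cakes", "coffee", "coffee"),
  stringsAsFactors = FALSE
)

# One row per SEQUENCEID/EVENTID pair, one column per distinct ITEM.
# Combinations that do not occur in the long data are filled with NA.
wide <- dcast(toy, SEQUENCEID + EVENTID ~ ITEM, value.var = "ITEM2")
print(wide)
```

Event 11 has no "cakes" row in the long data, so its "cakes" cell in `wide` is NA.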
This works fine on small data. But with 7 million records in my data frame I ran into memory limits, so I decided to use ddply from the plyr package and cast the data split by split.
To make sure that every split has the same columns in the same order, I extract the values of "ITEM" in advance, append any column that does not exist in a split (filled with NA), and order all item columns alphabetically.
The whole code is below:
library(reshape2)  # dcast
library(plyr)      # ddply

#Example data
lo_raw <- data.frame(SEQUENCEID=rep(1546842, 10),
                     EVENTID=c(5468503146,5468503146,5468503146,5468503147,5468503147,5468503148,5468503148,5468503148,5468503148,5468503148),
                     ITEM =c("cakes","limonade","coffee","coffee","juice","limonade","cakes","water","fruits","vegetable"),
                     ITEM2=c("cakes","limonade","coffee","coffee","juice","limonade","cakes","water","fruits","vegetable"),
                     SPLIT=rep(1547000, 10))

#Extract items
item <- as.character(unique(lo_raw$ITEM))

#Function dcast
castff <- function(df, item){
  df <- dcast(df, SEQUENCEID + EVENTID ~ ITEM, value.var="ITEM2")
  #Append items that are missing in this split as NA columns
  for(i in item){
    if (!(i %in% colnames(df))){
      df[,i] <- NA
    }
  }
  #Keep the two key columns first, then the item columns in alphabetical order
  item_cols <- sort(setdiff(colnames(df), c("SEQUENCEID","EVENTID")))
  df <- df[, c("SEQUENCEID","EVENTID", item_cols)]
  df
}

#Apply dcast
df_pivot <- ddply(lo_raw, .(SPLIT), .fun=function(lo_raw,item) castff(lo_raw,item), item=item, .progress="text", .inform=TRUE)
When executing ddply, RAM usage keeps increasing at runtime until it reaches the maximum (12 GB). Performance then becomes very slow, and I terminated R after a couple of hours.
Is there an alternative way to cast the whole dataset?
Thanks in advance.