I am currently trying to use plyr + reshape2 to process my data, but it is taking a lot of time.
I have a dataframe (df) with 3 columns: network, user_id and date.
My goal is:
- to split df on two levels (network and user_id);
- to apply a function (get_interval) to each split;
- to bind the results into another dataframe (df2).
get_interval returns a vector of the same length as the number of rows of its input. Thus, df2 has the same number of rows as df, but with the values computed by get_interval.
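To make the target concrete, here is a toy split (with made-up values) and the vector get_interval should return for it:

## one (network, user_id) split with four rows
df_small <- data.frame(network = "anet",
                       user_id = 1,
                       date = as.POSIXct("2013-12-17 00:00:00") + c(0, 60, 60, 3600))
## desired interval values for this split: NA, then the gaps between
## consecutive dates (in seconds): NA 60 0 3540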
The problem is that I cannot use ddply directly, since it only handles vectors of equal length, and the results of get_interval vary in length from split to split.
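A workaround that seems possible (a sketch, untested at scale) is to wrap the vector in a data.frame, so that every piece handed back to ddply is a data.frame; I suspect it would carry the same plyr overhead, though:

## hypothetical ddply variant: wrap the result so each piece is a data.frame
df2 <- ddply(df, .(network, user_id),
             function(d) data.frame(interval = get_interval(d)))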
I came up with this solution:
aux <- melt(dlply(df, .(network, user_id), get_interval))
df2 <- cbind(interval = aux$value,
             colsplit(aux$L1, "\\.", names = c("network", "user_id")))
But it is very inefficient, and since df is quite big, I waste hours every time I have to run it. Is there a way of doing this more efficiently?
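One direction I have been looking at (a sketch, assuming the data.table package; untested on the real data) is to compute the lagged difference by group in place:

library(data.table)

dt <- as.data.table(df)
## lagged difference per (network, user_id) group, in seconds;
## c(NA, diff(...)) gives NA for the first row of each group, which
## also covers single-row groups
dt[, interval := c(NA, diff(as.numeric(date))), by = list(network, user_id)]

This would keep everything in one table and avoid the melt/colsplit step, but I do not know if it is the recommended route.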
EDIT
The basic operation of get_interval is as follows:
get_interval <- function(df) {
  if (nrow(df) < 2)
    return(NA)
  x <- c(NA, df$date[-1] - df$date[-nrow(df)])
  return(x)  ## ceiling won't work because some intervals are 0.
}
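Note that get_interval is essentially a lagged difference, so it can also be expressed without plyr at all. A base-R sketch using ave (assuming intervals in seconds), which keeps the result aligned with df's original row order:

## same computation with ave(): FUN receives each group's dates and must
## return a vector of the same length, which c(NA, diff(x)) does
df2 <- df
df2$interval <- ave(as.numeric(df$date),
                    df$network, df$user_id,
                    FUN = function(x) c(NA, diff(x)))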
It is possible to generate this data artificially with:
n <- 1000000
ref_time <- as.POSIXct("2013-12-17 00:00:00")
interval_range <- 86400 * 10  # 10 days

df <- data.frame(user_id = floor(runif(n, 1, n / 10)),
                 ## gl(2, n) has length 2*n, so the other columns are
                 ## recycled and df ends up with 2*n rows
                 network = gl(2, n, labels = c("anet", "unet")),
                 date = as.POSIXct(ref_time - runif(n, 0, interval_range)))
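For reference, the current solution can be timed on this artificial data along these lines (a sketch; absolute timings will of course vary by machine):

library(plyr)
library(reshape2)

system.time({
  aux <- melt(dlply(df, .(network, user_id), get_interval))
  df2 <- cbind(interval = aux$value,
               colsplit(aux$L1, "\\.", names = c("network", "user_id")))
})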