Efficient way of splitting, applying a function and returning a data.frame with variable-length vectors


I am currently trying to use plyr + reshape2 to process my data, but it is taking a lot of time. I have a data frame (df) with 3 columns: network, user_id and date. My goal is:

  1. split df by two levels (network and user_id);
  2. apply a function (get_interval) to each split;
  3. bind the results into another data frame (df2).

get_interval returns a vector of the same length as the number of rows of its input. Thus, df2 has the same size as df, but with the results computed by get_interval.

The problem is that I cannot use ddply directly, since it only handles results of equal length, while get_interval returns vectors whose length varies from group to group.
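I know I could force ddply to accept this by wrapping the result in a data.frame, roughly:

ddply(df, .(network, user_id), function(d) data.frame(interval = get_interval(d)))

but that still goes through plyr's per-group splitting, so I expect it to be about as slow as what I have now.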

I came up with this solution:

aux <- melt(dlply(df, .(network, user_id), get_interval))
df2 <- cbind(interval = aux$value, colsplit(aux$L1, "\\.", names = c("network", "user_id")))

But this is very inefficient, and since df is quite big, I waste hours every time I have to run it. Is there a way of doing this more efficiently?

EDIT

The basic operation of get_interval is as follows:

get_interval <- function(df){
    if (nrow(df) < 2)
        return(NA)                        # single observation: no interval to compute
    # NA for the first row, then the time difference to the previous row
    x <- c(NA, df$date[-1] - df$date[-nrow(df)])
    return(x)  ## ceiling won't work because some intervals are 0.
}
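To illustrate the shape of the output, here is get_interval on a tiny made-up group (hypothetical dates, only to show the leading NA and the gaps to the previous row):

toy <- data.frame(date = as.POSIXct("2013-12-16 00:00:00") + c(0, 30, 30, 3600))
get_interval(toy)
## [1]   NA   30    0 3570

One thing I am unsure about: `-` on POSIXct picks the difftime units automatically, so difftime(df$date[-1], df$date[-nrow(df)], units = "secs") might be safer if the scale differs between groups.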

It is possible to generate this data artificially with:

n <- 1000000
ref_time <- as.POSIXct("2013-12-17 00:00:00")
interval_range <- 86400*10 # 10 days
df <- data.frame(user_id = floor(runif(n, 1, n/10)),
                 network = gl(2, n/2, labels = c("anet", "unet")),  # n rows total
                 date    = as.POSIXct(ref_time - runif(n, 0, interval_range)))
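I suspect a grouped assignment in data.table might be faster, but I am not sure it is the right approach. A rough, untested sketch (assuming data.table is acceptable and that rows are already in date order within each user) would be:

library(data.table)
dt <- as.data.table(df)
setkey(dt, network, user_id)   # groups the rows; within-group order is preserved
dt[, interval := as.numeric(get_interval(.SD)), by = .(network, user_id)]
## as.numeric() keeps the new column numeric even for single-row groups,
## where get_interval() returns a logical NA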