Multiple loops with data.table in R

338 views Asked by At

I try to deal with multiple loops with data.table for quite some time and got frustrated. With sql it seamed quite intuitive, but with R I experience some problems.

For example, I would want to fread one txt file (as I have hundreds of files, each around 1 GB), make the calculations (sum price and quant, when time>my.time and for some selected isin's, grouped by my.time, isin and price), write the results to some csv file, remove original txt file from R memory; then redo these calculations one-by-one for all txt files and append the output csv file.

Lets start with the example data (quite small, just two identical files for illustration):

        time<-format(seq.POSIXt(as.POSIXct(Sys.Date()), as.POSIXct(Sys.Date()+1), by = "1 sec"),"%H:%M:%S")
        n<-length(time)
        isin<-paste("US",1:n,sep="")
        price<-rnorm(n,101,1)
        quant<-rnorm(n,5,1)
        dt<-data.table(time,isin,price,quant)
        write.table(dt,"raw.txt",append = FALSE,sep = ",",col.names = TRUE, row.names = FALSE)
        write.table(dt,"raw2.txt",append = FALSE,sep = ",",col.names = TRUE, row.names = FALSE)

    my.files <- list.files(pattern = "raw*.txt")
    my.time<-format(seq.POSIXt(as.POSIXct(Sys.Date()), as.POSIXct(Sys.Date()+1), by = "5 min"),"%H:%M:%S")
    my.isin<-c("US100","US150","US225","US250","US1050")

Then I try these 2 simple loops:

       for (i in my.files){
              for (j in my.time){
              dt<-fread(i)
        write.table(dt[which(isin %in% my.isin & time>j),
           .(sprice=sum(price),squant=sum(quant),**time.my=j**), by = .(isin,price)],
           "output.csv",append = TRUE,sep = ",",col.names = TRUE)
        rm(dt) 
        }}

Second edit: The loop with j finally started working for me (due to the bolded part). Maybe it would be possible to work without for loops and get the same results?

Thanks very much for help!

1

There are 1 answers

7
ArunK On

The problem you have is the output of your which statement is returning zero rows. First I would convert your time to time type. I've then created a 5 minute grouping variable.

This would first aggregate your tables.

dt[,`:=`(time= as.ITime(strptime(time, format="%H:%M:%S")))]
dt[,`:=`(time5 = format(strptime("1970-01-01", "%Y-%m-%d", tz="UTC") + 
                          round(as.numeric(time)/300)*300,"%H:%M"))]

dt[, list(sprice = sum(price),squant= sum(quant)),by = c("time5","price","isin")][isin %in% my.isin]


#    time5     price    isin    sprice   squant
# 1: 00:00 102.46668     US1 102.46668 3.002960
# 2: 00:00  99.02186     US2  99.02186 5.253252
# 3: 00:00 100.23665     US3 100.23665 6.153950
# 4: 00:00 102.21466     US4 102.21466 3.461051
# 5: 00:00 100.97890     US5 100.97890 5.893336

you can then filter it by your my.isn or time5 greater than custom times?