R: efficiently and quickly splitting large data files in a directory by a variable and writing out the files

I have run into a problem: how to quickly and efficiently read and split a list of very large transaction data files by a column called SecurityID. Inside each transaction data file there can be a different number of transactions per SecurityID, and the set of SecurityIDs is not necessarily the same on every transaction day. After splitting the data by SecurityID, I want to write the output into smaller csv files, one per SecurityID. On top of that, whenever a new transaction data file is read in, the code should detect whether an output for that SecurityID already exists: if so, append the new data to the existing SecurityID.csv; if not, create a new one.

I have attached my code below; it does work on a small set of data. On the full set of data, however, it is very, very slow, which makes me think there must be a better, faster strategy. And yes, unfortunately I only have my Mac to do this on.

To put the file sizes into perspective: each transaction data file (one day) is about 7-10 GB, and I have 10 years' worth of data, which I am only processing year by year. My Mac has 32 GB of memory, and I think memory is certainly part of the problem: RStudio has been in a white-screen state for 3 days and is still writing files.

I think I have to figure out some way to vectorise this code better, and perhaps utilise parallel processing. A rough idea for the parallel part is sketched right below; my current code follows after that.
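One possibility (an untested sketch, not something I have run on the real data) is to parallelise the per-SecurityID writes within a single day: every SecurityID maps to exactly one output file, so no two workers ever touch the same file. The sketch reuses input.dat, col and out.dir.marketdata from my code further down, and n.cores is just a made-up core count; parallel::mclapply forks, so this should work on macOS/Linux but not on Windows.

library(parallel)
library(data.table)

n.cores <- 4  # made-up core count

# split one day's file into a list of data.tables, one per SecurityID
sptdf <- split(input.dat, input.dat[[col]])

# each SecurityID is handled by exactly one worker, so no output file
# is ever written to concurrently
invisible(mclapply(names(sptdf), function(id) {
  outfile.path <- file.path(out.dir.marketdata, paste0(id, ".csv"))
  data.table::fwrite(sptdf[[id]],
                     file   = outfile.path,
                     append = file.exists(outfile.path),
                     quote  = FALSE)
  NULL
}, mc.cores = n.cores))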

My current code follows. Set the working directories etc.:

# Input of market and transaction data
external.dir <- "my external hard drive"
input.dir.marketdata <-  file.path(external.dir,"marketdata")


# create output directory for the split market data

out.dir.marketdata <- file.path(external.dir, "market_data_split")
dir.create(out.dir.marketdata, showWarnings = FALSE)  # no warning if it already exists

Next, load a couple of libraries:

# loading libraries (only data.table is actually used below; note that
# loading plyr after tidyverse masks some dplyr functions)
library(tidyverse)
library(plyr)
library(data.table)

Then set the column to split by

# Define col to split by as global var
col <- "SecurityID"


# Split and write out the market data
files <- list.files(input.dir.marketdata, pattern = 'MarketData\\.csv$', full.names = TRUE)
files


 for (i in seq_along(files)) {

    input.dat <- data.table::fread(files[i], header = TRUE, stringsAsFactors = TRUE)

    # split one day's data into a list of data.tables, one per SecurityID
    sptdf <- split(input.dat, input.dat[[col]])

    outfile <- names(sptdf)

    for (j in seq_along(outfile)) {

      new_data <- sptdf[[outfile[j]]]

      outfile.name <- paste0(outfile[j], ".csv")
      outfile.path <- file.path(out.dir.marketdata, outfile.name)

      # check for the existing output file directly instead of listing the
      # whole output directory on every iteration
      if (file.exists(outfile.path)) {
        print(paste0(outfile.name, " already exists!"))
        existing.data <- data.table::fread(outfile.path,
                                           header = TRUE, stringsAsFactors = TRUE)
        combined_data <- rbind(existing.data, new_data)
        data.table::fwrite(combined_data,
                           file = outfile.path,
                           row.names = FALSE,
                           quote = FALSE)
        gc()
      } else {
        print(paste0(outfile.name, " is a new SecurityID!"))
        data.table::fwrite(new_data,
                           file = outfile.path,
                           row.names = FALSE,
                           quote = FALSE)
        gc()
      }
    }
 }
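I suspect the main slowdown is the read-combine-rewrite cycle for every SecurityID on every day. One alternative I am considering (an untested sketch, not my current code) is to append the new rows directly with data.table::fwrite(append = TRUE), so the existing output files never have to be read back in; the header is written only when a file is first created, and this assumes the column order is identical across all days.

 for (i in seq_along(files)) {

    # no need for stringsAsFactors when the data is only split and written out
    input.dat <- data.table::fread(files[i], header = TRUE)

    # one data.table per SecurityID for this day
    sptdf <- split(input.dat, input.dat[[col]])

    for (id in names(sptdf)) {

      outfile.path <- file.path(out.dir.marketdata, paste0(id, ".csv"))

      # append only when the file already exists, so the header row is
      # written exactly once, when the file is first created
      data.table::fwrite(sptdf[[id]],
                         file   = outfile.path,
                         append = file.exists(outfile.path),
                         quote  = FALSE)
    }

    rm(input.dat, sptdf)
    gc()
 }

That way each day's file is read once and each group's rows are written once, instead of rewriting ever-growing per-SecurityID files.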