Merging and appending a list of ffdf dataframes

I would like to read a vector of CSV file names as ffdf data frames and combine them into one big ffdf data frame. I have found solutions using other R packages; however, my combined data can reach 40 GB, so it needs to be stored on disk, as the ff package does, rather than in RAM. As far as I know, the existing solutions here all rely on RAM storage.

library(ffbase)
library(ff)
library(data.table)  # fread()
library(stringr)     # str_split()

# Create list of csv files
csv_files <- list.files(path = input_path,
                        pattern="*.csv",
                        full.names = T)

# My approach so far
# This uses fread, which appears to be consuming RAM

# Read the files in, assuming comma separator
csv_files_df <- lapply(csv_files, function(x) {
  y <- unlist(str_split(x, "[.]"))[1]
  assign(y,
         as.ffdf(fread(x, stringsAsFactors = T)))})

# Combine them
combined_df <- do.call("ffdfappend", lapply(csv_files_df, as.ffdf))

When I try to combine them, I get this error:

> combined_df <- do.call("ffdfappend", lapply(csv_files_df, as.ffdf))
Error in ffdfappend(list(virtual = list(VirtualVmode = c("double", "integer",  : 
  'list' object cannot be coerced to type 'logical'

Summary: I would like to read and merge the CSV files using only the ff package, without relying on another package, to avoid running out of memory (OOM).

There is 1 answer

DaveArmstrong (best answer)

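The ffdfappend() function only takes two data arguments: the ffdf to append to and the data to append. When you pass a longer list through do.call(), the remaining data frames are matched to the other arguments of ffdfappend(), which expect logical values; that is where the "'list' object cannot be coerced to type 'logical'" error comes from. To use this function in the way you intend, you probably need to call it in a loop, appending one file at a time, something like this: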

library(ff)
library(ffbase)
library(data.table)  # fread()

csv_files <- list.files(path = input_path,
                        pattern = "\\.csv$",
                        full.names = TRUE)

# Read one CSV with fread() and convert it to an ffdf.
# Note: fread() still loads each individual file into RAM first.
read <- function(x) {
  as.ffdf(fread(x, stringsAsFactors = TRUE))
}

# Start from the first file, assuming comma separator
out <- read(csv_files[1])

# Append the remaining files one at a time
# (seq_along(...)[-1] is empty when there is only one file)
for (i in seq_along(csv_files)[-1]) {
  out <- ffdfappend(out, read(csv_files[i]))
}
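
Since the question asks for an approach that relies only on ff, here is a minimal sketch of the same loop using read.csv.ffdf(), which accepts an existing ffdf through its x argument and appends the newly read rows to it. The temporary directory, the file names part1.csv and part2.csv, and the columns id and value are made up for illustration; the sketch also assumes every file has the same columns in the same order.

```r
library(ff)

# Hypothetical demo data: two small CSV files with identical columns
tmp_dir <- tempfile("csv_demo_")
dir.create(tmp_dir)
write.csv(data.frame(id = 1:3, value = c(0.1, 0.2, 0.3)),
          file.path(tmp_dir, "part1.csv"), row.names = FALSE)
write.csv(data.frame(id = 4:5, value = c(0.4, 0.5)),
          file.path(tmp_dir, "part2.csv"), row.names = FALSE)

csv_files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# Start with NULL so the first call creates the ffdf;
# every later call appends the rows of the next file to it
combined_df <- NULL
for (f in csv_files) {
  combined_df <- read.csv.ffdf(x = combined_df, file = f, header = TRUE)
}

nrow(combined_df)
```

read.csv.ffdf() reads each file in chunks, so only one chunk at a time should need to fit in RAM. ff stores text columns as factors, so files with character columns may need extra care (for example, explicit colClasses) that this sketch does not cover.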