Combine and transpose many fixed-format dataset files quickly


What I have: ~100 txt files, each with 9 columns and >100,000 rows. What I want: a combined file with only 2 of the columns but all of the rows, which should then be transposed to give an output of >100,000 columns and 2 rows.

I've created the function below to go systematically through the files in a folder, pull the data I want, and, after each file, join it to the original template.

Problem: This works fine on my small test files, but when I try it on large files, I run into a memory-allocation issue. My 8 GB of RAM just isn't enough, and I assume part of the problem is how I wrote my code.

My Question: Is there a way to loop through the files and then join them all at once at the end to save processing time?

Also, if this is the wrong place to put this kind of thing, what is a better forum to get input on WIP code?

## Script to pull in genotype txt files, transpose them, delete commented rows
## and header rows, and then put the files together.

library(plyr)

## Define function
Process_Combine_Genotype_Files <- function(
        inputdirectory = "Rdocs/test", outputdirectory = "Rdocs/test",
        template = "Rdocs/test/template.txt",
        filetype = ".txt", vars = ""
        ){

        ## List the files in the directory & build their full paths
        filenames <- list.files(path = inputdirectory, pattern = "*.txt")
        path      <- paste(inputdirectory, filenames, sep = "/")

        ## Start from the template file
        combined_data <- read.table(template, header = TRUE, sep = "\t")

        ## for-loop: for every file in the directory, do the following
        for (file in path){

                ## Extract the bare file name (used as a column name below)
                currentfilename <- basename(file)

                ## Read the genotype txt file as a data.frame
                data <- read.table(file, header = TRUE, sep = "\t", fill = TRUE)

                ## Subset just the first two columns (Probe ID & Call Codes)
                ## will need to modify this for Genotype calls....
                data.calls <- data[, 1:2]

                ## Change column names & row names
                colnames(data.calls)  <- c("Probe.ID", currentfilename)
                row.names(data.calls) <- data[, 1]

                ## Join this file to the previous data.frame
                combined_data <- join(combined_data, data.calls, type = "full")

        ## End for-loop
        }

        ## Transpose the combined data and write it out
        combined_transposed_data <- t(combined_data)
        print(combined_transposed_data[-1, -1])

        outputfile <- paste(outputdirectory, "Genotypes_combined.txt", sep = "/")
        write.table(combined_transposed_data[-1, -1], outputfile, sep = "\t")

## End function
}

Thanks in advance.


There are 2 answers

Rentrop (Best Answer)

Try:

require(data.table)
filenames <- list.files(path = inputdirectory, pattern = "*.txt", full.names = TRUE)
data_list <- lapply(filenames, fread, select = c(1, 2))  # the columns you want to keep

Now you have a list of all your data. Assuming all the txt files have the same column structure, you can combine them via:

data <- rbindlist(data_list)

Transposing the data:

t(data)

(Thanks to @Jakob H for select in fread)
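
Putting the pieces together, a minimal end-to-end sketch could look like the following. The paths, the file pattern, and the assumption that columns 1 and 2 hold the Probe ID and call codes are borrowed from the question and may need adjusting:

library(data.table)

## read only the first two columns of every txt file
## (assumed here: column 1 = Probe ID, column 2 = Call Codes)
filenames <- list.files("Rdocs/test", pattern = "\\.txt$", full.names = TRUE)
data_list <- lapply(filenames, fread, select = c(1, 2))

## stack all the files into one long 2-column table, then transpose it
combined   <- rbindlist(data_list)
combined_t <- t(combined)

## write the 2-row result out, tab-separated
write.table(combined_t, "Rdocs/test/Genotypes_combined.txt",
            sep = "\t", col.names = FALSE)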

Jacob H

If speed/working memory is the concern, then I would recommend using Unix to do the merging. In general, Unix is faster than R. Further, Unix does not require that all the information be loaded into RAM; rather, it reads it in chunks, so it is effectively never memory bound. If you don't know Unix but plan to manipulate large files frequently in the future, then learn Unix: it is simple to learn and very powerful. I will do an example with csv files.

Generating CSV files in R

for (i in 1:10){
  write.csv(matrix(rpois(1e5*10,1),1e5,10), paste0('test',i,'.csv'))
}

In a shell (Terminal on a Mac or Linux box, Cygwin on Windows):

cut -f 2,3 -d , test1.csv > final.csv            # keep columns 2 and 3 from test1.csv (with its header)
for f in test[2-9].csv test10.csv; do            # for each remaining file...
    cut -f 2,3 -d , "$f" | sed 1d >> final.csv   # ...drop its header line and append columns 2 and 3
done

Note that if you have installed Rtools, you can run all of these Unix commands from R with the system function.
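
For example, something along these lines (a sketch; on Windows the shell redirection and the for loop may require shell() instead of system()):

## run the Unix merging commands from inside R
system("cut -f 2,3 -d , test1.csv > final.csv")
system("for f in test[2-9].csv test10.csv; do cut -f 2,3 -d , \"$f\" | sed 1d >> final.csv; done")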

To transpose, read final.csv back into R and transpose it there.
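
For instance (a sketch; adjust the file names and header handling to match how final.csv was produced):

## read the merged 2-column file back in and transpose it
final   <- read.csv("final.csv", header = TRUE)
final_t <- t(final)                         # 2 rows x >100,000 columns
write.csv(final_t, "final_transposed.csv")  # output name is illustrative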

UPDATE:

I timed the above code; it took 0.4 seconds to run, so doing this for 100 files rather than 10 will likely take about 4 seconds. I have not timed the R code. The Unix and R programs may have similar performance when there are only 10 files, but with 100+ files your computer will likely become memory bound and R will likely crash.