I have a large number of CSV files that need tidying up by removing all rows that have an n/a in a specific column. So rather than opening each one manually in Excel, I want to know if it is possible to write some R code that will:

  1. Load all the CSV files from a folder into R
  2. Remove all rows that contain n/a in a specific column ('Error') for each data file individually
  3. Change all remaining n/a to ""
  4. Save the tidied data as individual Excel files (keeping the original names plus a little bit on the end so I can tell the RAW file from the processed one)

I am able to do the above steps for an individual file but I am not having much luck expanding my code to do the same for multiple CSV files.

This is what I currently have if I am doing the files one at a time:

df <- read.csv("#filename")
working <- df[!is.na(df$Error),]
working <- sapply(working, as.character)
working[is.na(working)] <- ""

I have managed to import a list of files using this code:

temp <- list.files(pattern = "\\.csv$")
for(i in 1:length(temp)){assign(temp[i], read.csv(temp[i]))}

but am stuck as to how I proceed from there.

The end result I am after is that each file, once processed, will have no n/a rows in the 'Error' column and will be saved as an Excel file. At no point do I want to combine the data frames, as that will get too messy when trying to untangle which data belongs to which file.

Thanks for your help guys :-)

1 Answer

TabeaKischka On

Your R script, myscript.R:

library(openxlsx)  # provides write.xlsx()

args <- commandArgs(trailingOnly = TRUE)
df <- read.csv(args[1], header = TRUE)
working <- df[!is.na(df$Error), ]
working <- sapply(working, as.character)
working[is.na(working)] <- ""
write.xlsx(as.data.frame(working), paste(args[1], "test.xlsx", sep = "_"))

Now, if you are running a Unix system, you can open the terminal and run the following loop over all files ending in ".CSV" in the folder /folder/with/input/data:

cd /folder/with/input/data
for FILE in *.CSV; do
  Rscript myscript.R "$FILE"
done
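
Alternatively, the whole job can stay inside R with no shell loop at all. A minimal sketch, assuming the CSV files sit in the working directory and the `writexl` package is installed (the `_processed` suffix is just an illustrative naming choice, not something from the question):

```r
library(writexl)  # provides write_xlsx(); install.packages("writexl") if needed

# All CSV files in the working directory, case-insensitive
files <- list.files(pattern = "\\.csv$", ignore.case = TRUE)

for (f in files) {
  # Treat both "NA" and "n/a" as missing on import
  df <- read.csv(f, na.strings = c("NA", "n/a"))

  # Drop rows where the 'Error' column is missing
  working <- df[!is.na(df$Error), ]

  # Convert everything to character and blank out remaining NAs
  working <- as.data.frame(sapply(working, as.character),
                           stringsAsFactors = FALSE)
  working[is.na(working)] <- ""

  # Save next to the original, keeping the name plus a suffix
  out <- paste0(tools::file_path_sans_ext(f), "_processed.xlsx")
  write_xlsx(working, out)
}
```

Each file is read, cleaned, and written independently, so the data frames are never combined.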