R - looping through data.frames in a list - modify characters of column (list element)

I have a couple of thousand *.csv files, each with a unique name, but the header columns are the same in every file: "Timestamp", "System_Name", "CPU_ID", etc.
My question is how I can replace or anonymize the "System_Name" column, which holds a system name like "as12535.org.at" or any other character combination. I am grateful for any hint or pointer in the right direction.
Below is the structure of a CSV file:

"Timestamp","System_Name","CPU_ID","User_CPU","User_Nice_CPU","System_CPU","Idle_CPU","Busy_CPU","Wait_IO_CPU","User_Sys_Pct"  
"1161025010002000","as06240.org.xyz:LZ","-1","1.83","0.00","0.56","97.28","2.72","0.33","3.26" 
"1161025010002000","as06240.org.xyz:LZ","-1","1.83","0.00","0.56","97.28","2.72","0.33","3.26" 
"1161025010002000","as06240.org.xyz:LZ","-1","1.83","0.00","0.56","97.28","2.72","0.33","3.26"

I tried the R package anonymizer, which works fine at the vector level, but I ran into issues doing this for the thousands of CSV files I was reading into R. What I tried was creating a list that holds all the CSV files as data frames:

# initialize a list
# note: setwd() returns the previous working directory, not "mypath"
r.path <- setwd("mypath")
ldf <- list()

# list the csv files in my directory, filtered for files with
# "UnixM" in the filename (for testing)
listcsv <- dir(pattern = ".UnixM.")

# seq_along() also behaves correctly when no file matches
for (i in seq_along(listcsv)) {
  ldf[[i]] <- read.csv(file = listcsv[i])
}

I have been racking my brain, because I could not anonymize the System_Name column, or even replace some characters (for pseudo-anonymization), while looping through the list (ldf) and the data frames inside it.

My list ldf (containing the df for the single csv files) looks like this:

summary(ldf)
Length Class      Mode
[1,] 5      data.frame list
[2,] 5      data.frame list
[3,] 5      data.frame list

This shows the structure of my list, which holds each file's contents as a data frame.

How can I read in all the CSV files, change or anonymize all or part of the "System_Name" column, and do this for every CSV in my directory, in a loop in R? It doesn't need to be super elegant; I'm happy as long as it does the job :-)

There is 1 answer

Aurèle On BEST ANSWER

A common pattern for doing this would be:

df <- do.call(
  rbind,
  lapply(dir(pattern = "UnixM"), 
         read.csv, stringsAsFactors = FALSE)
)
df$System_Name <- anonymizer::anonymize(df$System_Name)

It differs from what you were trying in that it binds all the data frames into one and then anonymizes the column once.
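
One thing the bound data frame loses is which file each row came from. If that matters, a minimal sketch along these lines might help. The helper name `read_and_bind` and the column name `source_file` are made up for illustration and are not part of any package.

```r
# Hypothetical helper: read several CSV files into one data frame,
# remembering which file each row came from.
read_and_bind <- function(files) {
  dfs <- lapply(files, function(f) {
    df <- read.csv(f, stringsAsFactors = FALSE)
    # rep() keeps this safe for files that contain a header but no rows
    df$source_file <- rep(basename(f), nrow(df))  # assumed column name
    df
  })
  do.call(rbind, dfs)
}

# Usage, assuming the files are in the working directory:
# df <- read_and_bind(dir(pattern = "UnixM"))
# df$System_Name <- anonymizer::anonymize(df$System_Name)
# per_file <- split(df, df$source_file)  # back to one data frame per file
```

With the extra column in place, `split()` can recover a list with one data frame per original file after the anonymization step.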

Of course you can keep everything in a list, like @S Rivero suggests. It would look like:

listdf <- lapply(
  dir(pattern = "UnixM"),
  function(filename) {
    df <- read.csv(filename, stringsAsFactors = FALSE)
    df$System_Name <- anonymizer::anonymize(df$System_Name)
    df
  }
)
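
If the goal is only pseudo-anonymization and the anonymized files should be written back to disk, something like the following base R sketch could work without the anonymizer package. It is an illustration under assumptions, not a tested solution for your data: the helper name `pseudonymize_files`, the `host_0001` naming scheme, and the output directory are all invented here. Every column is read as character so that values such as the long Timestamp are written back unchanged.

```r
# Hypothetical helper: replace each distinct System_Name with a stable
# pseudonym ("host_0001", "host_0002", ...) shared across all files,
# then write each file to out_dir under its original file name.
pseudonymize_files <- function(files, out_dir) {
  # Read everything as character so no value is reformatted on output
  dfs <- lapply(files, read.csv, colClasses = "character")

  # One lookup table for the whole batch, so the same system gets the
  # same pseudonym in every file
  systems <- unique(unlist(lapply(dfs, function(df) df$System_Name)))
  lookup <- setNames(sprintf("host_%04d", seq_along(systems)), systems)

  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  for (i in seq_along(files)) {
    dfs[[i]]$System_Name <- unname(lookup[dfs[[i]]$System_Name])
    write.csv(dfs[[i]], file.path(out_dir, basename(files[i])),
              row.names = FALSE)
  }
  invisible(lookup)  # original name -> pseudonym
}

# Usage, assuming the CSV files are in the working directory:
# lookup <- pseudonymize_files(dir(pattern = "UnixM"), "anonymized")
```

The returned lookup maps original names to pseudonyms, so it has to be stored securely or discarded, depending on whether the mapping should be reversible.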