Filter a CSV file while preserving the text lines above the column names

I work with hundreds of CSV files that have three lines of text above multiple column headers.

CSV example.

These files can be as large as 300 MB. They need to be filtered down to less than a third of that size during local preprocessing, before the result is pushed to a server for processing. Filtering works fine when the three lines are skipped on read:

A <- read_csv_arrow(p, as_data_frame = TRUE, skip = 3)

The problem is restoring the first three lines of text, unchanged, after filtering, because that information is needed in a later processing step.

I have tried several approaches that isolate the text as an object and then use cat() to put it back after filtering, but the process must end with a CSV file, which I have not managed to produce.

There are 2 answers

Edward On

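Use write.table twice.

Suppose you have stored the first three lines in an object called text. Write that to a CSV file with no row or column names, and with quote = FALSE so the lines are not wrapped in double quotes:

write.table(text, file = "filename.csv", col.names = FALSE, row.names = FALSE, quote = FALSE, sep = ",")

Then append the filtered data frame, called mydata here, to the same file. Setting row.names = FALSE keeps the row numbers out of the output. write.table warns about appending column names, which is expected in this case.

write.table(mydata, file = "filename.csv", sep = ",", append = TRUE, row.names = FALSE)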
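
The two steps above can be sketched end to end as follows. This is only an illustration under assumptions: the preamble is exactly three lines, the sample data and the filter condition (a > 2) are made up, temp files stand in for the real paths, and writeLines is swapped in for the first write.table call. suppressWarnings() only hides the "appending column names to file" warning.

```r
# Build a small sample file: three preamble lines, then a regular CSV table.
infile <- tempfile(fileext = ".csv")
writeLines(c("Text 1st line", "Text 2nd line", "Text 3rd line",
             "a,b,c", "1,5,9", "2,6,10", "3,7,11", "4,8,12"), infile)

# Keep the preamble as raw text and read the table separately.
text   <- readLines(infile, n = 3L)
mydata <- read.csv(infile, skip = 3L)

# Hypothetical filter: keep rows where a > 2.
mydata <- mydata[mydata$a > 2, ]

# Write the preamble unchanged, then append the filtered table.
outfile <- tempfile(fileext = ".csv")
writeLines(text, outfile)
suppressWarnings(
  write.table(mydata, file = outfile, sep = ",", append = TRUE,
              row.names = FALSE, col.names = TRUE, quote = FALSE)
)

result <- readLines(outfile)
```

Here result holds the three preamble lines, the column names, and only the rows that passed the filter.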
Rui Barradas On

The following does not use read_csv_arrow; it uses readLines and read.csv to read the file. The function read_special returns a named list with two members:

  1. header holds the first 3 lines;
  2. data is the table that follows the first 3 lines.

Then the function write_special writes a list with those members back to disk.

File "so_test.csv" content:

Text 1st line
Text 2nd line
Text 3rd line
a,b,c
1,5,9
2,6,10
3,7,11
4,8,12

Code

In read_special you can replace read.csv with read_csv_arrow, but the return value must still be a list with those named members. The header is read with readLines; the table reader then skips the first 3 lines.

After pre-processing the table, pass the list to write_special to write it back to disk.

# Read the first n lines as text and the rest of the file as a table.
read_special <- function(file, n = 3L, skip = n, ...) {
  x <- readLines(file, n = n)
  y <- read.csv(file, skip = skip, ...)
  list(header = x, data = y)
}

# Write the header lines first, then append the table to the same file.
write_special <- function(x, file, ...) {
  writeLines(text = x$header, con = file)
  write.table(x$data, file = file, sep = ",", append = TRUE, ...)
}

path <- "~/Temp"
infile <- file.path(path, "so_test.csv")
outfile <- file.path(path, "so_test_2.csv")

x <- read_special(infile)
x
# $header
# [1] "Text 1st line" "Text 2nd line" "Text 3rd line"
#
# $data
#   a b  c
# 1 1 5  9
# 2 2 6 10
# 3 3 7 11
# 4 4 8 12

write_special(x, file = outfile, quote = FALSE, row.names = FALSE, col.names = TRUE)
# Warning message:
# In write.table(x$data, file = file, sep = ",", append = TRUE, ...) :
#   appending column names to file
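
The warning is harmless here, because the column names are meant to follow the header text. If it is unwanted, one possible variant writes both parts through a single connection that is opened once, so append = TRUE is never needed. The name write_special_con and its path argument are hypothetical and not part of the answer's code; this is a sketch, not a drop-in replacement.

```r
# Hypothetical variant of write_special: one connection, opened once.
write_special_con <- function(x, path, quote = FALSE) {
  con <- file(path, open = "wt")
  on.exit(close(con), add = TRUE)
  writeLines(x$header, con)          # header lines, written verbatim
  # Writing to an already open connection continues after the header,
  # so no append = TRUE (and no warning) is involved.
  write.table(x$data, file = con, sep = ",", quote = quote,
              row.names = FALSE, col.names = TRUE)
  invisible(path)
}

# Made-up input in the same shape that read_special returns.
x_demo <- list(header = c("Text 1st line", "Text 2nd line", "Text 3rd line"),
               data   = data.frame(a = 1:2, b = 5:6, c = 9:10))
out_demo <- tempfile(fileext = ".csv")
write_special_con(x_demo, out_demo)
written <- readLines(out_demo)
```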

File "so_test_2.csv" content:

The output file is identical to the input file.

Text 1st line
Text 2nd line
Text 3rd line
a,b,c
1,5,9
2,6,10
3,7,11
4,8,12
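
The note about swapping read.csv for read_csv_arrow can be sketched by making the table reader an argument. The function read_special_with and its reader parameter are hypothetical names introduced for illustration; the sketch assumes the reader accepts a skip argument, as read.csv and arrow's read_csv_arrow both do. The filter (b >= 7) is made up.

```r
# Hypothetical variant of read_special with a pluggable table reader.
read_special_with <- function(file, reader = utils::read.csv, n = 3L, ...) {
  header <- readLines(file, n = n)          # header text, kept verbatim
  data   <- reader(file, skip = n, ...)     # table that follows the header
  list(header = header, data = as.data.frame(data))
}

# Sample file with the same content as "so_test.csv".
demo_file <- tempfile(fileext = ".csv")
writeLines(c("Text 1st line", "Text 2nd line", "Text 3rd line",
             "a,b,c", "1,5,9", "2,6,10", "3,7,11", "4,8,12"), demo_file)

x2 <- read_special_with(demo_file)
# With the arrow package installed, the call would instead be:
#   x2 <- read_special_with(demo_file, reader = arrow::read_csv_arrow)

# Filter only the table; the header member is left untouched.
x2$data <- x2$data[x2$data$b >= 7, ]
```

The filtered list keeps the same two named members, so it can be passed to write_special as before.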