Write RFC4180 compatible flat files with less agressive quoting

275 views Asked by At

When using write.table or write.csv in R, double quotes are added around all non-numeric fields by default regardless of whether the quotes are actually required for proper parsing of the csv file.

Take a Python script as an example:

import csv
f_out=open("pytest.csv", "w")
wri = csv.writer(f_out, delimiter=',')
wri.writerow(['c_numeric', 'c_str', 'c_str_spec'])
wri.writerow([11, "r1c2", "r1c3 nothing special"])
wri.writerow([21, "r2c2", "r2c3,with delim"])
wri.writerow([31, "r3c2", "r3c3\nwith carriage return"])
wri.writerow([41, "r4c2", "r3c3\"with double quote"])
f_out.close()

This will output the following into pytest.csv:

c_numeric,c_str,c_str_spec
11,r1c2,r1c3 nothing special
21,r2c2,"r2c3,with delim"
31,r3c2,"r3c3
with carriage return"
41,r4c2,"r3c3""with double quote"

This is what I expect and follows what Excel will output also.

Now let's process this file using R and write with and without quotes:

df <- read.csv("pytest.csv")
write.csv(df, 'Rtest.csv', row.names=FALSE)
write.csv(df, 'Rtest_NQ.csv', row.names=FALSE, quote=FALSE)

Here is Rtest.csv:

"c_numeric","c_str","c_str_spec"
11,"r1c2","r1c3 nothing special"
21,"r2c2","r2c3,with delim"
31,"r3c2","r3c3
with carriage return"
41,"r4c2","r3c3""with double quote"

Notice the quotes around all non-numeric fields.

Here is Rtest_NQ.csv:

c_numeric,c_str,c_str_spec
11,r1c2,r1c3 nothing special
21,r2c2,r2c3,with delim
31,r3c2,r3c3
with carriage return
41,r4c2,r3c3"with double quote

This file is technically corrupt as it's unreadable by any csv reader — so not a good option.

My question: Is there any rfc4180 compatible writer in R that writes like that of Excel or python csv library and most other rfc4180 compatible tools?

1

There are 1 answers

4
Allan Cameron On

You can write a simple function to construct a csv by converting the data frame to a character matrix, escaping any double quotes then quoting any strings containing commas or line breaks. You then add the column names and write as csv with writeLines

You can even choose a custom delimiter, which will work as long as it is not esoteric enough to be misinterpreted as a regex.

write_unquoted <- function(df, path, delim = ",")
{
  regexp <- paste0(delim, "|\n")
  x <- as.matrix(df) 
  x[grep("\"", x)] <- paste0("\"", gsub("\"", "\"\"", x[grep("\"", x)]), "\"")
  x[grep(regexp, x)]  <- paste0("\"", x[grep(regexp, x)], "\"")
  x <- c(paste0(colnames(x), collapse = delim), apply(x, 1, paste0, collapse = delim))
  writeLines(x, path)
}

So if we start with your example:

df
#>   c_numeric c_str                 c_str_spec
#> 1        11  r1c2       r1c3 nothing special
#> 2        21  r2c2            r2c3,with delim
#> 3        31  r3c2 r3c3\nwith carriage return
#> 4        41  r4c2     r3c3"with double quote

and we do

write_unquoted(df, "my.csv")

We can see it faithfully stores the data frame:

identical(read.csv("my.csv"),  df)
#> [1] TRUE

and if we look at the produced csv, it looks like this:

c_numeric,c_str,c_str_spec
11,r1c2,r1c3 nothing special
21,r2c2,"r2c3,with delim"
31,r3c2,"r3c3
with carriage return"
41,r4c2,"r3c3""with double quote"

that is, only quoted when needed.

I don't know if there are any counterexamples where this simple method isn't RFC4180 compatible.