When using write.table
or write.csv
in R, double quotes are added around all non-numeric fields by default regardless of whether the quotes are actually required for proper parsing of the csv file.
Take a Python script as an example:
import csv
f_out=open("pytest.csv", "w")
wri = csv.writer(f_out, delimiter=',')
wri.writerow(['c_numeric', 'c_str', 'c_str_spec'])
wri.writerow([11, "r1c2", "r1c3 nothing special"])
wri.writerow([21, "r2c2", "r2c3,with delim"])
wri.writerow([31, "r3c2", "r3c3\nwith carriage return"])
wri.writerow([41, "r4c2", "r3c3\"with double quote"])
f_out.close()
This will output the following into pytest.csv
:
c_numeric,c_str,c_str_spec
11,r1c2,r1c3 nothing special
21,r2c2,"r2c3,with delim"
31,r3c2,"r3c3
with carriage return"
41,r4c2,"r3c3""with double quote"
This is what I expect and follows what Excel will output also.
Now let's process this file using R and write with and without quotes:
df <- read.csv("pytest.csv")
write.csv(df, 'Rtest.csv', row.names=FALSE)
write.csv(df, 'Rtest_NQ.csv', row.names=FALSE, quote=FALSE)
Here is Rtest.csv
:
"c_numeric","c_str","c_str_spec"
11,"r1c2","r1c3 nothing special"
21,"r2c2","r2c3,with delim"
31,"r3c2","r3c3
with carriage return"
41,"r4c2","r3c3""with double quote"
Notice the quotes around all non-numeric fields.
Here is Rtest_NQ.csv
:
c_numeric,c_str,c_str_spec
11,r1c2,r1c3 nothing special
21,r2c2,r2c3,with delim
31,r3c2,r3c3
with carriage return
41,r4c2,r3c3"with double quote
This file is technically corrupt as it's unreadable by any csv reader — so not a good option.
My question: Is there any rfc4180 compatible writer in R that writes like that of Excel or python csv library and most other rfc4180 compatible tools?
You can write a simple function to construct a csv by converting the data frame to a character matrix, escaping any double quotes then quoting any strings containing commas or line breaks. You then add the column names and write as csv with
writeLines
You can even choose a custom delimiter, which will work as long as it is not esoteric enough to be misinterpreted as a regex.
So if we start with your example:
and we do
We can see it faithfully stores the data frame:
and if we look at the produced csv, it looks like this:
that is, only quoted when needed.
I don't know if there are any counterexamples where this simple method isn't RFC4180 compatible.