How to quote a diverse data.frame in R


I have 2,000+ tables, some with hundreds of rows, that I'm downloading from a web service (of botanical names) and saving to disk for further inspection.

Since some text fields contain carriage returns, I decided to quote everything. But some fields contain " characters and others contain ' characters, so neither can be used for quoting. (I could try to escape them, but some are already escaped, and this would quickly become a mess, so I thought it would be easier to use a different quote character.) I tried %, only to find that some fields also use this character, so I need something different.

I tried ¨ ☺ π and 人, but nothing seems to work! All of them display correctly on screen (RKWard on Ubuntu 14.04) and all are saved correctly by write.table, but NONE can be read back with read.table or read.csv. I'm using UTF-8 as fileEncoding. I get the message "invalid multibyte string", even for ☺ (which is the first ASCII character).

Sys.getlocale(category="LC_ALL")

gives

"LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=pt_BR.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=pt_BR.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=pt_BR.UTF-8;LC_NAME=pt_BR.UTF-8;LC_ADDRESS=pt_BR.UTF-8;LC_TELEPHONE=pt_BR.UTF-8;LC_MEASUREMENT=pt_BR.UTF-8;LC_IDENTIFICATION=pt_BR.UTF-8"

I have tried changing the locale to Chinese in order to use 人 (which shouldn't be needed, I guess, since it displays and saves correctly), but that didn't work either. I get

OS reports request to set locale to "chinese" cannot be honored
OS reports request to set locale to "Chinese" cannot be honored
OS reports request to set locale to "zh_CN.utf-8" cannot be honored

Now the strangest part: if the Chinese characters are in the body of the data, they're read without any problem. It seems they just can't be used as quotes!

Any ideas? Thanks in advance.

There are 2 answers

Molx On

I'm not sure this is the solution you're looking for, but if I understood correctly, you have CR/LF sequences in your text that make it hard to read the data as a table. If so, you can use readLines, which reads each line of the file as a plain string (the \r, \n and \r\n sequences show up escaped when printed), and then read the result as a table. For example, consider the file crlf.txt:

col1 col2 col3 col4 col5
1 \n 3 \r 5 
a \r\n 3 2 2

You can use

> readLines("crlf.txt")
[1] "col1 col2 col3 col4 col5" "1 \\n 3 \\r 5 "          
[3] "a \\r\\n 3 2 2"

And then:

> read.table(text=readLines("crlf.txt"), header = T)
  col1   col2 col3 col4 col5
1    1    \\n    3  \\r    5
2    a \\r\\n    3    2    2

The backslashes are shown doubled because R escapes them when printing; if these sequences were real line breaks, they would actually break the lines.

Nick Kennedy On

See ?scan (scan is used by read.table):

quote: the set of quoting characters as a single character string or ‘NULL’. In a multibyte locale the quoting characters must be ASCII (single-byte).
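Given that restriction, one possibility is to keep the ordinary ASCII double quote and let write.table double any embedded quotes (qmethod = "double") instead of hunting for an unused character. The sketch below is only an illustration: the data frame and its column names are made up, and it assumes a comma separator, under which read.table treats a doubled quote inside a quoted field as a literal quote and accepts line breaks inside quoted fields.

```r
# Hypothetical sample data: embedded newline, double quotes, apostrophe and %.
df <- data.frame(
  id   = 1:2,
  note = c("line one\nline two", "say \"hi\", it's 100%"),
  stringsAsFactors = FALSE
)

path <- tempfile(fileext = ".csv")

# Quote with the ASCII double quote; embedded quotes are written doubled ("").
write.table(df, path, sep = ",", quote = TRUE, qmethod = "double",
            row.names = FALSE, fileEncoding = "UTF-8")

# Read back: only " is a quote character, so apostrophes are plain text,
# and comment.char = "" keeps any # in the data from starting a comment.
back <- read.table(path, sep = ",", quote = "\"", header = TRUE,
                   comment.char = "", stringsAsFactors = FALSE,
                   fileEncoding = "UTF-8")

# Check that the text column survived the round trip.
identical(back$note, df$note)
```

Whether this suits the real tables depends on how the "already escaped" fields should be interpreted; here a backslash in the data is simply kept as a literal character.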

The easiest option would be to replace all your embedded newlines with another string before importing the file, and then reintroduce the newlines afterwards using gsub.
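A minimal sketch of that round trip, assuming the replacement is done in R before the tables are written to disk. The sample data, the "<NL>" placeholder and the tab separator are all assumptions made for illustration; the placeholder and the separator must never occur in the real data.

```r
# Hypothetical sample data: one field has an embedded newline,
# others contain ", ' and % characters.
df <- data.frame(
  name = c("Abies alba", "Quercus \"robur\""),
  note = c("first line\nsecond line", "it's 100% fine"),
  stringsAsFactors = FALSE
)

placeholder <- "<NL>"  # assumption: this token never appears in the data

# Replace embedded line breaks (\r\n, \r or \n) in every character column.
is_chr <- vapply(df, is.character, logical(1))
encoded <- df
encoded[is_chr] <- lapply(encoded[is_chr],
                          function(x) gsub("\r\n|\r|\n", placeholder, x))

# With no line breaks left inside fields, no quoting is needed at all.
path <- tempfile(fileext = ".tsv")
write.table(encoded, path, sep = "\t", quote = FALSE, row.names = FALSE,
            fileEncoding = "UTF-8")

# Read back with quoting and comments switched off, then restore the newlines.
decoded <- read.table(path, sep = "\t", quote = "", comment.char = "",
                      header = TRUE, stringsAsFactors = FALSE,
                      fileEncoding = "UTF-8")
decoded[is_chr] <- lapply(decoded[is_chr],
                          function(x) gsub(placeholder, "\n", x, fixed = TRUE))

# Check that the text column survived the round trip.
identical(decoded$note, df$note)
```

Note that this sketch restores every line break as \n, so an original \r\n or lone \r is not preserved; separate placeholders would be needed if that distinction matters.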