csv file not read correctly (almost half of the rows are removed)

56 views Asked by At

I want to read in a csv file with R. It has a semicolon as the separator and uses double quotes as quotes.

However, the file has some problematic cells (e.g. some have one double quote in text cells or an semicolon in the text) which leads to R not reading the file in correctly.

Since I can't upload a file, here's how the csv looks like when opened with a text editor:

"X1";"X2";"X3"
"1";"0";"test"
"37129";"11746; weird";"weird "thing""
"27929";"1";"xyz Limited"

Row 1 contain the headlines. In row 3 you can see that the X2 value has a semicolon after "11746" and the X3 value has an additional double quote before "thing".

So when reading this file in with readr::read_csv2(file = my_file.csv) in this small example it looks ok, but it actually isn't (note that my real file has 13k rows). For example, we'd expect to see double quotes before and after "thing".

When I use readr::read_csv2(file = my_file.csv, quote = "") instead, row 2's values are shifted due to the semicolon.

Any idea, how I could solve this issue?

The expected output would be:

# A tibble: 3 × 3
     X1 X2           X3             
  <dbl> <chr>        <chr>          
1     1 0            "test"         
2 37129 11746; weird "\"weird thing\""
3 27929 1            "xyz Limited"
3

There are 3 answers

1
r2evans On BEST ANSWER

That is not standards-compliant semi-colon-CSV. The "thing" quotes should be doubled to be escaped, as in "37129";"11746; weird";"weird ""thing""". Whatever is creating that file is not adhering to the standard.

We can try to use regex to parse the contents, changing the embedded " quotes to double "" (and not changing the enclosing "s), perhaps this:

## on the shell, not in R
$ sed -E 's/([^;])"([^;])/\1""\2/g' quux.csv > newquux.csv
## in R
read.csv2("newquux.csv")
#      X1           X2            X3
# 1     1            0          test
# 2 37129 11746; weird weird "thing"
# 3 27929            1   xyz Limited

One might be tempted to do this in R instead, which is generally fine. However, if the csv file is huge, then this will cause an unnecessary growth in the R's memory consumption ... again, only an issue if your file is huge.

read.csv2(
  text = gsub('(?<=[^;])"(?=[^;])', '""',
              readLines("quux.csv"), perl = TRUE))
#      X1           X2            X3
# 1     1            0          test
# 2 37129 11746; weird weird "thing"
# 3 27929            1   xyz Limited

Both of the above work just as well with readr::read_csv2:

readr::read_csv2("newquux.csv")
readr::read_csv2(
  I(gsub('(?<=[^;])"(?=[^;])', '""',
         readLines("quux.csv"), perl = TRUE)))

(The I(..) is necessary here since otherwise it opts to look for a file named "X1";"X2";"X3".)

2
cnluzon On

I think what you need is to speficy " as the quote character, I got what I think you want with this:

read.table("test.csv", sep=";", quote = "\"", header= FALSE)

However, the quotes are removed as they are quoting characters, do you want to keep them as well?

The output:

     V1           V2          V3
1    X1           X2          X3
2     1            0        test
3 37129 11746; weird weird thing
4 27929            1 xyz Limited
0
G. Grothendieck On

This is a non-standard file format but we can read it in using readLines and then replace the delimiting quotes with some other character not in the file -- here we use a single quote as that character.

"my_file.csv" |>
  readLines() |>
  gsub('^"|"$', "'", x = _) |>  # replace " with ' at start & end of each line
  gsub('";"', "';'", x = _) |>  # replace ";" with ';'
  read.csv2(text = _, quote = "'")
##      X1           X2            X3
## 1     1            0          test
## 2 37129 11746; weird weird "thing"
## 3 27929            1   xyz Limited

We can use this code to check which characters are in the file. Any character not listed can be used as the new quote character.

"my_file.csv" |>
  readLines() |>
  strsplit("") |>
  unlist() |>
  table() |>
  names() |>
  cat("\n")
## " ; 0 1 2 3 4 6 7 9 d e g h i L m n r s t w x X y z 

or we can use this code to check whether a particular character, here single quote, appears in the file. If 1 is returned as it does with the example input in the question then that character does not appear and otherwise it does.

max(count.fields("my_file.csv", sep = "'", quote = ""))
## [1] 1