Incorrect CSV parsing when reading in BZ2 file


I'm trying to read in a BZ2 file from the Reddit Politosphere dataset (specifically the "comments_2008-01.bz2" file). The dataset contains, among other things, the body of a Reddit comment.

If I read the file in using read.csv, it works well for the most part, except for a few lines where it incorrectly splits what should be one entry into multiple columns.

df <- read.csv(bzfile("comments_2008-01.bz2"), fill = TRUE)
df[9, ]

What happens:

body..deleted body..cleaned
We ended it in 2004, but they stole it back. Google \\Ohio voting results, 2004.\\

What I would like to happen:

body..deleted
We ended it in 2004, but they stole it back. Google \\Ohio voting results, 2004.\\

When I use read_lines to look at the raw text, the same entry looks like this:

"{\"author\":\"nOD1S\",\"body\":\"We ended it in 2004, but they stole it back. Google \\\"Ohio voting results, 2004.\\\"\",    ..... }"

I think the escaped quote (\") in \"Ohio voting results, 2004.\" is being read by the parser as the end of the field, so the comma that follows pushes the rest of the text into a new column.

I can think of hacky ways to delete these rows altogether, but I don't really want to do that. Any ideas about how to get around this issue?

Answer by Merijn van Tilborg:

After "unzipping", your data is not a CSV file but a JSON file (sadly not formatted 100% correctly). So we can use readLines, then interpret each line as JSON and convert it using the jsonlite package.

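library("jsonlite")

# Read the decompressed file as raw text, one element per line
data <- readLines(bzfile("comments_2008-01.bz2"))

# Parse each line as JSON, turn it into a one-row data frame,
# and stack all rows into a single table
tbl <- do.call(rbind, lapply(
  data,
  function(x) as.data.frame(fromJSON(x)))
)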
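
As a possible alternative, jsonlite also provides stream_in(), which reads a connection holding one JSON object per line and returns a single data frame. The sketch below is only an illustration: it writes two made-up records to a temporary bz2 file (the second author name and both shortened bodies are hypothetical, not taken from the dataset) and reads them back. It assumes every line of the real file is a valid JSON object, which I have not verified.

```r
library(jsonlite)

# Two hypothetical records standing in for lines of the real file.
# The first body contains escaped double quotes and a comma.
lines <- c(
  '{"author":"nOD1S","body":"Google \\"Ohio voting results, 2004.\\""}',
  '{"author":"someone_else","body":"A plain comment"}'
)

# Write them to a temporary bz2-compressed file.
tmp <- tempfile(fileext = ".bz2")
con <- bzfile(tmp, open = "w")
writeLines(lines, con)
close(con)

# stream_in() parses one JSON object per line and combines the
# records into one data frame; it opens and closes the connection itself.
tbl <- stream_in(bzfile(tmp), verbose = FALSE)

# The quoted phrase and its comma stay inside a single "body" value.
stopifnot(nrow(tbl) == 2, ncol(tbl) == 2)
print(tbl$body[1])
```

For the real data this would be stream_in(bzfile("comments_2008-01.bz2")). It avoids building one data frame per line and rbind-ing them, which is likely to matter for a file of this size, but if some lines are malformed you may still need the readLines approach so you can inspect or skip them.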