I'm trying to read in a BZ2 file from the Reddit Politosphere dataset (specifically the "comments_2008-01.bz2" file). The dataset contains, among other things, the body of a Reddit comment.
If I read the file in using `read.csv`, it works well for the most part, except for a few lines where it incorrectly splits what should be one entry into multiple columns.
```r
df <- read.csv(bzfile("comments_2008-01.bz2"), fill = T)
df[9, ]
```
What happens:
| body..deleted | body..cleaned |
|---|---|
| We ended it in 2004, but they stole it back. Google \\Ohio voting results, | 2004.\\ |
What I would like to happen:
| body..deleted |
|---|
| We ended it in 2004, but they stole it back. Google \\Ohio voting results, 2004.\\ |
When I use `read_lines` to explore:
"{\"author\":\"nOD1S\",\"body\\":\"We ended it in 2004, but they stole it back. Google \\\"Ohio voting results, 2004.\\\"\", ..... }"
What I think is happening is that in `\"Ohio voting results, 2004`, the `\"` is unintentionally telling the parser that the field is complete, which is why the next `,` pushes the rest of the text into a new column.
I can think of hacky ways to delete these rows altogether, but I don't really want to do that. Any ideas about how to get around this issue?
As your data is not a CSV file after "unzipping" but a JSON file (sadly not formatted 100% correctly: each line is its own JSON object rather than the whole file being one JSON document), we can use `readLines` and then interpret each line as JSON, converting it with the jsonlite package.
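A minimal sketch of that approach (assuming, as the raw line shown above suggests, that every line of the decompressed file is a self-contained JSON object; the column name and row index at the end are only for illustration):

```r
library(jsonlite)

# Read the raw decompressed lines: one JSON object per line,
# so none of read.csv's quoting rules get in the way
lines <- readLines(bzfile("comments_2008-01.bz2"))

# Let jsonlite treat the lines as a newline-delimited JSON stream
# and collapse the parsed records into a single data frame
df <- stream_in(textConnection(lines))

# The escaped quotes inside the comment body are handled by the
# JSON parser, so the text stays in one column
df$body[9]
```

`stream_in()` also tolerates records whose fields differ from line to line, which a per-line `fromJSON()` followed by `rbind()` would not.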