I am trying to read in a large CSV data file (delimited by commas), and I keep getting stuck on rows such as the following (link to the raw file: http://daniels-pull.universityofdenv.netdna-cdn.com/assets/GeneralOccurrencesAll.csv):
| RIN | UCR_Group | Loc |
|--------|-----------|---------------------------------------------|
|490658 | Property | 400 BLOCK 17TH ST, 5TH FL |
|506928 | Disorder | 1400 BLOCK W COLORADO AVE, COLORADO SPRINGS |
As you can see, the file's delimiter also appears inside one of the columns. Is there a way to specify the delimiter as a regular expression when reading in the file, or would I need to look into using read.fwf to find the max length of each field and parse the data using that? This is the code I have come up with so far:
```r
datafile <- "http://daniels-pull.universityofdenv.netdna-cdn.com/assets/GeneralOccurrencesAll.csv"

# Neither attempt works: both readers treat the delimiter as a literal string,
# not as a regular expression
new <- readr::read_delim(datafile, delim = '[\\S],[\\S]')
new <- read.table(datafile, sep = '[\\S],[\\S]')
```
Should I be using read.fwf, or trying to pull out the problem column manually? Any help would be appreciated.
Edit:
For bonus points, I'd really like to build a function that detects malformed columns in a CSV file, or data that looks like it may break the structure of the file, as in this case. That way, I don't have to mess around with a text editor and can find these errors in the file programmatically. Thoughts on how to build something like this?
Using `pandas.read_csv` and a regex negative lookahead. The same regex should work in R as well. Filter `df` for rows in which `LOC` has a comma, to verify that we've parsed correctly:
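Here is a minimal sketch of that approach. It assumes the real field separators are commas that are *not* followed by a space, while the commas embedded in the location field are (as in "400 BLOCK 17TH ST, 5TH FL"), and it assumes the column is named `LOC` in the raw file:

```python
import pandas as pd

url = "http://daniels-pull.universityofdenv.netdna-cdn.com/assets/GeneralOccurrencesAll.csv"

# Negative lookahead: split only on commas that are NOT followed by whitespace.
# Embedded commas such as "400 BLOCK 17TH ST, 5TH FL" are followed by a space,
# so they stay inside the field. A regex separator requires the python engine.
df = pd.read_csv(url, sep=r',(?!\s)', engine='python')

# Verification: rows whose LOC still contains a comma were kept in one piece
# instead of spilling over into extra columns.
print(df[df['LOC'].str.contains(',', na=False)].head())
```

Note that this relies on every embedded comma being followed by a space; if any location field contains a comma with no trailing space, that row will still split into too many fields.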