Reading in Poor CSV File Structure

297 views Asked by At

I am trying to read in a large csv datafile (delimited by ,), and I keep on getting stuck on rows such as the following: link to raw file: "http://daniels-pull.universityofdenv.netdna-cdn.com/assets/GeneralOccurrencesAll.csv"

| RIN    | UCR_Group | Loc                                         |
|--------|-----------|---------------------------------------------|
|490658  | Property  | 400 BLOCK 17TH ST, 5TH FL                   |
|506928  | Disorder  | 1400 BLOCK W COLORADO AVE, COLORADO SPRINGS |

As you can see, the delimiter for the file is used in the column as well. Is there a way to specify the delimiter for a column as a regular expression to read in the file, or would I need to look into using read.fwf to find the max length of each field, and parse the data using that? Currently, this is my code that I have came up with so far:

datafile <- "http://daniels-pull.universityofdenv.netdna-cdn.com/assets/GeneralOccurrencesAll.csv"
new <-readr::read_delim(datafile, sep ='[\\S],[\\S]')  
new <-read.table(datafile, sep ='[\\S],[\\S]' )

Should I be using read.fwf, or trying to pull out the problem column manually? Any help would be appreciated.

Edit:

For Bonus points, I'd really like to build a function that would detect columns that are bad in a csv file, or data that looks like it may mess up the structure of the file, such as this case. That way, I don't have to mess around with a text editor, and can programmatically find these errors in the file. Thoughts on how to build something like this?

4

There are 4 answers

1
piRSquared On BEST ANSWER

Using panda.read_csv and regex negative look ahead. The same regex should work in R as well.

import pandas as pd

df = pd.read_csv(filename, sep=r',(?!\s)')

Filter df for rows in which LOC has a comma, to verify that we've parsed correctly:

df[df.LOC.str.contains(',')]

enter image description here

3
hrbrmstr On

You know which field has the non-escaped commas:

library(stringi)
library(purrr)

txt <- readr::read_lines("http://daniels-pull.universityofdenv.netdna-cdn.com/assets/GeneralOccurrencesAll.csv")
commas <- stri_locate_all_fixed(txt, ",")

map2_chr(txt[1:100], commas[1:100], function(x, y) {
  len <- nrow(y)
  bits <- c(1:6, (len-6):len)
  for (i in bits) { stri_sub(x, y[i,1], y[i,2]) <- ";" }
  x
}) -> rd

read.table(text=rd, header=TRUE, sep=";", stringsAsFactors=FALSE) %>%
  dplyr::glimpse()
## Observations: 99
## Variables: 14
## $ RIN             <int> 416667, 416673, 416674, 416680, 416684, 416686...
## $ UCR_Group       <chr> "Crimes Against Property", "Crimes Against Per...
## $ UCR_Cat         <chr> "Criminal Mischief/Damaged Property", "Forcibl...
## $ EXP_TRANSLATION <chr> "CRIMINAL MISCHIEF - MTR VEH", "SEX ASLT - RAP...
## $ OCC_DATE        <chr> "1/1/2010 0:00:00", "1/1/2010 0:00:00", "1/1/2...
## $ OCC_TIME        <int> 145, 300, 500, 730, 200, 440, 100, 851, 140, 2...
## $ LOC.TRANSLATION <chr> "200 BLOCK S ZENOBIA ST,VICTIM     ", "1500 BL...
## $ PIN             <int> 235602181, 219220590, 119013720, 174326399, 32...
## $ DOB             <chr> "5/6/1979 0:00:00", "3/19/1988 0:00:00", "5/25...
## $ SEX             <chr> "M", "M", "M", "M", "F", "F", "F", "F", "F", "...
## $ X               <dbl> 3126041, 3134341, 3134360, 3127695, 3193317, 3...
## $ Y               <dbl> 1684997, 1676360, 1700160, 1682545, 1708673, 1...
## $ LON             <dbl> -105.05, -105.02, -105.02, -105.04, -104.81, -...
## $ LAT             <dbl> 39.71, 39.68, 39.75, 39.70, 39.77, 39.78, 39.7...
5
G. Grothendieck On

Replace each comma that is surrounded by non-space with semicolon and then read in the result using read.csv2.

(Replace Lines with readLines(u) command to read it from u. Also if there are semicolons in the file then use a different character and specify it in the sep= argument to read.csv2 or just read.csv and in the second arg to gsub.)

read.csv2(text = gsub(",(\\S)", ";\\1", Lines)))

giving:

     RIN UCR_Group                                         Loc
1 490658  Property                   400 BLOCK 17TH ST, 5TH FL
2 506928  Disorder 1400 BLOCK W COLORADO AVE, COLORADO SPRINGS

Note: We used this as the input Lines:

Lines <- c("RIN,UCR_Group,Loc", 
  "490658,Property,400 BLOCK 17TH ST, 5TH FL", 
  "506928,Disorder,1400 BLOCK W COLORADO AVE, COLORADO SPRINGS")

Update: It seems in the actual file that a space can appear before a valid comma separator and there are one character fields so we modified the pattern accordingly. Here is the result for the first 3 lines of the file:

u <- "http://daniels-pull.universityofdenv.netdna-cdn.com/assets/GeneralOccurrencesAll.csv"
Lines <- readLines(u, 3)

read.csv2(text = gsub(",(\\S)", ";\\1", Lines))

giving:

     RIN               UCR_Group                            UCR_Cat
1 416667 Crimes Against Property Criminal Mischief/Damaged Property
2 416673  Crimes Against Persons              Forcible Sex Offenses
              EXP_TRANSLATION         OCC_DATE OCC_TIME                     LOC
1 CRIMINAL MISCHIEF - MTR VEH 1/1/2010 0:00:00      145  200 BLOCK S ZENOBIA ST
2             SEX ASLT - RAPE 1/1/2010 0:00:00      300 1500 BLOCK S DECATUR ST
  TRANSLATION       PIN               DOB SEX          X          Y     LON
1 VICTIM      235602181  5/6/1979 0:00:00   M 3126041.08 1684996.73 -105.05
2 ARRESTEE    219220590 3/19/1988 0:00:00   M 3134340.56 1676360.06 -105.02
    LAT
1 39.71
2 39.68
0
thelatemail On

Here's a worked example showing you can use a regex to parse this file, relying on the fact that the commas in the addresses have a space. This of course gets more complicated if this rule doesn't always hold:

txt <- "RIN,UCR_Group,Loc
123456,Property,1 STREET
490658,Property,400 BLOCK 17TH ST, 5TH FL
506928,Disorder,1400 BLOCK W COLORADO AVE, COLORADO SPRINGS"

dat <- readLines(textConnection(txt))
# in a real example:
# dat <- readLines("filename.csv")

spl <- strsplit(dat, "(?<=\\S),(?=\\S)", perl=TRUE)
setNames(data.frame(do.call(rbind, spl[-1])), spl[[1]])

#     RIN UCR_Group                                         Loc
#1 123456  Property                                    1 STREET
#2 490658  Property                   400 BLOCK 17TH ST, 5TH FL
#3 506928  Disorder 1400 BLOCK W COLORADO AVE, COLORADO SPRINGS