How do you allow for text qualifiers using numpy genfromtxt?


I am currently trying to import some comma delimited text data into an array using the numpy library in Python. I am using the following code:

data = np.genfromtxt(fname, delimiter=',')

I get the following error:

Line #2 (got 12 columns instead of 11)

for every line after the header.

The reason appears to be that one of the columns contains commas, which the file handles by wrapping that column's values in text qualifiers ("). The Python csv library handles this by default, e.g.:

import csv
reader = csv.reader(open(fname, 'rb'))  # Python 2; on Python 3 use open(fname, newline='')

I know that I could import the data using the csv library and then convert it to an array, but I wondered if it is possible to do this from one of numpy's functions that convert text data to an array such as genfromtxt. I have checked out the help on genfromtxt but none of the arguments listed appear to describe what I was looking for, unless I am missing something.

In case it helps here is a sample of a few lines from the file:

survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S

It is the name column that I assume is causing the issue.


There are 2 answers

chthonicdaemon (best answer)

NumPy arrays are not well suited to mixed categorical and text data like you have here. You may be better off using pandas, whose read_csv treats double quotes as text qualifiers by default:

import pandas
data = pandas.read_csv(fname)
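
To make that concrete, here is a minimal sketch run against the sample rows from the question. The io.StringIO object only stands in for the real file, and the choice of numeric columns at the end is an illustrative assumption, not something the question asked for.

```python
import io

import pandas as pd

# Sample rows copied from the question; io.StringIO stands in for the file on disk.
sample = io.StringIO(
    'survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked\n'
    '0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S\n'
    '1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C\n'
    '1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S\n'
)

# quotechar='"' is already the default; it is spelled out here only for clarity.
data = pd.read_csv(sample, quotechar='"')

# The quoted name column survives as a single field, embedded comma included.
names = data["name"].tolist()

# If a plain NumPy array is still needed, pull out the numeric columns.
# (Which columns to keep is an assumption made for this example.)
numeric = data[["survived", "pclass", "age", "fare"]].to_numpy(dtype=float)
```

With a real file you would pass fname instead of the StringIO object.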

Lee

One way around this is to add another name field to the header, so that it has twelve fields, with separate surname and forename columns:

survived,pclass,surname,forename,sex,age,sibsp,parch,ticket,fare,cabin,embarked

If you then import like so:

data = np.genfromtxt(fname, delimiter=',', names=True, dtype=None)

It should work:

data['surname']
array(['"Braund', '"Cumings', '"Heikkinen'], 
      dtype='|S10')

Note that you may also want to strip the " marks out of the original file, since each surname keeps its leading quote and each forename its trailing one.
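
If editing the file by hand is not practical, another option is to let the csv module deal with the text qualifiers and hand the cleaned lines to genfromtxt, which accepts a list of strings as well as a filename. This is only a sketch: it assumes no field contains a tab character (tab is used as the replacement delimiter), and io.StringIO stands in for the real file.

```python
import csv
import io

import numpy as np

# Sample rows copied from the question; io.StringIO stands in for the real file.
raw = io.StringIO(
    'survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked\n'
    '0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S\n'
    '1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C\n'
    '1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S\n'
)

# csv.reader honours the double-quote qualifier, so each name stays in one field
# and the surrounding quotes are dropped.
rows = csv.reader(raw)

# Re-join every row with a tab. Assumption: no field contains a tab character.
tab_lines = ['\t'.join(row) for row in rows]

# genfromtxt now splits on tabs, so the commas inside the names are harmless.
data = np.genfromtxt(tab_lines, delimiter='\t', names=True,
                     dtype=None, encoding='utf-8')
```

The result is a structured array, so data['name'] gives the full names without the quote marks and there is no need to split the header into surname and forename.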