numpy genfromtxt - missing data vs bad data

I'm using numpy genfromtxt, and I need to identify both missing data and bad data. Depending on user input, I may want to drop bad values or raise an error. Essentially, I want to treat missing and bad data as the same thing.

Say I have a file like this, where the columns have the data types date, int, and float:

date,id,value
2017-12-4,0,       # BAD. missing data
2017-12-4,1,XYZ    # BAD. value should be float, not string. 
2017-12-4,2,1.0    # good
2017-12-4,3,1.0    # good
2017-12-4,4,1.0    # good

I would like to detect both, so I do this:

import numpy as np

dtype = (np.dtype('<M8[D]'), np.dtype('int64'), np.dtype('float64'))
result = np.genfromtxt(filename, delimiter=',', dtype=dtype, names=True, usemask=True, usecols=('date', 'id', 'value'))

And the result is this

masked_array(data=[(datetime.date(2017, 12, 4), 0, --),
               (datetime.date(2017, 12, 4), 1, nan),
               (datetime.date(2017, 12, 4), 2, 1.0),
               (datetime.date(2017, 12, 4), 3, 1.0),
               (datetime.date(2017, 12, 4), 4, 1.0)],
         mask=[(False, False,  True), (False, False, False),
               (False, False, False), (False, False, False),
               (False, False, False)],
   fill_value=('NaT', 999999, 1.e+20),
        dtype=[('date', '<M8[D]'), ('id', '<i8'), ('value', '<f8')])

I thought the whole point of masked_array is that it can handle missing data AND bad data. But here, it's only handling missing data.

result['value'].mask

returns

array([ True, False, False, False, False])

The "bad" data actually still got into the array, as nan. I was hoping the mask would give me True True False False False.

To notice the bad value in the second row, I have to do additional work, such as checking for nan:

another_mask = np.isnan(result['value'])
good_result = result['value'][~another_mask]

Finally, this returns

masked_array(data=[1.0, 1.0, 1.0],
         mask=[False, False, False],
   fill_value=1e+20)

That works, but I feel like I'm doing something wrong. The whole point of a masked array is to find missing AND bad data, yet I'm only using it to find missing data and need my own check to find bad data. It feels ugly and un-Pythonic.

Is there a way to find both at the same time?

There is 1 answer

Answer by hpaulj

Playing around with a simple input:

In [143]: txt='''1,2
     ...: 3,nan
     ...: 4,foo
     ...: 5,
     ...: '''.splitlines()
In [144]: txt
Out[144]: ['1,2', '3,nan', '4,foo', '5,']

By designating a specific string as 'missing' (it may be a list?), I can mask it, along with blanks:

In [146]: np.genfromtxt(txt, delimiter=',', missing_values='foo',
     ...:               usemask=True, usecols=1)
Out[146]: 
masked_array(data=[2.0, nan, --, --],
             mask=[False, False,  True,  True],
       fill_value=1e+20)
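
On the "may be a list?" aside: as I read the NumPy user guide, a comma-separated string is split into several markers that apply to every column, whereas a list or tuple is matched to columns one item at a time. The sketch below assumes that reading and adds a hypothetical second junk string, 'XYZ', to the sample. It only helps when the junk strings are known in advance.

```python
import numpy as np

# The sample from above plus one extra row with a different junk string.
txt = ['1,2', '3,nan', '4,foo', '5,', '6,XYZ']

# Assumption: 'foo,XYZ' is split on the comma and both markers are applied
# to the column being read. Blank fields are treated as missing by default.
arr = np.genfromtxt(txt, delimiter=',', usecols=1, usemask=True,
                    missing_values='foo,XYZ')

print(arr.mask)  # True wherever the field was blank, 'foo' or 'XYZ'
```

A literal nan in the file is still parsed as a float and left unmasked here.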

It looks like it converted every value with float, but generated the mask from the strings (or lack thereof):

In [147]: _.data
Out[147]: array([ 2., nan, nan, nan])
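
Since the unparseable strings end up as nan in the underlying data while only the blanks reach the mask, one way to get a single "missing or bad" flag is to OR the two together. This is a minimal sketch, not a built-in genfromtxt feature; the names `col`, `bad` and `good` are mine, and it cannot tell a junk string from a legitimate nan in the file.

```python
import numpy as np

txt = ['1,2', '3,nan', '4,foo', '5,']
col = np.genfromtxt(txt, delimiter=',', usecols=1, usemask=True)

# getmaskarray always returns a full boolean array (never the nomask
# singleton); col.data is the plain ndarray behind the masked array.
bad = np.ma.getmaskarray(col) | np.isnan(col.data)

# Keep only the rows that were present and parsed to a real number.
good = col.data[~bad]
print(bad, good)
```

If only the combined masked array is wanted, `np.ma.masked_invalid(col)` should give a similar result by masking the non-finite entries as well.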

I can replace both types of 'missing' with a specific value. Since it's doing a float conversion, the fill has to be 100 or '100':

In [151]: np.genfromtxt(txt, delimiter=',', missing_values='foo',
     ...:               usecols=1, filling_values=100)
Out[151]: array([  2.,  nan, 100., 100.])

In a more complex case I can imagine writing a converter for the column. I've only dabbled in that feature.

The documentation for these parameters is slim, so figuring out what combinations work, and in what order, is a matter of trial and error (or a lot of code digging).
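
One documented parameter that seems relevant to the "raise an error" half of the question is `loose`, described as "If True, do not raise errors for invalid values". The sketch below assumes that passing `loose=False` makes an unparseable field raise ValueError while blank fields are still treated as missing; `load_strict` is a hypothetical helper name, and the behaviour is worth checking against your NumPy version.

```python
import numpy as np

def load_strict(lines):
    """Read the second column, letting genfromtxt raise on junk fields."""
    return np.genfromtxt(lines, delimiter=',', usecols=1,
                         usemask=True, loose=False)

txt = ['1,2', '3,nan', '4,foo', '5,']

try:
    load_strict(txt)
    raised = False
except ValueError as exc:
    # Expected to be triggered by the 'foo' field.
    raised = True
    print('bad field:', exc)

# Without the junk row, the blank field is masked rather than rejected.
clean = load_strict(['1,2', '5,'])
print(clean.mask)
```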

More details in the follow-up question: numpy genfromtxt - how to detect bad int input values