I'm using numpy genfromtxt, and I need to identify both missing data and bad data. Depending on user input, I may want to drop the bad values or raise an error. Essentially, I want to treat missing and bad data as the same thing.
Say I have a file like this, where the columns have the data types "date, int, float":
date,id,value
2017-12-4,0, # BAD. missing data
2017-12-4,1,XYZ # BAD. value should be float, not string.
2017-12-4,2,1.0 # good
2017-12-4,3,1.0 # good
2017-12-4,4,1.0 # good
I would like to detect both, so I do this:
import numpy as np
dtype = (np.dtype('<M8[D]'), np.dtype('int64'), np.dtype('float64'))
result = np.genfromtxt(filename, delimiter=',', dtype=dtype, names=True, usemask=True, usecols=('date', 'id', 'value'))
And the result is this
masked_array(data=[(datetime.date(2017, 12, 4), 0, --),
(datetime.date(2017, 12, 4), 1, nan),
(datetime.date(2017, 12, 4), 2, 1.0),
(datetime.date(2017, 12, 4), 3, 1.0),
(datetime.date(2017, 12, 4), 4, 1.0)],
mask=[(False, False, True), (False, False, False),
(False, False, False), (False, False, False),
(False, False, False)],
fill_value=('NaT', 999999, 1.e+20),
dtype=[('date', '<M8[D]'), ('id', '<i8'), ('value', '<f8')])
I thought the whole point of masked_array is that it can handle missing data AND bad data. But here, it's only handling missing data.
result['value'].mask
returns
array([ True, False, False, False, False])
The "bad" data actually still got into the array, as nan. I was hoping the mask would give me True True False False False.
To realize that there is a bad value in the 2nd row, I need to do additional work, like checking for nan:
another_mask = np.isnan(result['value'])
good_result = result['value'][~another_mask]
Finally, this returns
masked_array(data=[1.0, 1.0, 1.0],
mask=[False, False, False],
fill_value=1e+20)
That works, but I feel like I'm doing something wrong. The whole point of masked_array is to find missing AND bad data, but I'm somehow only using it to find missing data, and I need my own check to find bad data. It feels ugly and unpythonic.
Is there a way to find both at the same time?
Playing around with a simple input: by specifying a specific string as 'missing' (it may be a list?), I can 'mask' it, along with blanks.
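As a minimal sketch of that kind of experiment (the inline sample data and the marker string 'foo' are my own illustrative assumptions, not the original input), `missing_values` adds a string to the set that gets masked, while a blank field is masked by default:

```python
import io

import numpy as np

# Hypothetical sample input: one good value, one literal 'nan',
# one marker string ('foo'), and one blank field.
text = "1,2\n3,nan\n4,foo\n5,\n"

masked = np.genfromtxt(io.StringIO(text), delimiter=",",
                       missing_values="foo", usemask=True)

# The mask is built from the strings: 'foo' and the blank are masked,
# while the literal 'nan' parses as a float and stays unmasked.
print(masked.mask[:, 1])

# The underlying data still went through the float conversion.
print(masked.data[:, 1])
```

Any other unparseable string that is not listed in `missing_values` would not be masked here, which matches what the question observed with XYZ.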
It looks like it converted all values with float, but generated the mask based on the strings (or lack thereof).
I can replace both types of 'missing' with a specific value. Since it's doing a float conversion, the fill has to be 100 or '100'.
In a more complex case I can imagine writing a converter for the column. I've only dabbled in that feature.
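A sketch of the filling behaviour described above, again on made-up inline data with 'foo' as the assumed marker string: `filling_values` supplies the value that ends up in the data wherever the float conversion fails, while the mask is still derived from the strings.

```python
import io

import numpy as np

# Hypothetical sample input: a good value, a marker string, and a blank.
text = "1,2\n4,foo\n5,\n"

filled = np.genfromtxt(io.StringIO(text), delimiter=",",
                       missing_values="foo", filling_values=100,
                       usemask=True)

# Underlying data: both kinds of 'missing' now carry the fill value.
print(filled.data[:, 1])

# The mask still flags the same two entries.
print(filled.mask[:, 1])
```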
The documentation for these parameters is slim, so figuring out what combinations work, and in what order, is a matter of trial and error (or a lot of code digging).
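When the parameters alone don't give the mask you want, one fallback is to fold the nan check into the mask after loading. This is only a sketch, assuming a float column and inline data that mimics the question's id and value columns; the names `bad` and `clean` are mine.

```python
import io

import numpy as np

# Rows mimic the question: blank value, non-numeric value, then good values.
text = "0,\n1,XYZ\n2,1.0\n3,1.0\n4,1.0\n"
result = np.genfromtxt(io.StringIO(text), delimiter=",", usemask=True)
values = result[:, 1]

# Missing entries are already masked; unparseable ones arrive as nan.
# OR-ing the two conditions gives a single 'missing or bad' mask.
bad = np.ma.getmaskarray(values) | np.isnan(np.ma.getdata(values))
clean = np.ma.masked_array(np.ma.getdata(values), mask=bad)

print(bad)
print(clean.compressed())
```

Note that this also flags a literal nan in the file as bad, so it only fits when nan is never a legitimate value.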
More details in the follow-up question: numpy genfromtxt - how to detect bad int input values