numpy - Change/Specify dtypes of masked array columns


I have a csv-file containing a lot of data that I want to read as a masked array. I've done so using the following:

data=np.recfromcsv(filename,case_sensitive=True,usemask=True)

which works just fine. However, my problem is that the data are either strings, integers, or floats. What I want to do now is convert all the integers into floats, i.e. turn all the "1"s into "1.0"s etc. while preserving everything else.

Additionally, I am looking for a generic solution. So simply specifying the desired types manually won't do since the csv-file (including the number of columns) changes.

I've tried astype, but since the array also contains string entries it doesn't work. Or am I missing something?

Thanks.

There is 1 answer

hpaulj (best answer)

I haven't used recfromcsv, but looking at its code I see it uses np.genfromtxt, followed by a masked records construction.

I'd suggest giving a small sample csv text (3 or so lines) and showing the resulting data. We need to see the dtype in particular.

It may also be useful to start with genfromtxt, skipping the masked array stuff for now. I don't think masking is the sticking point when converting dtypes in structured arrays.

In any case, we need something more concrete to explore.

You can't change the dtype of structured fields in-place. You have to make a new array with a new dtype, and copy values from the old to the new.
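To make the copy step concrete, here is a minimal sketch. The sample array and the hand-written target dtype are made up for illustration; only the pattern (allocate with the new dtype, then copy field by field) is the point.

```python
import numpy as np

# Hypothetical structured array: one bytes-string field, two integer fields.
old = np.array([(b'a', 1, 2), (b'b', 6, 9)],
               dtype=[('f0', 'S1'), ('f1', '<i4'), ('f2', '<i4')])

# Target dtype written by hand here: same field names, ints replaced by float64.
new_dtype = np.dtype([('f0', 'S1'), ('f1', '<f8'), ('f2', '<f8')])

# Allocate an empty array with the new dtype and copy each field across.
# NumPy casts the integer values to float during the assignment.
new = np.empty(old.shape, dtype=new_dtype)
for name in old.dtype.names:
    new[name] = old[name]
```

The loop runs over fields, not rows, so it stays cheap even for long arrays.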

import numpy.lib.recfunctions as rf

This module has some functions that can help when changing structured arrays.

===========

I suspect that it will be simpler to spell out the dtypes when calling genfromtxt than to change dtypes in an existing array.

You could do one trial read with dtype=None and a limited number of lines to get the column count and base dtype. Then edit that dtype, substituting floats for ints as needed, and read the whole file with the new dtype. Look in the recfunctions code if you need ideas on how to edit dtypes.

For example:

In [504]: txt=b"""a, 1, 2, 4\nb, 6, 9, 10\nc, 4, 4, 3"""

In [506]: arr = np.genfromtxt(txt.splitlines(), dtype=None, delimiter=',')
In [507]: arr
Out[507]: 
array([(b'a', 1, 2, 4), (b'b', 6, 9, 10), (b'c', 4, 4, 3)], 
      dtype=[('f0', 'S1'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')])
In [508]: arr.dtype.descr
Out[508]: [('f0', '|S1'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')]

A crude dtype editor:

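def foo(tup):
    # tup is one (name, typestring) pair from arr.dtype.descr
    name, dtype = tup
    dtype = dtype.replace('S', 'U')  # bytes string -> unicode string
    dtype = dtype.replace('i', 'f')  # signed int -> float of the same byte width
    return name, dtype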

And applying this to default dtype:

In [511]: dt = [foo(tup) for tup in arr.dtype.descr]
In [512]: dt
Out[512]: [('f0', '|U1'), ('f1', '<f4'), ('f2', '<f4'), ('f3', '<f4')]

In [513]: arr = np.genfromtxt(txt.splitlines(), dtype=dt, delimiter=',')
In [514]: arr
Out[514]: 
array([('a', 1.0, 2.0, 4.0), ('b', 6.0, 9.0, 10.0), ('c', 4.0, 4.0, 3.0)], 
      dtype=[('f0', '<U1'), ('f1', '<f4'), ('f2', '<f4'), ('f3', '<f4')])

In [522]: arr = np.recfromcsv(txt.splitlines(), dtype=dt, delimiter=',',case_sensitive=True,usemask=True,names=None)
In [523]: arr
Out[523]: 
masked_records(
    f0 : ['a' 'b' 'c']
    f1 : [1.0 6.0 4.0]
    f2 : [2.0 9.0 4.0]
    f3 : [4.0 10.0 3.0]
    fill_value : ('N', 1.0000000200408773e+20, 1.0000000200408773e+20, 1.0000000200408773e+20)
              )
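A somewhat less crude variant of the same idea, as a sketch: it checks each field's dtype.kind instead of doing string replacement, and uses genfromtxt's max_rows for the trial read. The helper name ints_to_floats, the choice of float64, and the explicit encoding='utf-8' are my assumptions, not something the question specifies. Note that a trial read that is too short can infer string widths that are too narrow, or ints for a column that contains floats further down.

```python
import numpy as np

def ints_to_floats(dtype):
    """Return a structured dtype like `dtype`, with integer fields turned into float64."""
    return np.dtype([
        (name, np.float64 if dtype[name].kind in 'iu' else dtype[name])
        for name in dtype.names
    ])

txt = "a, 1, 2, 4\nb, 6, 9, 10\nc, 4, 4, 3"
lines = txt.splitlines()

# First pass: read only a couple of rows and let genfromtxt infer the columns.
sample = np.genfromtxt(lines, dtype=None, delimiter=',',
                       max_rows=2, encoding='utf-8')
dt = ints_to_floats(sample.dtype)

# Second pass: read everything with the edited dtype.
arr = np.genfromtxt(lines, dtype=dt, delimiter=',', encoding='utf-8')
```

If a masked array is wanted, the second read should accept usemask=True in the same way as the recfromcsv call above.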

=====================

astype works if the target dtype is compatible with the existing one (same number of fields, with convertible types). For example, if I read the txt with dtype=None and then apply the derived dt, it works:

In [530]: arr = np.genfromtxt(txt.splitlines(), delimiter=',',dtype=None)
In [531]: arr
Out[531]: 
array([(b'a', 1, 2, 4), (b'b', 6, 9, 10), (b'c', 4, 4, 3)], 
      dtype=[('f0', 'S1'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')])
In [532]: arr.astype(dt)
Out[532]: 
array([('a', 1.0, 2.0, 4.0), ('b', 6.0, 9.0, 10.0), ('c', 4.0, 4.0, 3.0)], 
      dtype=[('f0', '<U1'), ('f1', '<f4'), ('f2', '<f4'), ('f3', '<f4')])

Same for arr.astype('U3,int,float,int'), which also has 4 compatible fields.
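Since the question asks for something generic, the target dtype for astype can also be derived from the array itself. A sketch, assuming float64 is the float type you want and that only integer fields should change:

```python
import numpy as np

arr = np.array([(b'a', 1, 2, 4), (b'b', 6, 9, 10), (b'c', 4, 4, 3)],
               dtype=[('f0', 'S1'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')])

# Derive the target dtype from whatever fields the array happens to have:
# integer fields become float64, everything else keeps its dtype.
target = np.dtype([
    (name, np.float64 if arr.dtype[name].kind in 'iu' else arr.dtype[name])
    for name in arr.dtype.names
])

# astype returns a new array; the original is left unchanged.
converted = arr.astype(target)
```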