h5py doesn't support NumPy dtype('U') (Unicode) and pandas doesn't support NumPy dtype('O')

I'm trying to create a .h5 file with a dataset that contains the data from a .dat file. My first approach used NumPy:

import numpy as np
import h5py

filename = 'VAL220408-invparms.dat'
datasetname = 'EM27_104_COCCON_VAL/220408'

dtvec = [float for i in range(149)]  # My data file has 149 columns
dtvec[1] = str
dtvec[2] = str  # Set the dtype of the second and third columns

dataset = np.genfromtxt(filename,skip_header=0,names=True,dtype=dtvec)

fh5 = h5py.File('my_data.h5', 'w')
fh5.create_dataset(datasetname,data=dataset)
fh5.flush()
fh5.close()

But when running I get the error:

TypeError: No conversion path for dtype: dtype('<U')

If I don't specify the dtype, everything else is fine: the dataset is in order and the numerical values are correct, but the second and third columns contain only NaN, which I don't want.

I read that h5py does not support NumPy's Unicode string dtype, so I assumed that a pandas DataFrame would work instead. My pandas code looks like this:

import h5py
import numpy as np
import pandas as pd

filename = 'VAL220408-invparms.dat'
datasetname = 'EM27_104_COCCON_VAL/220408'

df = pd.read_csv(filename, header=0, sep=r"\s+")

fh5 = h5py.File('my_data.h5', 'w')
fh5.create_dataset(datasetname,data=df)
fh5.flush()
fh5.close()

But then I get the error:

TypeError: Object dtype dtype('O') has no native HDF5 equivalent

Then I found that pandas has a method that writes a DataFrame to an .h5 file, so instead of using the h5py library I did:

df.to_hdf('my_data.h5','datasetname',format='table',mode='a')

But the data ends up scattered across many tables inside the .h5 file.

I would really like some help getting the second and third columns stored as what they really are: strings.

I'm using Python 3.8

Thank you very much for reading.

There are 2 answers

Ozzluis Hernandez (best answer)

I just figured it out.

The h5py docs say to declare string fields with h5py's own string dtype:

h5py.string_dtype(encoding='utf-8', length=None)

So in my first piece of code I replaced the two str entries with:

dtvec[1] = h5py.string_dtype(encoding='utf-8', length=None) 
dtvec[2] = h5py.string_dtype(encoding='utf-8', length=None) 
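
For anyone who wants to see the whole round trip, here is a minimal sketch of the same idea. It assumes a made-up three-field layout in place of the real 149-column file, builds the array by hand instead of with np.genfromtxt(), and writes to a temporary file called sketch.h5; all of those are placeholders, not part of the original code.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical 3-field layout standing in for the real 149-column file.
str_dt = h5py.string_dtype(encoding='utf-8', length=None)
row_dt = np.dtype([('float1', '<f8'), ('str1', str_dt), ('str2', str_dt)])

# Made-up sample rows; the real data would come from np.genfromtxt().
rows = np.array([(1.0, 'abc', 'de'), (2.5, 'fghij', 'k')], dtype=row_dt)

# Temporary location so the sketch does not overwrite a real file.
path = os.path.join(tempfile.mkdtemp(), 'sketch.h5')

# The context manager closes (and flushes) the file on exit.
with h5py.File(path, 'w') as fh5:
    fh5.create_dataset('EM27_104_COCCON_VAL/220408', data=rows)

# Read the compound dataset back into a NumPy structured array.
with h5py.File(path, 'r') as fh5:
    stored = fh5['EM27_104_COCCON_VAL/220408'][()]

# Depending on the h5py version, string fields may come back as bytes
# rather than str, so decode them if you need text.
first_str = stored['str1'][0]
print(stored.dtype)
print(first_str)
```

The `with` blocks replace the explicit flush() and close() calls from the question.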

Hope this is helpful to someone reading this question.

kcw78

To clarify, the problem comes from NumPy's Unicode string type ('U'), which HDF5 (and therefore h5py) does not support. Details here: h5py: What about NumPy's U type?

When you define your string fields (columns) as str, you get Unicode values. You can verify with the following:

dtvec = [float for i in range(149)]  # My data file has 149 columns
dtvec[1] = str
dtvec[2] = str  # Set the dtype of the second and third columns
dataset = np.genfromtxt(filename,names=True,dtype=dtvec)
print(dataset.dtype)

Output will look like this. The <U fields are where you have Unicode values. The Unicode values in fields 'str1' and 'str2' caused your original error.

[('float1', '<f8'), ('str1', '<U'), ('str2', '<U'), ('float2', '<f8').....]
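
If an array has already been read with '<U' fields, one possible workaround is to cast those fields to fixed-length byte strings before writing. In the sketch below, unicode_fields_to_bytes is a hypothetical helper (not a NumPy or h5py function) and the sample array is made up. The cast only works for ASCII text; NumPy raises UnicodeEncodeError for anything else.

```python
import numpy as np


def unicode_fields_to_bytes(arr):
    """Return a copy of a structured array with 'U' fields cast to 'S' fields.

    Hypothetical helper for illustration; assumes ASCII-only string content.
    """
    new_descr = []
    for name in arr.dtype.names:
        field_dt = arr.dtype[name]
        if field_dt.kind == 'U':
            # NumPy stores 4 bytes per Unicode character.
            n_chars = max(field_dt.itemsize // 4, 1)
            new_descr.append((name, f'S{n_chars}'))
        else:
            new_descr.append((name, field_dt))
    return arr.astype(new_descr)


# Made-up sample resembling the dtype printed above.
sample = np.array(
    [(1.0, 'abc', 'de', 2.0)],
    dtype=[('float1', '<f8'), ('str1', '<U5'), ('str2', '<U5'), ('float2', '<f8')],
)
converted = unicode_fields_to_bytes(sample)
print(converted.dtype)
```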

When you switch to h5py.string_dtype(), h5py knows how to store the values as UTF-8 encoded strings, which HDF5 supports. Setting length=None gives variable-length strings, which NumPy represents with an object dtype ('O') that is tagged so h5py recognizes it as a string type. Details here: h5py: Variable-length strings

dtvec[1] = h5py.string_dtype(encoding='utf-8', length=None) 
dtvec[2] = h5py.string_dtype(encoding='utf-8', length=None) 
dataset = np.genfromtxt(filename,names=True,dtype=dtvec)
print(dataset.dtype)

Output will look like this. The O fields are where you have variable-length strings:

[('float1', '<f8'), ('str1', 'O'), ('str2', 'O'), ('float2', '<f8').....]
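
Both an h5py string dtype and a plain NumPy object dtype print as 'O', so the printed dtype alone does not tell you whether h5py can store the field. A small sketch using h5py.check_string_dtype() shows the difference (the variable names are illustrative). A plain object dtype, such as the one pandas uses for text columns, carries no string tag, which is consistent with the "no native HDF5 equivalent" error in the question.

```python
import h5py
import numpy as np

# Object dtype tagged by h5py as a variable-length UTF-8 string.
vlen_str_dt = h5py.string_dtype(encoding='utf-8', length=None)
# Plain NumPy object dtype with no h5py tag.
plain_obj_dt = np.dtype('O')

# Returns a small info tuple (encoding, length) for string dtypes, else None.
vlen_info = h5py.check_string_dtype(vlen_str_dt)
plain_info = h5py.check_string_dtype(plain_obj_dt)

print(vlen_str_dt.kind, vlen_info)
print(plain_obj_dt.kind, plain_info)
```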

You can also define fixed-length byte strings. (I used 5 because that's the length of the strings in my test data.)

dtvec[1] = h5py.string_dtype(encoding='utf-8', length=5) 
dtvec[2] = h5py.string_dtype(encoding='utf-8', length=5) 
# alternate definition, same result
# dtvec[1] = 'S5'
# dtvec[2] = 'S5'

dataset = np.genfromtxt(filename,names=True,dtype=dtvec)
print(dataset.dtype)

Output will look like this. The S5 fields are where you have byte strings:

[('float1', '<f8'), ('str1', 'S5'), ('str2', 'S5'), ('float2', '<f8').....]
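
One thing to watch with fixed-length fields is that the length is counted in bytes and NumPy silently truncates anything longer. The sketch below uses made-up values to show how the required length could be computed from the UTF-8 encoded data, and what happens when 'S5' is too short.

```python
import numpy as np

# Made-up values: one fits in 5 bytes, one has a non-ASCII character
# (2 bytes in UTF-8), and one is simply too long.
values = ['abcde', 'se\u00f1or', 'toolongvalue']
encoded = [v.encode('utf-8') for v in values]

# Size the field from the longest encoded value.
needed = max(len(b) for b in encoded)
fixed = np.array(encoded, dtype=f'S{needed}')

# With a field that is too short, NumPy truncates without warning.
truncated = np.array(encoded, dtype='S5')

print(needed)
print(fixed)
print(truncated)
```

Here the second value no longer decodes as valid UTF-8 after truncation, because the cut falls in the middle of a multi-byte character.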

As an aside on np.genfromtxt(), you don't have to define the dtype. If you set dtype=None, the dtype of each column is inferred individually from its contents. This is handy when you don't know the data types in advance. Here is an example for your data:

dataset = np.genfromtxt(filename,names=True,dtype=None)
print(dataset.dtype)

Output will look like this. Because I did not set the encoding= parameter above, the string fields come back as byte strings, and np.genfromtxt() issues a VisibleDeprecationWarning in that case. However, you can write this data to HDF5.

[('float1', '<f8'), ('str1', 'S5'), ('str2', 'S5'), ('float2', '<f8').....]
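
If you do pass encoding=, the string columns come back as Unicode ('<U') fields instead, which brings back the original problem when writing with h5py. A sketch with made-up in-memory data standing in for the .dat file, followed by one way to encode a single column back to bytes:

```python
import io

import numpy as np

# Made-up whitespace-delimited text standing in for the real .dat file.
sample = io.StringIO(
    "float1 str1 str2 float2\n"
    "1.0 abcde fghij 2.0\n"
    "3.0 klmno pqrst 4.0\n"
)

# With an explicit encoding, text columns are inferred as Unicode fields.
data = np.genfromtxt(sample, names=True, dtype=None, encoding='utf-8')
print(data.dtype)

# Encode one Unicode column to UTF-8 byte strings.
str1_bytes = np.char.encode(data['str1'], 'utf-8')
print(str1_bytes.dtype)
```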