Save pandas DataFrame using h5py for interoperabilty with other hdf5 readers

Question

Save pandas DataFrame using h5py for interoperabilty with other hdf5 readers

20.7k views Asked by Phil At 11 June 2015 at 06:19

Here is a sample data frame:

import pandas as pd

NaN = float('nan')
ID = [1, 2, 3, 4, 5, 6, 7]
A = [NaN, NaN, NaN, 0.1, 0.1, 0.1, 0.1]
B = [0.2, NaN, 0.2, 0.2, 0.2, NaN, NaN]
C = [NaN, 0.5, 0.5, NaN, 0.5, 0.5, NaN]
columns = {'A':A, 'B':B, 'C':C}
df = pd.DataFrame(columns, index=ID)
df.index.name = 'ID'
print(df)

      A    B    C
ID               
1   NaN  0.2  NaN
2   NaN  NaN  0.5
3   NaN  0.2  0.5
4   0.1  0.2  NaN
5   0.1  0.2  0.5
6   0.1  NaN  0.5
7   0.1  NaN  NaN

I know that pandas has the pytables based HDFStore, which is an easy way to efficiently serialize/deserialize a data frame. But those datasets are not very easy to load directly using a reader h5py or matlab. How can I save a data frame using h5py, so that I can easily load it back using another hdf5 reader?

Original Q&A

There are 3 answers

Phil On 11 June 2015 at 06:22

Here is my approach to solving this problem. I am hoping either someone else has a better solution or my approach is helpful to others.

First, define function to make a numpy structure array (not a record array) from a pandas DataFrame.

import numpy as np
def df_to_sarray(df):
    """
    Convert a pandas DataFrame object to a numpy structured array.
    This is functionally equivalent to but more efficient than
    np.array(df.to_array())

    :param df: the data frame to convert
    :return: a numpy structured array representation of df
    """

    v = df.values
    cols = df.columns
    types = [(cols[i].encode(), df[k].dtype.type) for (i, k) in enumerate(cols)]
    dtype = np.dtype(types)
    z = np.zeros(v.shape[0], dtype)
    for (i, k) in enumerate(z.dtype.names):
        z[k] = v[:, i]
    return z

Use reset_index to make a new data frame that includes the index as part of its data. Convert that data frame to a structure array.

sa = df_to_sarray(df.reset_index())
sa

array([(1L, nan, 0.2, nan), (2L, nan, nan, 0.5), (3L, nan, 0.2, 0.5),
       (4L, 0.1, 0.2, nan), (5L, 0.1, 0.2, 0.5), (6L, 0.1, nan, 0.5),
       (7L, 0.1, nan, nan)], 
      dtype=[('ID', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

Save that structured array to an hdf5 file.

import h5py
with h5py.File('mydata.h5', 'w') as hf:
            hf['df'] = sa

Load the h5 dataset

with h5py.File('mydata.h5') as hf:
            sa2 = hf['df'][:]

Extract the ID column and delete it from sa2

import numpy.lib.recfunctions as nprec
ID = sa2['ID']
sa2 = nprec.drop_fields(sa2, 'ID')

Make data frame with index ID using sa2

df2 = pd.DataFrame(sa2, index=ID)
df2.index.name = 'ID'

print(df2)

      A    B    C
ID               
1   NaN  0.2  NaN
2   NaN  NaN  0.5
3   NaN  0.2  0.5
4   0.1  0.2  NaN
5   0.1  0.2  0.5
6   0.1  NaN  0.5
7   0.1  NaN  NaN

iipr On 03 April 2017 at 17:29

In case it is helpful for anyone, I took this post from Guillaume and Phil, and changed it a bit for my needs with the help of ankostis. We read the pandas DataFrame from a CSV file.

Mainly I adapted it for Strings, because you cannot store a object in a HDF5 file (I believe). Firstly check which columns types are numpy objects. Then check which is the longest length of that column, and fix that column to be a String of that length. The rest is quite similar to the other post.

def df_to_sarray(df):
    """
    Convert a pandas DataFrame object to a numpy structured array.
    Also, for every column of a str type, convert it into 
    a 'bytes' str literal of length = max(len(col)).

    :param df: the data frame to convert
    :return: a numpy structured array representation of df
    """

    def make_col_type(col_type, col):
        try:
            if 'numpy.object_' in str(col_type.type):
                maxlens = col.dropna().str.len()
                if maxlens.any():
                    maxlen = maxlens.max().astype(int) 
                    col_type = ('S%s' % maxlen, 1)
                else:
                    col_type = 'f2'
            return col.name, col_type
        except:
            print(col.name, col_type, col_type.type, type(col))
            raise

    v = df.values            
    types = df.dtypes
    numpy_struct_types = [make_col_type(types[col], df.loc[:, col]) for col in df.columns]
    dtype = np.dtype(numpy_struct_types)
    z = np.zeros(v.shape[0], dtype)
    for (i, k) in enumerate(z.dtype.names):
        # This is in case you have problems with the encoding, remove the if branch if not
        try:
            if dtype[i].str.startswith('|S'):
                z[k] = df[k].str.encode('latin').astype('S')
            else:
                z[k] = v[:, i]
        except:
            print(k, v[:, i])
            raise

    return z, dtype

So the workflow would be:

import h5py
import pandas as pd

# Read a CSV file
# Here we assume col_dtypes is a dictionary that contains the dtypes of the columns
df = pd.read_table('./data.csv', sep='\t', dtype=col_dtypes)
# Transform the DataFrame into a structured numpy array and get the dtype
sa, saType = df_to_sarray(df)

# Open/create the HDF5 file
f = h5py.File('test.hdf5', 'a')
# Save the structured array
f.create_dataset('someData', data=sa, dtype=saType)
# Retrieve it and check it is ok when you transform it into a pandas DataFrame
sa2 = f['someData'][:]
df2 = pd.DataFrame(sa2)
print(df2.head())
f.close()

Also, in this way you are able to see it from HDFView even when using gzip compression for instance.

**Jeff** · Accepted Answer · 2015-06-11T17:08:09+00:00

The pandas HDFStore format is standard HDF5, with just a convention for how to interpret the meta-data. Docs are here

In [54]: df.to_hdf('test.h5','df',mode='w',format='table',data_columns=True)

In [55]: h = h5py.File('test.h5')

In [56]: h['df']['table']
Out[56]: <HDF5 dataset "table": shape (7,), type "|V32">

In [64]: h['df']['table'][:]
Out[64]: 
array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),
       (4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),
       (7, 0.1, nan, nan)], 
      dtype=[('index', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])


In [57]: h['df']['table'].attrs.items()
Out[57]: 
[(u'CLASS', 'TABLE'),
 (u'VERSION', '2.7'),
 (u'TITLE', ''),
 (u'FIELD_0_NAME', 'index'),
 (u'FIELD_1_NAME', 'A'),
 (u'FIELD_2_NAME', 'B'),
 (u'FIELD_3_NAME', 'C'),
 (u'FIELD_0_FILL', 0),
 (u'FIELD_1_FILL', 0.0),
 (u'FIELD_2_FILL', 0.0),
 (u'FIELD_3_FILL', 0.0),
 (u'index_kind', 'integer'),
 (u'A_kind', "(lp1\nS'A'\na."),
 (u'A_meta', 'N.'),
 (u'A_dtype', 'float64'),
 (u'B_kind', "(lp1\nS'B'\na."),
 (u'B_meta', 'N.'),
 (u'B_dtype', 'float64'),
 (u'C_kind', "(lp1\nS'C'\na."),
 (u'C_meta', 'N.'),
 (u'C_dtype', 'float64'),
 (u'NROWS', 7)]

In [58]: h.close()

The data will be completely readable in any HDF5 reader. Some of the meta-data is pickled, so care must be taken.

TechQA.

Save pandas DataFrame using h5py for interoperabilty with other hdf5 readers

There are 3 answers

Related Questions in MATLAB

Related Questions in NUMPY

Related Questions in PANDAS

Related Questions in H5PY

Popular Questions

Popular Tags

Trending Questions