Pytables - simple h5 data

I'm finding it a lot harder to read in h5 data with pytables than I thought I would.

I can use HDFView to confirm that my h5 file is essentially a few 2D tables. That alone doesn't help, because I want to read them into Python and process them there as part of physical grids.

Documentation I've looked at seems to be much more complicated than what I need. Is there a simple example to read in a file with four entries ("Latitude, Longitude, fakeDim0, fakeDim1")?

I would think that it would be similar to e.g. pandas.read_csv or equivalents, where we just 'read it in and get a table'.

Am I missing something simple?

There are 2 answers

Dmitrii Malygin On

It's a little harder than with pandas, but you get used to it. Here is an example that uses PyTables to read the data:

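import tables as tb

with tb.open_file('your_file.h5', mode='r') as h5file:
    # 'entry_name' is a placeholder for the name of the table node
    # inside your HDF5 file (as shown in HDFView), not the file name
    table = h5file.root.entry_name

    # read the whole table into a NumPy structured array
    data = table.read()

    # access the columns by field name; this works only if the node
    # is a Table (compound dataset) that has these fields
    latitude = data['Latitude']
    longitude = data['Longitude']
    dim0 = data['fakeDim0']
    dim1 = data['fakeDim1']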
kcw78 On

HDF5 is a container with a user-defined data schema, so accessing the data depends on that schema. Some background is required: HDF5, PyTables, and NumPy use slightly different terminology for their data objects.

  • HDF5 stores data in "datasets". A dataset holds either homogeneous data or heterogeneous (sometimes called compound) data. When read through PyTables, heterogeneous data is laid out as a table: rows of records with named columns.
  • PyTables has 2 types of storage classes (object types): "Arrays" are used for homogeneous data (there are actually 4 types of arrays: Array, CArray, EArray, and VLArray). "Tables" are used for structured data (heterogeneous or compound data).
  • NumPy arrays can also store homogeneous or heterogeneous data. The data type is defined by the dtype attribute.
  • The table below maps the objects across the packages:

HDF5            PyTables   NumPy
heterogeneous   Table      shape=(Nrows,), dtype defines field/column names
homogeneous     Array      shape=any, dtype defines all data
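
The NumPy column of that mapping can be illustrated with a minimal sketch. It uses NumPy only (no HDF5 file is involved), the field names are borrowed from the question, and the values are made up for illustration.

```python
import numpy as np

# Heterogeneous (structured) array: one record per row, with named fields.
rec_dt = np.dtype([("Latitude", float), ("Longitude", float)])
records = np.zeros(3, dtype=rec_dt)
records["Latitude"] = [10.0, 20.0, 30.0]

print(records.shape)        # 1-D: one entry per row
print(records.dtype.names)  # the field/column names
print(records["Latitude"])  # one column, returned as a plain float array

# Homogeneous array: any shape, a single dtype for every element.
grid = np.zeros((2, 3), dtype=float)
print(grid.shape, grid.dtype)
```

A Table read with PyTables comes back in the first form, while an Array comes back in the second.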

That's why more information is required to write code specific to your file. Different PyTables functions are used to access homogeneous and heterogeneous datasets, AND the NumPy objects that are returned are slightly different.
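
If you are not sure which kind of dataset your file contains, one option is to list every dataset with its type, shape, and dtype before writing any read code. The following is a minimal sketch under that assumption; `describe_nodes` is a hypothetical helper name, not part of PyTables, and the file name in the usage comment is a placeholder.

```python
import tables as tb


def describe_nodes(filename):
    """Return (path, kind, shape, dtype) for every dataset in an HDF5 file."""
    summary = []
    with tb.open_file(filename, mode="r") as h5f:
        # walk_nodes visits every node below the root, including nested groups
        for node in h5f.walk_nodes("/"):
            if not isinstance(node, tb.Leaf):
                continue  # skip groups; only leaves hold data
            if isinstance(node, tb.Table):
                kind = "Table"  # heterogeneous/compound data
            else:
                kind = type(node).__name__  # e.g. Array, CArray, EArray
            summary.append((node._v_pathname, kind, node.shape, node.dtype))
    return summary


# Usage (placeholder file name):
# for path, kind, shape, dtype in describe_nodes("your_file.h5"):
#     print(path, kind, shape, dtype)
```

A "Table" entry means the data should be read as records with named fields; any other entry is read as a plain NumPy array.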

The code below creates and reads data from 2 very simple files. The 1st code segment creates 2 HDF5 files: the 1st has 1 heterogeneous/compound dataset (_1ds.h5), and the 2nd has 4 homogeneous datasets (_4ds.h5). You don't need to know how to create the files. Just run it and view the files in HDFView to see the data structure used in "read" code below. Pick the one that matches your file.

Code to create the files below:

import numpy as np
import tables as tb

# Create some data, saved in a recarray with field names from the post
names = ["Latitude", "Longitude", "fakeDim0", "fakeDim1"]
arr_dt = np.dtype({'names': names, 'formats': [float] * len(names)})
recarr = np.empty(shape=(10,), dtype=arr_dt)
recarr["Latitude"] = [10.*x for x in range(10)]
recarr["Longitude"] = [-10.*x for x in range(10)]
recarr["fakeDim0"] = [0.1*x for x in range(10)]
recarr["fakeDim1"] = [0.2*x for x in range(10)]

# File with 1 heterogeneous/compound dataset (a Table)
with tb.open_file('SO_75898559_1ds.h5', 'w') as h5f:
    h5f.create_table('/', 'Example_Table', obj=recarr)

# File with 4 homogeneous datasets (Arrays)
with tb.open_file('SO_75898559_4ds.h5', 'w') as h5f:
    for name in names:
        # extract each column from the recarray and save it as a separate dataset
        data = recarr[name]
        h5f.create_array('/', name, obj=data)

Code to read the files below:

import tables as tb

with tb.open_file('SO_75898559_1ds.h5', 'r') as h5f:
    # use natural naming to define path to table
    ex_table = h5f.root.Example_Table
    # OR use get_node()
    ex_table = h5f.get_node('/Example_Table')
    # read the whole table into a NumPy structured array
    recarr = ex_table.read()
    print(f'For Table {ex_table._v_name}; np.array type = {type(recarr)}')
    print(f'\tTable shape = {recarr.shape}')
    print(f'\tTable dtype = {recarr.dtype}')

    # to get an array of data from a field/column of recarr:
    arr_lat = recarr["Latitude"]
    # OR read the field directly from the Table
    arr_lat = ex_table.read(field="Latitude")
    print("\nLatitude data:\n", arr_lat)

    # to read data row-by-row from the table:
    print("\nrow data:")
    for row in ex_table:
        print([row[fname] for fname in ex_table.colnames])

print()
with tb.open_file('SO_75898559_4ds.h5', 'r') as h5f:
    # loop over every Array node directly under the root group
    for dset in h5f.iter_nodes('/', classname='Array'):
        arr = dset.read()
        print(f'For Array {dset._v_name}; np.array type = {type(arr)}')
        print(f'\tArray shape = {arr.shape}')
        print(f'\tArray dtype = {arr.dtype}')
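
For the "just read it in" workflow from the question, the loop over Array nodes can be wrapped in a small function that returns all arrays at once. This is only a sketch, assuming the file holds homogeneous datasets directly under one group; `read_arrays` is a hypothetical helper name, not a PyTables function.

```python
import tables as tb


def read_arrays(filename, where="/"):
    """Read every Array node directly under `where` into a dict of NumPy arrays."""
    with tb.open_file(filename, mode="r") as h5f:
        # Each node is read fully into memory before the file is closed.
        return {node._v_name: node.read()
                for node in h5f.iter_nodes(where, classname="Array")}


# Usage with the 4-dataset example file:
# arrays = read_arrays("SO_75898559_4ds.h5")
# print(arrays["Latitude"].shape)
```

Because every dataset is loaded fully into memory, this suits small files; for large grids, slicing the node directly (for example `node[0:10]`) reads only the part that is needed.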