How does one convert an HDF5 file into a Parquet file?


I have stored a huge dataframe (approximately 800 GB) in HDF5 via pandas, using pandas.HDFStore().

import pandas as pd
store = pd.HDFStore('store.h5')
df = pd.DataFrame()  # imagine the data being munged into a dataframe
store['df'] = df

I would like to query this with Impala. Is there a straightforward way to parse this data into Parquet? Or does Impala allow you to work with HDF5 directly? Is there another option for data on HDF5?


There is 1 answer

John Readey

I haven't tried this myself, but here's a link showing how to convert an HDFStore to Parquet using Spark: https://gist.github.com/jiffyclub/905bf5e8bf17ec59ab8f.