I have stored a huge dataframe, approximately 800 GB, in HDF5 via pandas with pandas.HDFStore():
import pandas as pd
store = pd.HDFStore('store.h5')
df = pd.DataFrame() # imagine the data being munged into a dataframe
store['df'] = df
I would like to query this with Impala. Is there a straightforward way to convert this data to Parquet? Or does Impala allow you to work with HDF5 directly? Is there another option for data stored in HDF5?
I haven't tried this myself, but here's a link showing how to convert an HDFStore to Parquet using Spark: https://gist.github.com/jiffyclub/905bf5e8bf17ec59ab8f.