Is there a way to control the data type used for storing DataFrame indexes with HDFStore.append?
It seems pandas indexes are always stored as 64-bit columns in the hdf5 file. I would like to increase storage efficiency and reduce the size of the index columns.
I have a unique 3-column MultiIndex, and saving the levels as int64 indexes is an enormous waste of space in my application:
Given the following data frame:
In [15]: df.dtypes
Out[15]:
indA       int32
indB       int16
indC        int8
data     float32
dtype: object
simply setting df.set_index(['indA', 'indB', 'indC']) before HDFStore.append results in indA, indB and indC being stored as Int64Col in the hdf5 file.
Not setting a pandas index and specifying PyTables data columns instead:
store.append('mytable', df, data_columns=['indA', 'indB', 'indC'])
stores indA, indB and indC with their original dtypes; however, an additional Int64Col is stored in the hdf5 file.
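A sketch of this data-columns variant (frame contents and file name are illustrative):

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "indA": np.arange(5, dtype=np.int32),
    "indB": np.arange(5, dtype=np.int16),
    "indC": np.arange(5, dtype=np.int8),
    "data": np.zeros(5, dtype=np.float32),
})

path = os.path.join(tempfile.mkdtemp(), "datacols.h5")
with pd.HDFStore(path, mode="w") as store:
    # No pandas index is set; the key columns are exposed as
    # searchable data_columns, which preserves their dtypes on disk.
    store.append("mytable", df, data_columns=["indA", "indB", "indC"])
    # Data columns can be queried directly via where=.
    sub = store.select("mytable", where="indA == 2")

print(sub["indB"].dtype)  # the narrow dtypes survive the round trip
```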
This does not really help as much as it should: storing the original dtypes shrinks the three ind
columns from 192 to 56 bits per row, but the additional (superfluous) index column claws back 64 bits of that saving...
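For concreteness, the per-row bit accounting I have in mind (pure arithmetic, no HDF5 involved):

```python
# Bits per row for the three key columns.
native = 32 + 16 + 8       # indA + indB + indC in their native dtypes
as_int64 = 3 * 64          # the same levels stored as Int64Col

extra_index = 64           # the superfluous int64 index column
saved = as_int64 - native  # saved by keeping native dtypes
net = saved - extra_index  # what remains after the extra column

print(native, saved, net)
```

So the data-columns layout still beats three Int64Cols overall, but 64 of the saved bits per row go to a column that carries no information.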
Any ideas?
Your approach is the correct one. Data columns provide the searching capability and provide dtype preservation. Index storage is pretty fixed at the moment.
Providing an option to store the index differently is an open issue, see here. I have done a bit of work on it, but it's not high priority ATM. You're welcome to take a look.
The usual answer to wanting to shrink storage space is to use compression. You seem to be jumping through a lot of hoops to save a relatively small amount of storage, but that's my 2c.
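For example (a sketch assuming PyTables with the blosc filter; the frame is made-up, highly compressible data, so the size difference is exaggerated):

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Highly repetitive data compresses extremely well.
n = 100_000
df = pd.DataFrame({
    "indA": np.zeros(n, dtype=np.int32),
    "indB": np.zeros(n, dtype=np.int16),
    "indC": np.zeros(n, dtype=np.int8),
    "data": np.zeros(n, dtype=np.float32),
})

tmp = tempfile.mkdtemp()
plain = os.path.join(tmp, "plain.h5")
packed = os.path.join(tmp, "packed.h5")

with pd.HDFStore(plain, mode="w") as store:
    store.append("mytable", df, data_columns=["indA", "indB", "indC"])

# complevel/complib apply to everything written to the store,
# including the 64-bit index column.
with pd.HDFStore(packed, mode="w", complevel=9, complib="blosc") as store:
    store.append("mytable", df, data_columns=["indA", "indB", "indC"])

print(os.path.getsize(plain), os.path.getsize(packed))
```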