Pandas HDFStore: changing dtype of indexes


Does there exist a way of controlling the data type that is used for storing indexes of data frames when using HDFStore.append?

It seems pandas indexes are always stored as 64-bit columns in the HDF5 file. I would like to increase storage efficiency and reduce the size of the index columns.

I have a unique 3-column MultiIndex, and saving the levels as 64-bit integers is an enormous waste of space in my application.

Given the following data frame:

In [15]: df.dtypes
Out[15]:
indA       int32
indB       int16
indC        int8
data     float32
dtype: object

simply calling df = df.set_index(['indA', 'indB', 'indC']) before HDFStore.append results in indA, indB and indC being stored as Int64Col in the HDF5 file.

Not setting a pandas index and specifying pytables data columns instead:

store.append('mytable', df, data_columns=['indA', 'indB', 'indC'])

indA, indB and indC are stored with their original dtypes; however, an additional Int64Col (for the default integer index) is stored in the HDF5 file.

This does not really help: storing the three ind columns in their original dtypes takes only 56 bits per row instead of 192, but the additional (superfluous) index column costs another 64 bits per row...
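
The second approach can be sketched the same way (again with illustrative names; requires PyTables). It shows both halves of the trade-off: the data columns keep their narrow dtypes, but a 64-bit "index" column is written alongside them:

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "indA": np.arange(3, dtype="int32"),
    "indB": np.arange(3, dtype="int16"),
    "indC": np.arange(3, dtype="int8"),
    "data": np.arange(3, dtype="float32"),
})

path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with pd.HDFStore(path) as store:
    # Leave the default RangeIndex in place; make the three columns data columns
    store.append("mytable", df, data_columns=["indA", "indB", "indC"])
    coldtypes = store.get_storer("mytable").table.coldtypes
    print({name: str(dt) for name, dt in coldtypes.items()})
    # indA/indB/indC keep int32/int16/int8 on disk, but an extra
    # 64-bit "index" column (the RangeIndex) is stored as well
```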

Any ideas?

1 Answer

Jeff (accepted answer):

Your approach is the correct one. Data columns provide searching capability and preserve dtypes. Index storage is pretty fixed at the moment.
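
For example, data columns can be used in a where= expression when selecting (a sketch with illustrative data; only columns passed as data_columns are queryable this way):

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "indA": np.array([1, 2, 3], dtype="int32"),
    "indB": np.array([10, 20, 30], dtype="int16"),
    "indC": np.array([0, 1, 0], dtype="int8"),
    "data": np.array([0.5, 1.5, 2.5], dtype="float32"),
})

path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with pd.HDFStore(path) as store:
    store.append("mytable", df, data_columns=["indA", "indB", "indC"])
    # Query on-disk: only the matching rows are read back
    hits = store.select("mytable", where="indA > 1 & indC == 0")
    print(hits)
```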

Providing an option to control whether the index is stored is an open issue, see here. I have done a bit of work on it, but it's not a high priority ATM. You're welcome to take a look.

The usual answer to wanting to shrink storage space is to use compression. You seem to be jumping through a lot of hoops to save a relatively small amount of storage, but that's my 2c.
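
Compression can be enabled store-wide via complevel/complib when opening the HDFStore; a sketch with zlib (the level and library choice are illustrative):

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Highly compressible demo data
df = pd.DataFrame({
    "indA": np.arange(1000, dtype="int32"),
    "indB": np.arange(1000, dtype="int16"),
    "indC": np.zeros(1000, dtype="int8"),
    "data": np.zeros(1000, dtype="float32"),
})

d = tempfile.mkdtemp()
plain = os.path.join(d, "plain.h5")
packed = os.path.join(d, "packed.h5")

with pd.HDFStore(plain) as store:
    store.append("mytable", df, data_columns=["indA", "indB", "indC"])

# complevel/complib apply to every table written to this store
with pd.HDFStore(packed, complevel=9, complib="zlib") as store:
    store.append("mytable", df, data_columns=["indA", "indB", "indC"])

print(os.path.getsize(plain), os.path.getsize(packed))
```

The compressed round-trip is lossless, so this often saves more space than hand-tuning individual column dtypes would.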