I have multiple HDF5 datasets saved in the same file, `my_file.h5`. These datasets have different dimensions, but the same number of observations in the first dimension:
```
features.shape = (1000000, 24, 7, 1)
labels.shape = (1000000,)
info.shape = (1000000, 4)
```
It is important that the info/label data stays correctly connected to each set of features, so I want to shuffle all three datasets with an identical seed. Furthermore, I would like to shuffle them without ever loading them fully into memory. Is that possible using numpy and h5py?
Shuffling arrays like this in `numpy` is straightforward: create the large shuffling index (shuffle `np.arange(1000000)`) and index the arrays. This isn't an in-place operation; `labels[I]` is a copy of `labels`, not a slice or view.
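A minimal sketch of that approach, assuming the arrays are already plain numpy arrays in memory (sizes are shrunk here just to keep the demo small):

```python
import numpy as np

# Stand-in arrays with the shapes from the question, scaled down.
n = 1000
features = np.random.rand(n, 24, 7, 1)
labels = np.random.randint(0, 10, size=n)
info = np.random.rand(n, 4)

# One shuffling index, reused for every array so rows stay aligned.
I = np.arange(n)
np.random.shuffle(I)

features = features[I, ...]   # each of these is a shuffled copy, not a view
labels = labels[I]
info = info[I, :]
```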
An alternative, assigning with the index on the left-hand side (`labels[I] = labels`), looks on the surface like it is an in-place operation. I doubt that it is, down in the C code. It has to be buffered, because the `I` values are not guaranteed to be unique. In fact, there is a special ufunc `.at` method for unbuffered operations.
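For instance, when the index contains repeats, buffered fancy assignment and the unbuffered `np.add.at` give different results:

```python
import numpy as np

a = np.zeros(3)
idx = np.array([0, 0, 1])   # note the repeated index 0

a[idx] += 1                 # buffered: each index counted once -> [1. 1. 0.]
np.add.at(a, idx, 1)        # unbuffered: every occurrence applied -> [3. 2. 0.]
print(a)
```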
But look at what `h5py` says about this same sort of 'fancy indexing': http://docs.h5py.org/en/latest/high/dataset.html#fancy-indexing
A `labels[I]` selection is implemented, but with restrictions. Your shuffled `I` is, by definition, not in increasing order, and it is very large. Also, I don't see anything in the docs about using this fancy indexing on the left-hand side, `labels[I] = ...`.
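A small sketch of that restriction in action (the file name `demo.h5` is made up, and the exact exception type may vary with the h5py version):

```python
import h5py
import numpy as np

with h5py.File("demo.h5", "w") as f:
    dset = f.create_dataset("labels", data=np.arange(10))

    # An increasing, duplicate-free index list is accepted...
    print(dset[[1, 3, 5]])            # -> [1 3 5]

    # ...but a shuffled index violates the documented restrictions.
    try:
        print(dset[[5, 1, 3]])
    except (TypeError, IndexError) as err:
        print("h5py rejected the out-of-order index:", err)
```

Which is why a fully shuffled rearrangement done purely on disk doesn't map cleanly onto h5py's fancy indexing.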