I have multiple HDF5 datasets saved in the same file, my_file.h5. These datasets have different dimensions, but the same number of observations in the first dimension:
features.shape = (1000000, 24, 7, 1)
labels.shape = (1000000,)
info.shape = (1000000, 4)
It is important that the info/label data is correctly connected to each set of features and I therefore want to shuffle these datasets with an identical seed. Furthermore, I would like to shuffle these without ever loading them fully into memory. Is that possible using numpy and h5py?
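For concreteness, a file with that layout could be created like this (a sketch; the dtypes are my assumption):

```python
import h5py

# Sketch of the assumed layout of my_file.h5 (dtypes are guesses)
with h5py.File('my_file.h5', 'w') as f:
    f.create_dataset('features', shape=(1000000, 24, 7, 1), dtype='f4')
    f.create_dataset('labels', shape=(1000000,), dtype='i8')
    f.create_dataset('info', shape=(1000000, 4), dtype='f4')
```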
Shuffling arrays like this in `numpy` is straightforward: create the large shuffling index (shuffle `np.arange(1000000)`) and index the arrays. This isn't an in-place operation; `labels[I]` is a copy of `labels`, not a slice or view.
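A minimal sketch of that `numpy`-only approach (toy shapes and my own variable names, assuming everything fits in memory):

```python
import numpy as np

# Toy stand-ins for the real arrays (shapes shrunk for the example)
features = np.arange(10 * 2).reshape(10, 2)
labels = np.arange(10)

rng = np.random.default_rng(seed=42)  # one seed -> one shared permutation
I = rng.permutation(10)               # a shuffled np.arange(10)

features = features[I]  # a copy, not a view
labels = labels[I]      # rows stay aligned because both use the same I
```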
An alternative, `labels[I] = labels`, looks on the surface like an in-place operation. I doubt that it is, down in the C code: it has to be buffered, because the `I` values are not guaranteed to be unique. (In fact, there is a special `ufunc.at` method for unbuffered operations.)
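To see why buffering matters when indices repeat, compare the buffered `+=` with the unbuffered `np.add.at` (a small illustration):

```python
import numpy as np

a = np.zeros(3)
I = np.array([0, 0, 1])  # duplicate index 0

a[I] += 1           # buffered: the two writes to a[0] collapse into one
print(a)            # [1. 1. 0.]

b = np.zeros(3)
np.add.at(b, I, 1)  # unbuffered ufunc.at: both increments land
print(b)            # [2. 1. 0.]
```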
But look at what `h5py` says about this same sort of 'fancy indexing': http://docs.h5py.org/en/latest/high/dataset.html#fancy-indexing

`labels[I]` selection is implemented, but with restrictions; in particular, the selection coordinates must be given in increasing order. Your shuffled `I` is, by definition, not in increasing order, and it is very large. Also, I don't see anything there about using this fancy indexing on the left-hand side, `labels[I] = ...`.
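One way to live with those restrictions is to keep the single shared permutation but apply it a chunk at a time: read each chunk's rows in sorted order (which h5py accepts), restore the shuffled order in memory, and write the result to a new file. A sketch under those assumptions (dataset names from the question; the output filename and `chunk` size are mine):

```python
import numpy as np
import h5py

n = 1000000
chunk = 10000                       # rows held in memory at once

rng = np.random.default_rng(seed=42)
I = rng.permutation(n)              # one permutation shared by all datasets

with h5py.File('my_file.h5', 'r') as src, h5py.File('shuffled.h5', 'w') as dst:
    for name in ('features', 'labels', 'info'):
        d_in = src[name]
        d_out = dst.create_dataset(name, shape=d_in.shape, dtype=d_in.dtype)
        for start in range(0, n, chunk):
            idx = I[start:start + chunk]
            order = np.argsort(idx)             # positions that sort idx
            rows = d_in[np.sort(idx)]           # h5py wants increasing indices
            inv = np.empty_like(order)
            inv[order] = np.arange(len(order))  # inverse of the sort
            d_out[start:start + chunk] = rows[inv]
```

Note the docs also warn that long index lists (> 1000 elements) may perform poorly, so the chunk size trades memory use against that overhead.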