Shuffling multiple HDF5 datasets in-place


I have multiple HDF5 datasets saved in the same file, my_file.h5. These datasets have different dimensions, but the same number of observations in the first dimension:

features.shape = (1000000, 24, 7, 1)
labels.shape = (1000000,)
info.shape = (1000000, 4)

It is important that the info/label data is correctly connected to each set of features and I therefore want to shuffle these datasets with an identical seed. Furthermore, I would like to shuffle these without ever loading them fully into memory. Is that possible using numpy and h5py?


There are 3 answers

hpaulj

Shuffling arrays like this in numpy is straightforward.

Create the large shuffling index (shuffle np.arange(1000000)) and use it to index the arrays:

features = features[I, ...]
labels = labels[I]
info = info[I, :]

This isn't an in-place operation: labels[I] is a copy of labels, not a slice or view.
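For reference, here is a minimal, self-contained sketch of that in-memory approach, using a seeded generator so the same permutation can be reused for every array (the sizes are shrunk for illustration and the arrays are stand-ins, not the asker's data):

import numpy as np

rng = np.random.default_rng(42)        # fixed seed, so the permutation is reproducible
n = 1000
features = np.zeros((n, 24, 7, 1))     # stand-ins for the real datasets
labels = np.zeros(n)
info = np.zeros((n, 4))

I = rng.permutation(n)                 # the shuffling index

features = features[I, ...]            # each of these is a copy, not an in-place shuffle
labels = labels[I]
info = info[I, :]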

An alternative

features[I,...] = features

looks on the surface like an in-place operation. I doubt that it is, down in the C code: the assignment has to be buffered, because the I values are not guaranteed to be unique. In fact, there is a special ufunc .at method precisely for unbuffered operations.

But look at what h5py says about this same sort of 'fancy indexing':

http://docs.h5py.org/en/latest/high/dataset.html#fancy-indexing

labels[I] selection is implemented, but with restrictions:

List selections may not be empty
Selection coordinates must be given in increasing order
Duplicate selections are ignored
Very long lists (> 1000 elements) may produce poor performance

Your shuffled I is, by definition, not in increasing order, and it is very large.

Also, I don't see anything about using this fancy indexing on the left-hand side, labels[I] = ....
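To make the restriction concrete, a small hedged illustration (the file and dataset names come from the question; the exact failure mode can vary with the h5py version):

import numpy as np
import h5py

with h5py.File('my_file.h5', 'r') as f:
    labels = f['labels']
    I = np.random.permutation(labels.shape[0])

    subset = labels[np.sort(I[:100])]  # increasing, unique indices: this kind of selection is allowed
    # labels[I]                        # a fully shuffled index breaks the increasing-order rule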

rth

Shuffling arrays on disk will be time-consuming, as it means allocating new arrays in the HDF5 file and then copying all the rows in a different order. If you want to avoid loading all the data into memory at once, you can iterate over rows (or chunks of rows) with PyTables or h5py, as in the sketch below.
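One possible shape for that chunked copy, sketched under a few assumptions (dataset names from the question, an arbitrary chunk size, and sorted reads to satisfy h5py's increasing-order rule; the scattered reads can still be slow):

import numpy as np
import h5py

chunk = 10000                                    # rows held in memory at a time (illustrative)

with h5py.File('my_file.h5', 'r') as src, h5py.File('shuffled.h5', 'w') as dst:
    n = src['features'].shape[0]
    rng = np.random.default_rng(0)               # one seed, so every dataset gets the same order
    perm = rng.permutation(n)

    for name in ('features', 'labels', 'info'):
        ds_in = src[name]
        ds_out = dst.create_dataset(name, shape=ds_in.shape, dtype=ds_in.dtype)
        for start in range(0, n, chunk):
            rows = perm[start:start + chunk]     # source rows for this output chunk
            order = np.argsort(rows)             # h5py wants increasing indices...
            block = ds_in[rows[order]]           # ...so read them in sorted order
            ds_out[start:start + len(rows)] = block[np.argsort(order)]  # then restore the shuffled order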

An alternative approach could be to keep your data as it is and simply map new row numbers to old row numbers in a separate array (which you can keep fully loaded in RAM, since it will be only about 4 MB with your array sizes). For instance, to shuffle a numpy array x,

import numpy as np

x = np.random.rand(5)
idx_map = np.arange(x.shape[0])
np.random.shuffle(idx_map)

Then you can use advanced numpy indexing to access your shuffled data,

x[idx_map[2]]  # equivalent to x_shuffled[2]
x[idx_map]     # equivalent to x_shuffled[:], etc.

This will also work with arrays saved to HDF5. Of course there will be some overhead compared to writing the shuffled arrays to disk, but it could be sufficient depending on your use case.
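A rough sketch of how that index map could be used against an h5py dataset (the file and dataset names are taken from the question; the sorted batch read is my own workaround for h5py's increasing-order rule, not part of the original answer):

import numpy as np
import h5py

with h5py.File('my_file.h5', 'r') as f:
    features = f['features']
    idx_map = np.arange(features.shape[0])
    np.random.shuffle(idx_map)

    row = features[idx_map[2]]               # one "shuffled" row, read lazily from disk
    batch = features[np.sort(idx_map[:32])]  # a random batch of rows; re-shuffle it in memory
                                             # if the order inside the batch matters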

Andres Romero
import numpy as np
import h5py

data = h5py.File('original.h5py', 'r')

with h5py.File('output.h5py', 'w') as out:
    # one permutation, reused for every dataset so rows stay aligned
    indexes = np.arange(data['some_dataset_in_original'].shape[0])
    np.random.shuffle(indexes)
    for key in data.keys():
        print(key)
        # note: np.take reads the whole dataset into memory before it is written back out
        feed = np.take(data[key], indexes, axis=0)
        out.create_dataset(key, data=feed)

data.close()