How can I add metadata to a numpy memmap array?


Is it possible to append a small amount of metadata to numpy memmap files?

That's the entirety of my question. For those interested, the details of my problem are below:

My dataset consists of a bunch of images and their corresponding multi-valued labels, for example:

Images: a 50000 x 96 x 96 x 3 array of uint8s

Labels: a 50000 x 5 array of ints

I'm saving these to a numpy record array of length 50000, and dtype (96, 96, 3) uint8, (5, ) int. This is great because I can save both arrays in a single memmap file using numpy.lib.format.open_memmap().
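For readers who want to see the layout concretely, here is a minimal sketch of that setup. The field names `image` and `label`, the temporary path, and the reduced record count are assumptions made for illustration; only `numpy.lib.format.open_memmap()` and the shapes come from the description above.

```python
import os
import tempfile

import numpy as np
from numpy.lib.format import open_memmap

# Structured dtype: one image and one label vector per record.
# The field names "image" and "label" are illustrative assumptions.
record_dtype = np.dtype([
    ("image", np.uint8, (96, 96, 3)),
    ("label", np.int64, (5,)),
])

n_records = 100  # small stand-in for the 50000 records in the question
path = os.path.join(tempfile.mkdtemp(), "dataset.npy")

# Create the .npy file and get back a writable memory-mapped array.
data = open_memmap(path, mode="w+", dtype=record_dtype, shape=(n_records,))
data["image"][0] = 255           # fill the first image with a constant
data["label"][0] = np.arange(5)  # labels 0..4 for the first record
data.flush()
del data

# Reopen read-only without loading the whole file into memory.
reopened = np.load(path, mmap_mode="r")
```

Both arrays live in one file, and each field can be accessed as `reopened["image"]` or `reopened["label"]`.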

The one thing missing is the ability to add a small amount of metadata to the file. Specifically, I want to designate the first N entries as the "training set" and the remaining 50000 - N entries as a "testing set". So at minimum, this requires a single int (N) to be added to the file. More generally, I want to allow for an arbitrary number of partitions, and also their partition names. For example, with 3 partitions, this would require saving the following additional data:

partition_names = ["testing set", "validation set", "training set"]
partition_sizes = [30000, 10000, 10000]  # last number redundant
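As a rough sketch of how this metadata would be used once it is available, the names and sizes can be turned into index ranges over the record array. The `partitions` dictionary is a hypothetical helper, not part of any numpy API.

```python
import numpy as np

partition_names = ["testing set", "validation set", "training set"]
partition_sizes = [30000, 10000, 10000]

# Cumulative boundaries between partitions, starting at record 0.
bounds = np.concatenate(([0], np.cumsum(partition_sizes)))

# Map each partition name to the slice of records it covers.
partitions = {
    name: slice(int(start), int(stop))
    for name, start, stop in zip(partition_names, bounds[:-1], bounds[1:])
}
```

With a memory-mapped record array `data`, `data[partitions["validation set"]]` would then select that partition without copying the rest of the file.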

Is there any way to add this metadata to a memmap file, while retaining my ability to memory-map the file using numpy.lib.format.open_memmap() or something similarly convenient?

PS: I used to use h5py, which is obviously much more amenable to storing such additional data, but its performance when reading out large images turned out to be terrible compared to numpy memmaps.


There are 2 answers

1
rth

As you note in the question, HDF5 (or NetCDF) would be a more suitable format for storing complex datasets with multiple arrays, metadata, etc.

HDF5 was developed for, and is used in, a number of high-performance applications. If you are getting much worse results than with a numpy memmap, it probably means that you are not using it efficiently.

Have a look at PyTables with, for instance, blosc compression (see for instance this post). There are a number of things you could fine-tune, if necessary, as explained in the optimization tips (see in particular figure 3).

0
Paul

It looks like a combination of a nonzero offset keyword argument to numpy.memmap and some binary file editing will do the trick.
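A minimal sketch of that idea, under some assumptions: the metadata is stored as JSON in a fixed-size block at the start of a raw binary file, and the array data begins right after it. The 4096-byte header size, the JSON encoding, the field names and the file name are all illustrative choices, not something numpy prescribes.

```python
import json
import os
import tempfile

import numpy as np

HEADER_BYTES = 4096  # assumed fixed size reserved for metadata

# Illustrative record layout and a small record count.
record_dtype = np.dtype([
    ("image", np.uint8, (96, 96, 3)),
    ("label", np.int64, (5,)),
])
n_records = 10

metadata = {
    "partition_names": ["testing set", "training set"],
    "partition_sizes": [6, 4],
}
header = json.dumps(metadata).encode("utf-8")
assert len(header) <= HEADER_BYTES, "metadata does not fit in the header"

path = os.path.join(tempfile.mkdtemp(), "dataset.bin")

# Write the padded header, then size the file to hold all records.
with open(path, "wb") as f:
    f.write(header.ljust(HEADER_BYTES, b" "))  # JSON ignores trailing spaces
    f.truncate(HEADER_BYTES + n_records * record_dtype.itemsize)

# Map only the array region by skipping the header with `offset`.
data = np.memmap(path, dtype=record_dtype, mode="r+",
                 offset=HEADER_BYTES, shape=(n_records,))
data["label"][3] = 7
data.flush()
del data

# Read the metadata back from the start of the file.
with open(path, "rb") as f:
    loaded = json.loads(f.read(HEADER_BYTES).decode("utf-8"))
```

The trade-off is that the result is no longer a .npy file, so numpy.lib.format.open_memmap() cannot read it, and the dtype and shape must either be known in advance or stored in the header as well.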