Say my data looks like this
thisList = [
    [[13, 43, 21, 4], [33, 2, 111, 33332, 23, 43, 2, 2], [232, 2], [23, 11]],
    [[21, 2233, 2], [2, 3, 2, 1, 32, 22], [3]],
    [[3]],
    [[23, 12], [55, 3]],
    ....
]
What is the most space-efficient way to store this type of data?
I looked at Numpy files, but numpy only supports uniform length data
I looked at Hdf5, which has support for 1d ragged tensors, but not 2d
https://stackoverflow.com/a/42659049/3259896
So one option is to create a separate hdf5 file for every list in thisList, but I would potentially have 10-20 million of those lists.
I ran benchmarks saving a ragged nested list with JSON, BSON, Numpy, and HDF5.
TLDR: use compressed JSON, because it is the most space efficient and easiest to encode/decode.
On the synthetic data, here are the results (with du -sh test*):

Compressed JSON is the most efficient in terms of storage, and it is also the easiest to encode and decode because the ragged list does not have to be converted to a mapping. BSON comes in second, but the list has to be converted to a mapping first, which complicates encoding and decoding (and negates the encoding/decoding speed benefits of BSON over JSON). Numpy's compressed NPZ format is third best, but as with BSON, the ragged list must be made into a dictionary before saving. HDF5 is surprisingly large, especially when compressed. This is probably because there are many different datasets, and compression adds overhead to each dataset.
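As a rough illustration of the compressed-JSON approach, here is a minimal sketch that round-trips the ragged list from the question. It assumes gzip as the compression codec (the text above only says "compressed"), and the helper names save_ragged and load_ragged and the file name ragged.json.gz are hypothetical; this is not the benchmark code itself.

```python
import gzip
import json
import os
import tempfile


def save_ragged(path, data):
    """Write a ragged nested list as gzip-compressed JSON (gzip is an assumption)."""
    # Compact separators drop the spaces json.dump would otherwise emit.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(data, f, separators=(",", ":"))


def load_ragged(path):
    """Read the nested list back; no mapping or padding step is needed."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


# Sample data copied from the question (without the elided rows).
this_list = [
    [[13, 43, 21, 4], [33, 2, 111, 33332, 23, 43, 2, 2], [232, 2], [23, 11]],
    [[21, 2233, 2], [2, 3, 2, 1, 32, 22], [3]],
    [[3]],
    [[23, 12], [55, 3]],
]

# Hypothetical file name inside a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ragged.json.gz")
    save_ragged(path, this_list)
    restored = load_ragged(path)

assert restored == this_list
```

The nesting survives the round trip unchanged, which is the convenience the comparison above points to: the list goes straight into json.dump without first being turned into a dictionary.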
Benchmarks
Here is the relevant code for the benchmarking. The bson package is part of pymongo. I ran these benchmarks on a Debian Buster machine with an ext4 filesystem.

Versions of relevant packages: