Efficient Data Serialization format for list of arrays in Python

167 views Asked by At

I have a large list of arrays (data type float32, with a few instances of int) in Python that is to be serialized via UTF-8 encoding and saved on a server. However I'm having issues with the size of the saved file exceeding storage limits.

The available serialization formats that the server can handle is: string, bytes, JSON, and XML. Which of these data formats would be best for saving the data structure?

1

There are 1 answers

0
shiv On

I think you should also explore using h5 files. I'd just honestly convert list of lists into a numpy array, and then save the numpy array to an h5 file.

You can save like :->

import numpy as np
import h5py

nparr = np.asanyarray(yourArr)
h5f = h5py.File('listData.h5', 'w')
h5f.create_dataset('dataset_1', data=nparr)
h5f.close()

You can load like:

h5f = h5py.File('listData.h5','r')
loadData = h5f['dataset_1'][:]
h5f.close()

H5 files are optimized for storing matrix-like data.

If you need some helper scripts, try out some I made here(shameless plug): https://github.com/ss4328/h5_manager_scripts

If h5 files are not. your dig, JSON is the best implementation. Tons of free libraries to generate/use the data. Pretty sleek on the web, and not to mention they're human readable too. XML is a bit outdated.