How can I serialize a numpy array while preserving matrix dimensions?


numpy.ndarray.tostring doesn't seem to preserve information about matrix dimensions (see this question), requiring the user to issue a call to numpy.ndarray.reshape.

Is there a way to serialize a numpy array to JSON format while preserving this information?

Note: The arrays may contain ints, floats or bools. It's reasonable to expect a transposed array.

Note 2: this is being done with the intent of passing the numpy array through a Storm topology using streamparse, in case such information ends up being relevant.


There are 9 answers

3
user2357112 On BEST ANSWER

pickle.dumps or numpy.save encode all the information needed to reconstruct an arbitrary NumPy array, even in the presence of endianness issues, non-contiguous arrays, or weird structured dtypes. Endianness issues are probably the most important; you don't want array([1]) to suddenly become array([16777216]) because you loaded your array on a big-endian machine. pickle is probably the more convenient option, though save has its own benefits, given in the npy format rationale.
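To make the endianness point concrete, here is a minimal sketch of what goes wrong when raw bytes are shipped without their byte order. The explicit '<i4' and '>i4' dtypes are my choice for illustration, so the result is the same on any machine:

```python
import numpy as np

# Explicit little-endian 32-bit integer, so the result does not depend on the host.
a = np.array([1], dtype='<i4')
raw = a.tobytes()  # b'\x01\x00\x00\x00', with no dtype or byte-order information

# Reading the same bytes as big-endian silently yields a different number.
misread = np.frombuffer(raw, dtype='>i4')
print(misread)  # [16777216]

# Reading them with the original byte order recovers the value.
reread = np.frombuffer(raw, dtype='<i4')
print(reread)  # [1]
```

pickle and numpy.save avoid this because the dtype, including its byte order, travels with the data.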

I'm giving options for serializing to JSON or a bytestring, because the original questioner needed JSON-serializable output, but most people coming here probably don't.

The pickle way:

import json
import pickle
import numpy as np

a = np.arange(6).reshape(2, 3)  # stand-in for any NumPy array

# Bytestring option
serialized = pickle.dumps(a)
deserialized_a = pickle.loads(serialized)

# JSON option
# latin-1 maps byte n to unicode code point n
serialized_as_json = json.dumps(pickle.dumps(a).decode('latin-1'))
deserialized_from_json = pickle.loads(json.loads(serialized_as_json).encode('latin-1'))

numpy.save uses a binary format, and it needs to write to a file, but you can get around that with io.BytesIO:

import io
import json
import numpy

a = numpy.arange(6).reshape(2, 3)  # stand-in for any NumPy array
memfile = io.BytesIO()
numpy.save(memfile, a)

serialized = memfile.getvalue()
# latin-1 maps byte n to unicode code point n
serialized_as_json = json.dumps(serialized.decode('latin-1'))

And to deserialize:

memfile = io.BytesIO()

# If you're deserializing from a bytestring:
memfile.write(serialized)
# Or if you're deserializing from JSON:
# memfile.write(json.loads(serialized_as_json).encode('latin-1'))
memfile.seek(0)
a = numpy.load(memfile)
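
The numpy.save steps above can be wrapped into a pair of helpers. This is only a sketch: the names array_to_json and array_from_json are made up for illustration, and the transposed test array is my own choice, picked because the question mentions transposed arrays.

```python
import io
import json

import numpy as np


def array_to_json(arr):
    """Serialize an array with numpy.save and wrap the bytes in a JSON string."""
    memfile = io.BytesIO()
    np.save(memfile, arr)
    # latin-1 maps byte n to code point n, so the bytes survive unchanged
    return json.dumps(memfile.getvalue().decode('latin-1'))


def array_from_json(text):
    """Inverse of array_to_json."""
    memfile = io.BytesIO(json.loads(text).encode('latin-1'))
    return np.load(memfile)


original = np.arange(6, dtype=np.int32).reshape(2, 3).T  # transposed, non-contiguous
restored = array_from_json(array_to_json(original))

print(restored.shape, restored.dtype)  # (3, 2) int32
print(np.array_equal(original, restored))  # True
```

Shape and dtype come back without any call to reshape, because the npy header records both.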
4
daniel451 On

EDIT: As one can read in the comments on the question, this solution deals with "normal" numpy arrays (floats, ints, bools ...) and not with multi-type structured arrays.

Solution for serializing a numpy array of any dimensions and data types

As far as I know, you cannot simply serialize a numpy array with any data type and any dimension, but you can store its data type, shape and raw data in a list representation and then serialize that list using JSON.

Imports needed:

import json
import base64
import numpy

For encoding you could use (nparray is some numpy array of any data type and any dimensionality):

# tobytes() yields the raw data in C order; b64encode returns bytes, so decode to str for JSON
json.dumps([str(nparray.dtype), base64.b64encode(nparray.tobytes()).decode('ascii'), nparray.shape])

After this you get a JSON dump (string) of your data, containing a list representation of its data type and shape as well as the array's contents, base64-encoded.

And for decoding this does the work (encStr is the encoded JSON string, loaded from somewhere):

# get the encoded json dump
enc = json.loads(encStr)

# build the numpy data type
dataType = numpy.dtype(enc[0])

# decode the base64 encoded numpy array data and create a new numpy array with this data & type
dataArray = numpy.frombuffer(base64.b64decode(enc[1]), dataType)

# restore the original shape; reshape returns a new array, so the result must be kept
if len(enc) > 2:
    dataArray = dataArray.reshape(enc[2])

JSON dumps are efficient and cross-compatible, but dumping only the array's contents as JSON loses the data type and shape, which leads to unexpected results when you load arrays of arbitrary type and dimension.

This solution stores and loads numpy arrays regardless of the type or dimension and also restores it correctly (data type, dimension, ...)

I tried several solutions myself months ago and this was the only efficient, versatile solution I came across.
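
For reference, the approach above could be packaged as a round-trip pair roughly like this. Treat it as a sketch under assumptions: encode_array and decode_array are hypothetical names, and I store dtype.str rather than str(dtype) so the byte order is recorded as well.

```python
import base64
import json

import numpy as np


def encode_array(arr):
    # tobytes() copies the data in C order, so non-contiguous views work too
    data = base64.b64encode(arr.tobytes()).decode('ascii')
    return json.dumps([arr.dtype.str, data, arr.shape])


def decode_array(text):
    dtype_str, data, shape = json.loads(text)
    flat = np.frombuffer(base64.b64decode(data), dtype=np.dtype(dtype_str))
    # frombuffer returns a read-only view; copy() makes the result writable
    return flat.reshape(shape).copy()


# Transposed boolean array, as mentioned in the question
original = (np.arange(6).reshape(2, 3) % 2 == 0).T
restored = decode_array(encode_array(original))

print(restored.shape, restored.dtype)  # (3, 2) bool
print(np.array_equal(original, restored))  # True
```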

2
Ken On

Try using numpy.array_repr or numpy.array_str.
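
A quick look at what these functions produce, assuming NumPy's default print options. Note that the text form is meant for display: large arrays are abbreviated, so it is not a reliable round-trip format.

```python
import numpy as np

small = np.arange(6).reshape(2, 3)
small_text = np.array_repr(small)
print(small_text)
# array([[0, 1, 2],
#        [3, 4, 5]])

# Above the default print threshold (1000 elements) the text is abbreviated,
# so the full data cannot be recovered from it.
large_text = np.array_repr(np.arange(2000))
print('...' in large_text)  # True
```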

0
VoteCoffee On

This wraps the pickle-based answer by @user2357112 for easier JSON integration

The code below will encode it as base64. It will handle numpy arrays of any type/size without needing to remember what it was. It will also handle other arbitrary objects that can be pickled.

import numpy as np
import json
import pickle
import codecs

class PythonObjectEncoder(json.JSONEncoder):
    def default(self, obj):
        return {
            '_type': str(type(obj)),
            'value': codecs.encode(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL), "base64").decode('latin1')
            }

class PythonObjectDecoder(json.JSONDecoder):
    def __init__(self, *args, **kwargs):
        json.JSONDecoder.__init__(self, object_hook=self.object_hook, *args, **kwargs)

    def object_hook(self, obj):
        if '_type' in obj:
            try:
                return pickle.loads(codecs.decode(obj['value'].encode('latin1'), "base64"))
            except KeyError:
                return obj
        return obj


# Create arbitrary array
originalNumpyArray = np.random.normal(size=(3, 3))
print(originalNumpyArray)

# Serialization
numpyData = {
   "array": originalNumpyArray
   }
encodedNumpyData = json.dumps(numpyData, cls=PythonObjectEncoder)
print(encodedNumpyData)

# Deserialization
decodedArrays = json.loads(encodedNumpyData, cls=PythonObjectDecoder)
finalNumpyArray = decodedArrays["array"]

# Verify
print(finalNumpyArray)
print(np.allclose(originalNumpyArray, finalNumpyArray))
print((originalNumpyArray==finalNumpyArray).all())
0
SemanticBeeng On

Try traitschema https://traitschema.readthedocs.io/en/latest/

"Create serializable, type-checked schema using traits and Numpy. A typical use case involves saving several Numpy arrays of varying shape and type."

1
throws_exceptions_at_you On

Try numpy-serializer:

Download

pip install numpy-serializer

Usage

import numpy_serializer as ns
import numpy as np

a = np.random.normal(size=(50,120,150))
b = ns.to_bytes(a)
c = ns.from_bytes(b)
assert np.array_equal(a,c)
0
Chris.Wilson On

If it needs to be human readable and you know that this is a numpy array:

import json
import numpy as np

a = np.random.normal(size=(50,120,150))
a_reconstructed = np.asarray(json.loads(json.dumps(a.tolist())))
print(np.allclose(a, a_reconstructed))
print((a == a_reconstructed).all())

Maybe not the most efficient as the array sizes grow larger, but works for smaller arrays.
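
One caveat worth showing: the nested list keeps the shape, but not the dtype. A minimal sketch of the issue and one possible workaround follows; the 'dtype' and 'data' keys are just names I picked for illustration.

```python
import json

import numpy as np

a = np.arange(6, dtype=np.float32).reshape(2, 3)

# The nested list keeps the shape, but the dtype is not recorded.
plain = np.asarray(json.loads(json.dumps(a.tolist())))
print(plain.shape, plain.dtype)  # (2, 3) float64

# Storing the dtype name next to the data lets the loader restore it.
payload = json.dumps({'dtype': str(a.dtype), 'data': a.tolist()})
loaded = json.loads(payload)
restored = np.asarray(loaded['data'], dtype=loaded['dtype'])
print(restored.dtype, np.array_equal(a, restored))  # float32 True
```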

0
Rebs On

I found the code in Msgpack-numpy helpful. https://github.com/lebedov/msgpack-numpy/blob/master/msgpack_numpy.py

I modified the serialised dict slightly and added base64 encoding to reduce the serialised size.

By using the same interface as json (providing load(s),dump(s)), you can provide a drop-in replacement for json serialisation.

This same logic can be extended to add any automatic non-trivial serialisation, such as datetime objects.


EDIT: I've written a generic, modular parser that does this and more. https://github.com/someones/jaweson


My code is as follows:

np_json.py

from json import *
import json
import numpy as np
import base64

def to_json(obj):
    if isinstance(obj, (np.ndarray, np.generic)):
        if isinstance(obj, np.ndarray):
            return {
                # b64encode returns bytes; decode to str so json can emit it
                '__ndarray__': base64.b64encode(obj.tobytes()).decode('ascii'),
                'dtype': obj.dtype.str,
                'shape': obj.shape,
            }
        elif isinstance(obj, (np.bool_, np.number)):
            return {
                '__npgeneric__': base64.b64encode(obj.tobytes()).decode('ascii'),
                'dtype': obj.dtype.str,
            }
    if isinstance(obj, set):
        return {'__set__': list(obj)}
    if isinstance(obj, tuple):
        return {'__tuple__': list(obj)}
    if isinstance(obj, complex):
        return {'__complex__': obj.__repr__()}

    # Not a supported type: raise TypeError, as json's default handler would
    raise TypeError('Unable to serialise object of type {}'.format(type(obj)))


def from_json(obj):
    # check for numpy
    if isinstance(obj, dict):
        if '__ndarray__' in obj:
            # frombuffer returns a read-only view, so copy to get a writable array
            return np.frombuffer(
                base64.b64decode(obj['__ndarray__']),
                dtype=np.dtype(obj['dtype'])
            ).reshape(obj['shape']).copy()
        if '__npgeneric__' in obj:
            return np.frombuffer(
                base64.b64decode(obj['__npgeneric__']),
                dtype=np.dtype(obj['dtype'])
            )[0]
        if '__set__' in obj:
            return set(obj['__set__'])
        if '__tuple__' in obj:
            return tuple(obj['__tuple__'])
        if '__complex__' in obj:
            return complex(obj['__complex__'])

    return obj

# over-write the load(s)/dump(s) functions
def load(*args, **kwargs):
    kwargs['object_hook'] = from_json
    return json.load(*args, **kwargs)


def loads(*args, **kwargs):
    kwargs['object_hook'] = from_json
    return json.loads(*args, **kwargs)


def dump(*args, **kwargs):
    kwargs['default'] = to_json
    return json.dump(*args, **kwargs)


def dumps(*args, **kwargs):
    kwargs['default'] = to_json
    return json.dumps(*args, **kwargs)

You should be able to then do the following:

import numpy as np
import np_json as json
np_data = np.zeros((10,10), dtype=np.float32)
new_data = json.loads(json.dumps(np_data))
assert (np_data == new_data).all()
0
thayne On

According to this benchmark, Msgpack has the best serialization performance: http://www.benfrederickson.com/dont-pickle-your-data/

Use msgpack-numpy. See https://github.com/lebedov/msgpack-numpy

Install it:

pip install msgpack-numpy

Then:

import msgpack
import msgpack_numpy as m
import numpy as np

x = np.random.rand(5)
x_enc = msgpack.packb(x, default=m.encode)
x_rec = msgpack.unpackb(x_enc, object_hook=m.decode)