I have a big `numpy.int32` array that can take 4 GB or more. It is in fact a 24-bit integer array (quite common in audio applications), but since `numpy.int24` doesn't exist, I used an `int32`.
I want to output this array's data as 24-bit, i.e. 3 bytes per number, to a file.
This works (I found this "recipe" somewhere a while ago, but I can't find where anymore):
```python
import numpy as np

x = np.array([[-33772, -2193], [13313, -1314], [20965, -1540],
              [10706, -5995], [-37719, -5871]], dtype=np.int32)

# Shift each int32 by 0, 8 and 16 bits and mask the low byte of each result:
# this extracts the 3 little-endian bytes of every 24-bit value.
data = ((x.reshape(x.shape + (1,)) >> np.array([0, 8, 16])) & 255).astype(np.uint8)

print(data.tostring())  # .tostring() is an older alias of .tobytes()
# b'\x14|\xffo\xf7\xff\x014\x00\xde\xfa\xff\xe5Q\x00\xfc\xf9\xff\xd2)\x00\x95\xe8\xff\xa9l\xff\x11\xe9\xff'
```
but the `reshape`/broadcasting trick makes it inefficient when `x` is a few GB large: it takes a huge amount of unneeded RAM for the intermediate arrays.

Another solution would be to remove every 4th byte:
```python
s = bytes([c for i, c in enumerate(x.tostring()) if i % 4 != 3])
# b'\x14|\xffo\xf7\xff\x014\x00\xde\xfa\xff\xe5Q\x00\xfc\xf9\xff\xd2)\x00\x95\xe8\xff\xa9l\xff\x11\xe9\xff'
```
It works, but I suspect that if `x` takes, say, 4 GB of RAM, this line would eat at least 8 GB of RAM, for both `s` and `x` (and maybe also for `x.tostring()`?).
TL;DR: how can I efficiently write an `int32` array to disk as a 24-bit array, by removing every 4th byte, without using twice the size of the actual data in RAM?

Note: this is possible since the integers are in fact 24-bit, i.e. each value is smaller than 2^23 in absolute value.
After some more fiddling, I found this to be working:
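The idea matches the TL;DR above: view the `int32` buffer as raw bytes, drop every 4th byte with a slicing view (no copy), and write the resulting view with `tofile`. A minimal sketch of that approach, not necessarily the exact code, where `x2` and the output file name are just illustrative:

```python
import numpy as np

x = np.array([[-33772, -2193], [13313, -1314], [20965, -1540],
              [10706, -5995], [-37719, -5871]], dtype=np.int32)

# View the int32 buffer as bytes (no copy), group them 4 by 4, and keep only
# the 3 low-order bytes of each value (assumes a little-endian platform).
x2 = x.view(np.uint8).reshape(-1, 4)[:, :3]

with open('output.data', 'wb') as f:
    x2.tofile(f)  # write the 3-bytes-per-value view to disk
```

Building `x2` costs no extra RAM since it is only a strided view into `x`'s memory; the actual work happens in `tofile`.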
Here is a time+memory benchmark:
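For illustration, a timing harness along these lines could be used (the array size, value range and file name are assumptions; memory was monitored externally, as noted below):

```python
import time
import numpy as np

# ~400 MB of synthetic 24-bit values stored as int32 (size chosen arbitrarily).
n = 100_000_000
x = np.random.randint(-2**23, 2**23, size=n, dtype=np.int32)

t0 = time.time()
x2 = x.view(np.uint8).reshape(-1, 4)[:, :3]   # byte view that drops every 4th byte
with open('output.data', 'wb') as f:
    x2.tofile(f)
print('written in %.1f s' % (time.time() - t0))
```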
Result: only ~1 GB of RAM was used during the whole processing (monitored with the Windows Task Manager).
@jdehesa's method gave similar results, i.e. if we use this line instead:
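For illustration, one way to build the same 3-bytes-per-value view in a single line is with NumPy's stride tricks (a sketch, not necessarily the exact line from that answer):

```python
from numpy.lib.stride_tricks import as_strided

# Same byte-dropping view, built with explicit strides: element [i, j] is
# byte j (little-endian) of the i-th int32 value of x.
x2 = as_strided(x.view(np.uint8), shape=(x.size, 3), strides=(4, 1))
```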
The RAM usage of the process also peaked at ~1 GB, and the time spent on `x2.tofile(...)` was ~37 seconds.