Efficiently remove every 4th byte from a numpy.int32 array's data bytes

527 views Asked by At

I have a big numpy.int32 array, that can take 4GB or more. It's in fact a 24-bit integer array (quite common in audio applications), but as numpy.int24 doesn't exist, I used an int32.

I want to output this array's data as 24-bit, i.e. 3 bytes per number, to a file.

  • This works (I found this "recipe" somewhere a while ago, but I can't find where anymore):

      import numpy as np
      x = np.array([[-33772,-2193],[13313,-1314],[20965,-1540],[10706,-5995],[-37719,-5871]], dtype=np.int32)
      data = ((x.reshape(x.shape + (1,)) >> np.array([0, 8, 16])) & 255).astype(np.uint8)
      print(data.tostring())
    
      # b'\x14|\xffo\xf7\xff\x014\x00\xde\xfa\xff\xe5Q\x00\xfc\xf9\xff\xd2)\x00\x95\xe8\xff\xa9l\xff\x11\xe9\xff'
    

    but the many reshape makes it inefficient when x is a few GB large: it takes a huge amount of unneeded RAM.

  • Another solution would be to remove every 4th byte:

    s = bytes([c for i, c in enumerate(x.tostring()) if i % 4 != 3])
    
    # b'\x14|\xffo\xf7\xff\x014\x00\xde\xfa\xff\xe5Q\x00\xfc\xf9\xff\xd2)\x00\x95\xe8\xff\xa9l\xff\x11\xe9\xff'
    

    It works but I suspect that if x takes, say, 4 GB of RAM, this line would eat at least 8 GB of RAM, for both s and x (and maybe also x.tostring()?)

TL;DR: How to efficiently (without using in RAM twice the size of the actual data) write to disk an int32 array, as a 24 bit array, by removing every 4th byte?

Note: it's possible since the integers are in fact 24-bit, i.e. each value is < 2^23-1 in absolute value

3

There are 3 answers

1
Basj On

After some more fiddling, I found this to be working:

import numpy as np
x = np.array([[-33772,-2193],[13313,-1314],[20965,-1540],[10706,-5995],[-37719,-5871]], dtype=np.int32)
x2 = x.view(np.uint8).reshape(-1,4)[:,:3]
print(x2.tostring())
# b'\x14|\xffo\xf7\xff\x014\x00\xde\xfa\xff\xe5Q\x00\xfc\xf9\xff\xd2)\x00\x95\xe8\xff\xa9l\xff\x11\xe9\xff'

Here is a time+memory benchmark:

import numpy as np, time
t0 = time.time()
x = np.random.randint(10000, size=(125_000_000, 2), dtype=np.int32)  # 125M * 2 * 4 bytes ~ 1GB of RAM
print('Random array generated in %.1f sec.' % (time.time() - t0))
time.sleep(5)  
# you can check the RAM usage in the task manager in the meantime...
t0 = time.time()
x2 = x.view(np.uint8).reshape(-1,4)[:,:3]
x2.tofile('test')
print('24-bit output file written in %.1f sec.' % (time.time() - t0))

Result:

Random array generated in 4.6 sec.
24-bit output file written in 35.9 sec.

Also, only ~1GB has been used during the whole processing (monitored with Windows Task Manager)


@jdehesa's method gave similar results, i.e. if we use this line instead:

x2 = np.ndarray(shape=x.shape + (3,), dtype=np.uint8, buffer=x, offset=0, strides=x.strides + (1,))

the RAM usage of the process also peaked at 1GB, and the time spent on x2.tofile(...) is ~37 sec.

4
javidcf On

Assuming x is C-contiguous and that your platform is little-endian (would need little adjustments otherwise), you can do that like this:

import numpy as np

# Input data
x = np.array([[-33772, -2193], [13313, -1314], [20965, -1540],
              [10706, -5995], [-37719, -5871]], dtype=np.int32)
# Make 24-bit uint8 view
x2 = np.ndarray(shape=x.shape + (3,), dtype=np.uint8, buffer=x, offset=0, 
                strides=x.strides + (1,))  
print(x2.tostring())
# b'\x14|\xffo\xf7\xff\x014\x00\xde\xfa\xff\xe5Q\x00\xfc\xf9\xff\xd2)\x00\x95...
np.save('data.npy', x2)  # Save to disk

In this example, note that:

  • we added a dimension: x.shape + (3,) is (5, 2, 3).
  • x2 essentially is a view of x, that is, it uses the same data.
  • The trick is in the strides. x.strides + (1,) is here (8, 4, 1). Every new row of x advances 8 bytes with respect to its previous row, and every new column advances 4 bytes. In x2, I add a 1 to the strides so every item in the new innermost dimension advances 1 byte with respect to the previous one. If the shape of x2 was (5, 2, 4) (that is, using + (4,) instead of + (3,)), it would be the same as x, but since it is (5, 2, 3), the last byte is just "skipped over".

You can restore it with:


x2 = np.load('data.npy', mmap_mode='r')  # Use mmap to avoid using extra memory
x3 = np.zeros(x2.shape[:-1] + (4,), np.uint8)
x3[..., :3] = x2
del x2  # Release mmap
# Fix negative sign in last byte (could do this in a loop
# or in "batches" if you want to avoid the intermediate
# array from the "&" operation, or with Numba)
x3[..., 3] = np.where(x3[..., 2] & 128, 255, 0)
# Make int32 view
x4 = np.ndarray(x3.shape[:-1], np.int32, buffer=x3, offset=0, strides=x3.strides[:-1])
print(x4)
# [[-33772  -2193]
#  [ 13313  -1314]
#  [ 20965  -1540]
#  [ 10706  -5995]
#  [-37719  -5871]]
0
Mark Setchell On

I ran your code and got similar times to your 35 sec, but that seems way too slow for 750MB when my SSD can do 2GB/s. I can't imagine why it is so slow. So I decided to use OpenCV's highly optimised SIMD code which reduces an RGBA8888 image to RGB888 by stripping every 4th byte of Alpha/transparency information - which is equivalent to converting 32-bit to 24-bit.

In order not to use too much extra memory, I did it in chunks of 1,000,000 stereo samples (6MB) at a time, and appended to the output file. It runs in under 1 sec and the files compare identical to that created by your code.

#!/usr/bin/env python3

import numpy as np
import cv2

def orig(x):
    x2 = x.view(np.uint8).reshape(-1,4)[:,:3]
    x2.tofile('orig.dat')

def chunked(x):
    BATCHSIZE = 1_000_000
    l = len(x)
    with open('test.dat', 'w') as file:
        for b in range(0,l,BATCHSIZE):
            s = min(BATCHSIZE,l-b)
            y = x[b:b+s,:].view(np.uint8).reshape(s*2,1,4) 
            z = cv2.cvtColor(y,cv2.COLOR_BGRA2BGR)
            # Append to file
            z.tofile(file)
            if b+s == l:
                break


# Repeatable randomness
np.random.seed(42)                                                                                         
# Create array of stereo samples
NSAMPLES = 125_000_000
x = np.random.randint(10000, size=(NSAMPLES, 2), dtype=np.int32)

# orig(x)
chunked(x)