Speeding up unpacking of an encrypted image file in Python


I am currently unpacking an encrypted file produced by the software I use, in order to extract a 2048x2048 image from it (along with other information). I can do this, but it takes about 1.7 seconds per file. Normally that would be fine, but I'm loading 40-ish images at each iteration, and my next step in this simulation is to add more iterations. I've been trying JIT tools like PyPy and Numba. The code below is just one function in a larger object, but it's where most of the time is spent.

PyPy works, but when I call my numpy functions it takes twice as long. So I tried Numba, but it doesn't seem to like unpack. I also tried using Numba within PyPy, but that doesn't work either. My code goes a bit like this:

from struct import unpack
import numpy as np

def read_file(filename: str, nx: int, ny: int) -> tuple:
    f = open(filename, "rb")
    raw = [unpack('d', f.read(8))[0] for _ in range(2*nx*ny)]  # Creates a 1D list of doubles

    real_image = np.asarray(raw[0::2]).reshape(nx, ny)       # Every other point is the real part of the image
    imaginary_image = np.asarray(raw[1::2]).reshape(nx, ny)   # Every other point +1 is the imaginary part

    return real_image, imaginary_image

In my normal Python interpreter, the raw line takes about 1.7 seconds and the rest take <0.5 seconds. If I comment out the numpy lines and just unpack in PyPy, the raw operation takes about 0.3 seconds. However, if I also perform the reshaping operations, it takes a lot longer (I know this has to do with the fact that numpy is optimized in C and the conversion costs more under PyPy).
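
To reproduce this kind of comparison, a plain wall-clock timing around the call is enough, for example (the filename here is just a placeholder):

import time

start = time.perf_counter()
real_image, imaginary_image = read_file("image_block.dat", 2048, 2048)  # placeholder filename
print(f"read_file: {time.perf_counter() - start:.2f} s")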

So I just discovered Numba and thought I'd give it a try by going back to my normal Python interpreter (CPython?). If I add the @njit or @vectorize decorator to the function, I get the following error message:

File c:\Users\MyName\Anaconda3\envs\myenv\Lib\site-packages\numba\core\dispatcher.py:468, in _DispatcherBase._compile_for_args(self, *args, **kws)
    464         msg = (f"{str(e).rstrip()} \n\nThis error may have been caused "
    465                f"by the following argument(s):\n{args_str}\n")
    466         e.patch_message(msg)
--> 468     error_rewrite(e, 'typing')
    469 except errors.UnsupportedError as e:
    470     # Something unsupported is present in the user code, add help info
    471     error_rewrite(e, 'unsupported_error')

File c:\Users\MyName\Anaconda3\envs\myenv\Lib\site-packages\numba\core\dispatcher.py:409, in _DispatcherBase._compile_for_args.<locals>.error_rewrite(e, issue_type)
    407     raise e
    408 else:
--> 409     raise e.with_traceback(None)

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Untyped global name 'unpack': Cannot determine Numba type of <class 'builtin_function_or_method'>

I may be reading this error message wrong, but it seems that Numba does not like built-in functions? I haven't looked into any of the other options like Cython. Is there some way to make Numba or PyPy work? I'm mostly interested in speeding this operation up, so I'd be very interested to know what people think is the best option. I'd also be willing to explore optimizing in C++, but I'm not aware of how to link the two.


There are 2 answers

ShadowRanger (best answer)

Issuing tons of .read(8) calls and many small unpackings dramatically increases your overhead, with little benefit. If you weren't using numpy already, I'd point you to preconstructing a struct.Struct instance and/or using .iter_unpack to cut the cost of looking up the Struct for every unpack, and to replacing the many tiny read calls with one bulk read (you need all the data in memory anyway); a rough sketch of that approach is included after the timing comparison below. But since you're already using numpy, you can have it do all the work for you much more easily:

import numpy as np

def read_file(filename: str, nx: int, ny: int) -> tuple:
    data_needed = 2*8*nx*ny
    with open(filename, "rb") as f:  # Use with statements so you don't risk leaking file handles
        raw = f.read(data_needed)  # Perform a single bulk read
    if len(raw) != data_needed:
        raise ValueError(f"{filename!r} is too small to contain a {nx}x{ny} image")
    arr = np.frombuffer(raw)  # Convert from raw buffer to a single numpy array
    real_image = arr[::2].reshape(nx, ny)  # Slice and reshape to the desired format
    imaginary_image = arr[1::2].reshape(nx, ny)
    return real_image, imaginary_image

That replaces a bunch of relatively slow Python-level manipulation with a very fast:

  1. Bulk read of all the data
  2. Bulk conversion of the data to a single numpy array (it doesn't even need to unpack element by element; it just reinterprets the buffer in place as the expected type, which defaults to float64, i.e. C doubles)
  3. Slicing and reshaping as appropriate
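
If the files might ever be written on a machine with a different byte order, it is safer to spell the dtype out explicitly rather than rely on the default; assuming the writer produced little-endian doubles (as struct's native 'd' does on x86), that would be:

arr = np.frombuffer(raw, dtype=np.dtype('<f8'))  # little-endian float64, stated explicitly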

No need for numba: on my local box, for a 2048x2048 call, your code took ~1.75 seconds and this version took ~10 milliseconds.
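
For completeness, the non-numpy variant hinted at above (a precompiled struct.Struct plus iter_unpack over one bulk read) would look roughly like the sketch below; the function name is just illustrative, and it is still slower than np.frombuffer, though far faster than one unpack call per value:

from struct import Struct

double = Struct('d')  # compiled once instead of re-parsing the 'd' format on every call

def read_file_struct(filename: str, nx: int, ny: int):
    with open(filename, "rb") as f:
        raw = f.read(2 * 8 * nx * ny)  # one bulk read instead of millions of f.read(8) calls
    flat = [v for (v,) in double.iter_unpack(raw)]  # iter_unpack yields one 1-tuple per double
    return flat[0::2], flat[1::2]  # real parts and imaginary parts, as plain Python lists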

Andrej Kesely

Here is one version using np.memmap. The read on my system (plenty of RAM and an NVMe drive) takes 0.00012 sec, but the OS may have cached the file in memory, so test it on your setup.

import numpy as np


def create_test_binary_file(filename, nx=2048, ny=2048):
    arr = np.zeros(shape=2 * nx * ny, dtype=np.float64)
    arr[::2] = 1
    arr[1::2] = 2

    with open(filename, "wb") as f_out:
        f_out.write(arr.tobytes())


def read_file(filename, nx=2048, ny=2048):
    arr = np.memmap(filename, dtype=np.float64, mode="r", shape=2 * nx * ny)

    return arr[::2].reshape(nx, ny), arr[1::2].reshape(nx, ny)


create_test_binary_file("test.dat")

r, i = read_file("test.dat")
print(r, i, sep="\n\n")

Prints:

[[1. 1. 1. ... 1. 1. 1.]
 [1. 1. 1. ... 1. 1. 1.]
 [1. 1. 1. ... 1. 1. 1.]
 ...
 [1. 1. 1. ... 1. 1. 1.]
 [1. 1. 1. ... 1. 1. 1.]
 [1. 1. 1. ... 1. 1. 1.]]

[[2. 2. 2. ... 2. 2. 2.]
 [2. 2. 2. ... 2. 2. 2.]
 [2. 2. 2. ... 2. 2. 2.]
 ...
 [2. 2. 2. ... 2. 2. 2.]
 [2. 2. 2. ... 2. 2. 2.]
 [2. 2. 2. ... 2. 2. 2.]]
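
One caveat worth adding: np.memmap (and the slices taken from it) returns array views that stay backed by the file on disk, so the file has to remain available while you use them. If you need independent in-memory copies (for example, because the file will be overwritten on the next iteration), copying the views is cheap compared to the original unpack loop:

r, i = read_file("test.dat")
r = np.array(r)  # materialize an independent in-memory copy of the memmap view
i = np.array(i)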