I have a 1D array of around 63M integer elements that represent a 4D dataset with axes x, y, channel, frame (all positive integers). The shape of the dataset is (512, 512, 4069, 239). The dataset represents an x-ray spectrum stream, where x and y are electron beam raster position indexes and each channel represents a bin of x-ray energies. The sample is scanned over multiple times and each frame represents such a (x, y, channel) dataset.
To manipulate this data easily and slice it along the various axes, I would prefer either to convert it to the full 4D dataset or to be able to create 3D slices on the fly.
The data in the 1D array is structured as follows: every element is either a large marker value (about 65000) or an integer that represents a channel index (up to 4069). Each marker in the array corresponds to a scan position (x, y): the first marker represents (0, 0), the second (1, 0), and so on. The scan grid is 512x512 pixels, so once (511, 0) is reached, the next marker corresponds to (0, 1). If a value other than the marker is encountered, it means that the channels corresponding to those values received a count at the beam position (x, y) of the preceding marker. Once position (511, 511) is passed, the next frame begins. The array is very sparse: each frame contains only about 3000 X-ray counts in total, spread over 4069 channels. So the 1D array is essentially 239 frames, each represented by 512x512 + ~3000 integers, all strung together.
To illustrate, the 1D dataset may look like
[65000, 65000, 1, 65000, 65000, 65000, 2, 3, 65000, 2, 2, 65000, 65000, ...]
This should be interpreted as
Frame 1
| Channel
x y | 1 2 3 ... 4069
_____________________________
0 0 | 0 0 0
1 0 | 1 0 0
2 0 | 0 0 0
3 0 | 0 0 0
4 0 | 0 1 1
5 0 | 0 2 0
6 0 | 0 0 0
7 0 | 0 0 0
... |
511 511|
Frame 2
...
Frame 239
...
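For reference, a straightforward element-by-element decoder of this format (the slow approach the question wants to avoid) might look like the sketch below. The marker value 65000, the 1-based channel values, and the x-advancing-fastest ordering are assumptions read off the example above:

```python
import numpy as np

MARKER = 65000  # sentinel meaning "advance to the next scan position" (assumption)

def decode_loop(stream, nx, ny, n_channels, n_frames):
    """Decode the 1D stream into a dense (nx, ny, n_channels, n_frames) array
    by walking every element once."""
    out = np.zeros((nx, ny, n_channels, n_frames), dtype=np.uint32)
    pos = -1  # linear pixel index across all frames; first marker moves it to 0
    for v in stream:
        if v == MARKER:
            pos += 1                      # next beam position
        else:
            frame, rem = divmod(pos, nx * ny)
            y, x = divmod(rem, nx)        # x advances fastest: (0,0), (1,0), ...
            out[x, y, v - 1, frame] += 1  # channel values appear 1-based in the example
    return out
```

Running this on the toy stream above reproduces the small table: one count in channel 1 at (1, 0), one each in channels 2 and 3 at (4, 0), and two in channel 2 at (5, 0).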
What I want to be able to do:
- Recreate the full 4D dataset to see if it fits in memory and can be worked with. I suspect it would require roughly 4069 times the memory of the 1D array.
- Create (x, y, channel) slices of one frame on the fly
- Create an (x, y, channel) array summed over the frames
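To put a number on the first point (the only input beyond the stated shapes is the ~3000 counts per frame figure from the question):

```python
# Back-of-the-envelope memory comparison between the dense 4D array and the
# 1D stream, using the shapes from the question.
nx, ny, n_channels, n_frames = 512, 512, 4069, 239

dense_elements = nx * ny * n_channels * n_frames  # one cell per (x, y, channel, frame)
stream_elements = n_frames * (nx * ny + 3000)     # one marker per pixel + ~3000 counts per frame

ratio = dense_elements / stream_elements
print(f"dense/stream element ratio: {ratio:.0f}")                    # ~4000, close to 4069x
print(f"dense size at 1 byte/count: {dense_elements / 1e9:.0f} GB")  # ~255 GB even as uint8
```

So the factor-4069 suspicion is about right, and the dense 4D array is far beyond typical RAM even at one byte per count.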
I know how to do this with loops but am afraid I will start running into memory or performance issues. Even if I don't for this dataset, future datasets may be even larger, so I want to do it the right way from the beginning.
My question is: is there a nice pythonic, efficient way to do this type of data transformation? Something that avoids a loop over all the elements?
I would also appreciate tips on best practice for ease of working with the data while still keeping good performance and managing memory. Would it be best to store the entire 4D dataset and perform slices and manipulations on it, or better to store only the 1D array and do frame-by-frame translation when it is needed for some calculation? Given that the data is very sparse, is there another way to store it that would use less memory than the 4D array but could still be easily viewed/accessed/manipulated?
I found a way to do it that still uses a loop but avoids looping over all the elements. I measured it to be about 5 times faster than the full element-wise loop, so I thought I would share. The key is finding the indexes where there were counts using numpy.argwhere, which I found to be much faster than looping over all elements. Finding which pixel index these counts should be mapped to can also be done in a vectorized way.
This yields a sparse matrix whose rows denote position (given the scan-grid shape and the number of frames, each row corresponds to a unique (x, y, frame)) and whose columns denote the channel. Reshaping this when necessary is trivial.
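A minimal sketch of that idea, assuming the marker value 65000 and 1-based channel values from the example (here using numpy.flatnonzero and a cumulative sum of the markers to get each count's pixel index, and scipy.sparse for the matrix):

```python
import numpy as np
from scipy import sparse

MARKER = 65000  # sentinel meaning "advance to the next scan position" (assumption)

def stream_to_sparse(stream, nx, ny, n_frames, n_channels):
    """Build a (nx*ny*n_frames, n_channels) sparse count matrix from the stream.
    Row index = frame * nx*ny + y * nx + x (x advances fastest)."""
    stream = np.asarray(stream)
    is_marker = stream == MARKER
    # Each marker advances the beam position by one, so a cumulative sum of the
    # marker mask gives the linear pixel index that every element belongs to.
    pixel_of_element = np.cumsum(is_marker) - 1
    count_idx = np.flatnonzero(~is_marker)   # locations of count entries
    rows = pixel_of_element[count_idx]       # (x, y, frame) linear index per count
    cols = stream[count_idx] - 1             # channel values appear 1-based in the example
    mat = sparse.coo_matrix(
        (np.ones(len(rows), dtype=np.uint32), (rows, cols)),
        shape=(nx * ny * n_frames, n_channels),
    ).tocsr()  # CSR conversion sums duplicate (row, col) pairs and slices rows fast
    return mat
```

A single frame f is then the row block `mat[f * nx * ny : (f + 1) * nx * ny]`, which can be densified and reshaped to (y, x, channel) on the fly, and the sum over frames is just the sum of those row blocks.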