fast downsampling of huge matrix using python (numpy memmap, pytables or other?)


As part of my data processing I produce huge non-sparse matrices on the order of 100000*100000 cells, which I want to downsample by a factor of 10 to reduce the amount of data. In this case I want to average over blocks of 10*10 pixels, reducing the size of my matrix from 100000*100000 to 10000*10000.

What is the fastest way to do this in Python? It does not matter if I need to save my original data in a new data format, because I have to downsample the same dataset multiple times.

Currently I am using numpy.memmap:

import numpy as np

data_1 = 'data_1.dat'
data_2 = 'data_2.dat'
lines = 100000
pixels = 100000
window = 10

new_lines = lines // window   # integer division so reshape gets int dimensions
new_pixels = pixels // window
dat_1 = np.memmap(data_1, dtype='float32', mode='r', shape=(lines, pixels))
dat_2 = np.memmap(data_2, dtype='float32', mode='r', shape=(lines, pixels))

dat_in = dat_1 * dat_2  # elementwise product; pulls both operands fully into memory
dat_out = dat_in.reshape([new_lines, window, new_pixels, window]).mean(3).mean(1)

But with large files this method becomes very slow, likely because the binary data in these files is stored row by row. I therefore suspect that a data format which stores my data in blocks instead of rows would be faster, but I am not sure how large the performance gain would be, or whether there are Python packages that support this.
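For reference, a minimal sketch of that block-storage idea using h5py (PyTables exposes the same HDF5 chunking); the file names, chunk size, and band width here are arbitrary choices for illustration, not tuned values:

import numpy as np
import h5py

lines, pixels, window = 100000, 100000, 10

# One-off conversion: copy the flat row-ordered binary file into an HDF5
# dataset whose chunks are aligned with the 10x10 averaging blocks, so a
# downsampling pass reads compact chunks instead of widely scattered rows.
dat = np.memmap('data_1.dat', dtype='float32', mode='r', shape=(lines, pixels))
with h5py.File('data_1.h5', 'w') as f:
    dset = f.create_dataset('data', shape=(lines, pixels), dtype='float32',
                            chunks=(window * 100, window * 100))
    band = window * 100                     # copy in chunk-aligned row bands
    for i in range(0, lines, band):
        dset[i:i + band] = dat[i:i + band]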

I have also thought about downsampling the data before creating such a huge matrix (not shown here), but my input data is fractured and irregular, so that would become very complex.

There are 2 answers

Daniel F (best answer)

Based on this answer, I think this might be a relatively fast method, depending on how much overhead reshape gives you with memmap.

def downSample(a, window):
    # Sum each block of `window` rows, then each block of `window` columns,
    # and divide by the block size to turn the block sums into block means.
    i, j = a.shape
    ir = np.arange(0, i, window)   # start index of each row block
    jr = np.arange(0, j, window)   # start index of each column block
    n = 1. / window**2
    return n * np.add.reduceat(np.add.reduceat(a, ir, axis=0), jr, axis=1)

It's hard to test the speed without your dataset.
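A hedged usage sketch (assuming the memmaps and sizes from the question, and that lines and pixels divide evenly by window): processing row bands keeps the elementwise product from ever materializing the full 100000*100000 array.

import numpy as np

lines, pixels, window = 100000, 100000, 10
dat_1 = np.memmap('data_1.dat', dtype='float32', mode='r', shape=(lines, pixels))
dat_2 = np.memmap('data_2.dat', dtype='float32', mode='r', shape=(lines, pixels))

out = np.empty((lines // window, pixels // window), dtype='float32')
band = window * 100                    # 1000 rows per band, ~400 MB per operand
for i in range(0, lines, band):
    block = dat_1[i:i + band] * dat_2[i:i + band]   # product of one band only
    out[i // window:(i + band) // window] = downSample(block, window)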

Eric

This avoids an intermediate copy: since the array is contiguous, the reshape is just a view, and passing both axes to a single mean call skips the intermediate array that .mean(3).mean(1) would allocate.

dat_in.reshape((lines // window, window, pixels // window, window)).mean(axis=(1, 3))
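A quick toy check of that claim (a sketch with made-up sizes; v.base is a confirms the reshape is a view rather than a copy):

import numpy as np

a = np.arange(40, dtype='float32').reshape(4, 10)  # toy stand-in for dat_in
v = a.reshape(2, 2, 5, 2)       # (lines//window, window, pixels//window, window)
print(v.base is a)              # True: the reshape is a view, no data copied
print(v.mean(axis=(1, 3)))      # 2x5 array of 2x2 block means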