fast downsampling of huge matrix using python (numpy memmap, pytables or other?)

Question

fast downsampling of huge matrix using python (numpy memmap, pytables or other?)

1k views Asked by Gert At 18 May 2017 at 08:40

As part of my data processing I produce huge non sparse matrices in the order of 100000*100000 cells, which I want to downsample by a factor of 10 to reduce the amount of data. In this case I want to average over blocks of 10*10 pixels, to reduce the size of my matrix from 100000*100000 to 10000*10000.

What is the fastest way to do so using python? It does not matter for me if I need to save my original data to a new dataformat, because I have to do the downsampling of the same dataset multiple times.

Currently I am using numpy.memmap:

import numpy as np

data_1 = 'data_1.dat'
date_2 = 'data_2.dat'
lines = 100000
pixels = 100000
window = 10

new_lines = lines / window
new_pixels = pixels / window
dat_1 = np.memmap(data_1, dtype='float32', mode='r', shape=(lines, pixels))
dat_2 = np.memmap(data_2, dtype='float32', mode='r', shape=(lines, pixels))

dat_in = dat_1 * dat_2
dat_out = dat_in.reshape([new_lines, window, new_pixels, window]).mean(3).mean(1)

But with with large files this method becomes very slow. Likely this has something to do with the binary data of these files, which are ordered by line. Therefore, I think that a data format which stores my data in blocks instead of lines will be faster, but I am not sure what the performance gain will be and whether there are python packages who support this.

I have also thought about downsampling of the data before creating such a huge matrix (not shown here), but my input data is fractured and irregular, so that would become very complex.

Original Q&A

There are 2 answers

Eric On 18 May 2017 at 10:05

This avoids an intermediate copy, as the reshape keeps dimensions contiguous

dat_in.reshape((lines/window, window, pixels/window, window)).mean(axis=(1,3))

**Daniel F** · Accepted Answer · 2017-05-18T10:55:16+00:00

Based on this answer, I think this might be a relatively fast method, depending on how much overhead reshape gives you with memmap.

def downSample(a, window):
     i, j = a.shape
     ir = np.arange(0, i, window)
     jr = np.arange(0, j, window)
     n = 1./(window**2)
     return n * np.add.reduceat(np.add.reduceat(a, ir), jr, axis=1)

Hard to test speed without your dataset.

TechQA.

fast downsampling of huge matrix using python (numpy memmap, pytables or other?)

There are 2 answers

Related Questions in PYTHON

Related Questions in NUMPY

Related Questions in DOWNSAMPLING

Related Questions in DATA-FORMATS

Related Questions in BIGDATA

Popular Questions

Popular Tags

Trending Questions