Problem description
I'm writing a Monte Carlo particle simulator (Brownian motion and photon emission) in Python/NumPy. I need to save the simulation output (>>10GB) to a file and process the data in a second step. Compatibility with both Windows and Linux is important.
The number of particles (n_particles) is 10-100. The number of time-steps (time_size) is ~10^9.
The simulation has 3 steps (the code below is for an all-in-RAM version):
1. Simulate (and store) an emission rate array (contains many almost-0 elements):
   - shape (n_particles x time_size), float32, size 80GB
2. Compute a counts array (random values from a Poisson process with the previously computed rates):
   - shape (n_particles x time_size), uint8, size 20GB

     counts = np.random.poisson(lam=emission).astype(np.uint8)
3. Find timestamps (or indexes) of counts. Counts are almost always 0, so the timestamp arrays will fit in RAM.

     # Loop across the particles
     timestamps = [np.nonzero(c) for c in counts]
I do step 1 once, then repeat steps 2-3 many (~100) times. In the future I may need to pre-process emission (apply cumsum or other functions) before computing counts.
Question
I have a working in-memory implementation, and I'm trying to understand the best approach to implementing an out-of-core version that can scale to (much) longer simulations.
What I would like to exist
I need to save arrays to a file, and I would like to use a single file per simulation. I also need a "simple" way to store and recall a dictionary of simulation parameters (scalars).
Ideally I would like a file-backed numpy array that I can preallocate and fill in chunks. Then, I would like the numpy array methods (max, cumsum, ...) to work transparently, requiring only a chunksize keyword to specify how much of the array to load at each iteration.
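Something like this, purely as an illustration of the desired interface (DiskArray is hypothetical, not an existing class):

# Hypothetical API, for illustration only
emission = DiskArray('simulation.h5', 'emission',
                     shape=(n_particles, time_size), dtype='float32')
emission[:, i_start:i_stop] = em_chunk            # preallocate, then fill in chunks
max_em = emission.max(chunksize=2**18)            # numpy-like method, chunked I/O
cum_em = emission.cumsum(axis=1, chunksize=2**18)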
Even better, I would like a Numexpr that operates not between cache and RAM but between RAM and hard drive.
What are the practical options
As a first option I started experimenting with pyTables, but I'm not happy with its complexity and abstractions (so different from numpy). Moreover my current solution (read below) is UGLY and not very efficient.
So the options for which I seek an answer are:
implement a numpy array with required functionality (how?)
use pytables in a smarter way (different data structures/methods)
use another library: h5py, blaze, pandas... (I haven't tried any of them so far).
Tentative solution (pyTables)
I save the simulation parameters in the '/parameters' group: each parameter is converted to a numpy array scalar. Verbose solution, but it works.
I save emission as an extensible array (EArray), because I generate the data in chunks and I need to append each new chunk (I know the final size though). Saving counts is more problematic. If I save it as a pytables array it's difficult to perform queries like "counts >= 2". Therefore I saved counts as multiple tables (one per particle) [UGLY] and I query with .get_where_list('counts >= 2'). I'm not sure this is space-efficient, and generating all these tables instead of using a single array clutters the HDF5 file significantly. Moreover, strangely enough, creating those tables requires creating a custom dtype (even for standard numpy dtypes):
dt = np.dtype([('counts', 'u1')])
# data_file, group, n_particles, chunksize and time_size are defined earlier (not shown)
for ip in xrange(n_particles):
    name = "particle_%d" % ip
    data_file.create_table(
        group, name, description=dt, chunkshape=chunksize,
        expectedrows=time_size,
        title='Binned timetrace of emitted ph (bin = t_step)'
              ' - particle_%d' % ip)
Each particle-counts "table" has a different name (name = "particle_%d" % ip), and I need to put them in a python list for easy iteration.
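For completeness, querying them then looks roughly like this (a sketch; data_file and group are the objects used above):

tables_list = [data_file.get_node(group, "particle_%d" % ip)
               for ip in xrange(n_particles)]
# One query per particle: time-bin indexes with 2 or more counts
timestamps = [t.get_where_list('counts >= 2') for t in tables_list]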
EDIT: The result of this question is a Brownian Motion simulator called PyBroMo.
PyTables Solution
Since the functionality provided by Pandas is not needed and its processing is much slower (see the notebook below), the best approach is to use PyTables or h5py directly. So far I've tried only the PyTables approach.
All tests were performed in this notebook:
Introduction to pytables data-structures
PyTables can store data in HDF5 files in two formats: arrays and tables.
Arrays
There are 3 types of arrays: Array, CArray and EArray. They all allow storing and retrieving (multidimensional) slices with a notation similar to numpy slicing.
For optimization in some use cases, a CArray is saved in "chunks", whose size can be chosen with chunkshape at creation time. Array and CArray sizes are fixed at creation time, although you can fill/write the array chunk-by-chunk after creation. Conversely, an EArray can be extended with the .append() method.
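A small sketch of the two filling patterns (the file and array names here are made up for illustration):

import numpy as np
import tables

data_file = tables.open_file('example.h5', mode='w')
chunk = np.zeros((4, 2**18), dtype=np.float32)     # placeholder data chunk

# CArray: total shape fixed at creation, filled chunk-by-chunk afterwards
carr = data_file.create_carray('/', 'fixed_size', atom=tables.Float32Atom(),
                               shape=(4, 2**20), chunkshape=(1, 2**18))
carr[:, :chunk.shape[1]] = chunk                   # write the first block of columns

# EArray: one dimension left extendable (size 0), grown with .append()
earr = data_file.create_earray('/', 'growing', atom=tables.Float32Atom(),
                               shape=(4, 0))
earr.append(chunk)                                 # extends along the second axis

data_file.close()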
Tables
The table is a quite different beast. It's basically a "table": you have only a 1D index and each element is a row. Inside each row there are the "columns" data types; each column can have a different type. If you are familiar with numpy record-arrays, a table is basically a 1D record-array, with each element having as many fields as the columns.
1D or 2D numpy arrays can be stored in tables but it's a bit more tricky: we need to create a row data type. For example, to store a 1D uint8 numpy array we need to do:
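A minimal sketch, assuming data_file is an open tables.File and counts_p0 is the 1D uint8 array to store:

import numpy as np
import tables

dt = np.dtype([('counts', 'u1')])              # one uint8 field per row
table = data_file.create_table('/', 'particle_0', description=dt)
table.append(counts_p0.view(dt))               # each array element becomes a row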
So why use tables? Because, differently from arrays, tables can be efficiently queried. For example, if we want to search for elements > 3 in a huge disk-based table we can do:
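A sketch, reusing the single-column counts table created above:

index = table.get_where_list('counts > 3')     # row indexes satisfying the condition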
Not only is this simple (compared with arrays, where we need to scan the whole file in chunks and build the index in a loop), but it is also extremely fast.
How to store simulation parameters
The best way to store simulation parameters is to use a group (i.e. /parameters), convert each scalar to a numpy array and store it as a CArray.
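A minimal sketch of this pattern (the parameter names and values are made up; data_file is an open tables.File):

params = {'t_step': 5e-8, 'n_particles': 20, 'D': 1.2e-11}   # example values

pgroup = data_file.create_group('/', 'parameters')
for name, value in params.items():
    # Wrap each scalar in a 1-element array so it can be stored as a CArray
    data_file.create_carray(pgroup, name, obj=np.atleast_1d(value))

# Reading back, e.g.: data_file.root.parameters.t_step[0]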
Array for "emission"
emission is the biggest array that is generated and read sequentially. For this usage pattern a good data structure is EArray. On "simulated" data with ~50% of zero elements, blosc compression (level=5) achieves a 2.2x compression ratio. I found that a chunk-size of 2^18 (256k) gives the minimum processing time.
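A sketch of how emission could be laid out with these settings (the exact notebook code may differ; simulate_emission_chunk is a placeholder for the actual simulation step, and the 2**18 chunk size is applied along the time axis here):

filters = tables.Filters(complib='blosc', complevel=5)
emission_arr = data_file.create_earray(
    '/', 'emission', atom=tables.Float32Atom(),
    shape=(n_particles, 0),                  # time axis is the extendable one
    chunkshape=(n_particles, 2**18),
    filters=filters, expectedrows=time_size)

for i_chunk in range(time_size // 2**18):
    em_chunk = simulate_emission_chunk()     # placeholder, shape (n_particles, 2**18)
    emission_arr.append(em_chunk)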
Storing "counts"
Storing also "counts" will increase the file size by 10% and will take 40% more time to compute timestamps. Having counts stored is not an advantage per se, because only the timestamps are needed in the end.
The advantage is that reconstructing the index (timestamps) is simpler, because we query the full time axis in a single command (.get_where_list('counts >= 1')). Conversely, with chunked processing, we need to perform some index arithmetic that is a bit tricky and maybe a burden to maintain. However, the code complexity may be small compared to all the other operations (sorting and merging) that are needed in both cases.
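For reference, the chunked alternative involves offsetting the per-chunk indexes, roughly like this (a sketch; emission_arr, chunk_size and the particle index ip are assumed from the context above):

timestamps_ip = []                               # timestamps for particle ip
for i_start in range(0, time_size, chunk_size):
    lam = emission_arr[ip, i_start:i_start + chunk_size]
    counts_chunk = np.random.poisson(lam=lam)
    # np.nonzero gives indexes local to the chunk: shift by the chunk offset
    timestamps_ip.append(np.nonzero(counts_chunk >= 1)[0] + i_start)
timestamps_ip = np.hstack(timestamps_ip)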
Storing "timestamps"
Timestamps can be accumulated in RAM. However, we don't know the arrays' sizes before starting, and a final hstack() call is needed to "merge" the different chunks stored in a list. This doubles the memory requirements, so the RAM may be insufficient.
We can store the timestamps as-we-go to a table using .append(). At the end we can load the table in memory with .read(). This is only 10% slower than the all-in-memory computation but avoids the "double-RAM" requirement. Moreover, we can avoid the final full-load entirely and have minimal RAM usage.
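A sketch of the store-as-we-go pattern for one particle (compute_timestamps_chunk stands in for the chunked computation sketched earlier):

ts_dt = np.dtype([('timestamp', 'i8')])
ts_table = data_file.create_table('/', 'timestamps_p0', description=ts_dt)

for i_start in range(0, time_size, chunk_size):
    ts_chunk = compute_timestamps_chunk(i_start)        # placeholder, int64 indexes
    ts_table.append(ts_chunk.astype('i8').view(ts_dt))  # flushed to disk as we go

timestamps_p0 = ts_table.read()['timestamp']            # optional final full load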
H5Py
H5py is a much simpler library than pytables. For this use case of (mainly) sequential processing, it seems a better fit than pytables. The only missing feature is the lack of 'blosc' compression. Whether this results in a big performance penalty remains to be tested.
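For comparison, a rough h5py sketch of the same sequential write (gzip is used here since plain h5py does not ship blosc; simulate_emission_chunk is a placeholder):

import h5py

step = 2**18
with h5py.File('simulation_h5py.h5', 'w') as f:
    em = f.create_dataset('emission', shape=(n_particles, time_size),
                          dtype='float32', chunks=(n_particles, step),
                          compression='gzip')
    for i_start in range(0, time_size, step):
        i_stop = min(i_start + step, time_size)
        em[:, i_start:i_stop] = simulate_emission_chunk(i_stop - i_start)  # placeholder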