Transparently reading a small subset of a huge matrix-like structure from disk


A simplified version of the question

I have a huge matrix-like dataset that, for now, we can pretend is an n-by-n matrix stored on disk as n^2 IEEE-754 doubles (see the more honest description below for how this is a simplification; it probably matters). The file is on the order of a gigabyte, but a certain (pure) function needs only on the order of n of its elements. Exactly which elements are needed is complicated, and not something like a simple slice.

What are my options for decoupling the reading of the file from the computation? Most of all, I'd like to treat the on-disk data as if it were in memory (I am of course ready to swear to all the gods of referential transparency that the data on disk will not change). I've looked at mmap and friends, but some cursory testing suggests that they do not free memory aggressively enough.

Do I have to couple my computations to IO if I need such fine-grained control over how much of the file is kept in memory?


A more honest description of the on-disk data

The data on disk isn't actually as simple as described. Something closer to the truth is the following: a file begins with a 32-bit integer n. The following then occurs precisely n times: a 32-bit integer m_i > 0 (1 ≤ i ≤ n), followed by exactly m_i IEEE-754 doubles x_(i,1),…,x_(i,m_i). (So this is a jagged two-dimensional array.)
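To make the layout concrete, here is a minimal sketch of scanning only the m_i headers and recording where each row's doubles start. The byte order (little-endian) and the treatment of the 32-bit integers as unsigned are assumptions, since neither is fixed by the description above; `leWord`, `RowIndex` and `readRowIndex` are hypothetical names used only for illustration.

```haskell
import qualified Data.ByteString as BS
import Data.Bits (shiftL, (.|.))
import System.IO

-- Decode a little-endian unsigned integer from raw bytes.
-- ASSUMPTION: the file is little-endian; the format description does not say.
leWord :: BS.ByteString -> Integer
leWord = BS.foldr (\b acc -> (acc `shiftL` 8) .|. fromIntegral b) 0

-- For each row i (in order): (byte offset of x_(i,1), m_i).
type RowIndex = [(Integer, Int)]

-- Walk the file header by header, seeking over the doubles instead of
-- decoding them. Only the 4-byte counts are interpreted.
readRowIndex :: Handle -> IO RowIndex
readRowIndex h = do
  hSeek h AbsoluteSeek 0
  n <- fromIntegral . leWord <$> BS.hGet h 4
  go n 4
  where
    go :: Int -> Integer -> IO RowIndex
    go 0 _ = pure []
    go k pos = do
      hSeek h AbsoluteSeek pos
      m <- fromIntegral . leWord <$> BS.hGet h 4
      let start = pos + 4                          -- first double of this row
      rest <- go (k - 1) (start + 8 * fromIntegral m)  -- skip m doubles of 8 bytes
      pure ((start, m) : rest)
```

With such an index, the offset of x_(i,j) is the row's start offset plus 8·(j−1), so a single element can be fetched with one seek and an 8-byte read.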

In practice, determining the i and j for which x_(i,j) is needed depends heavily on the m_i's. When approaching the problem with mmap, the need to read so many of these m_i's seems to essentially load the entire file into memory. The problem is that it all seems to stay there, and I worry that I will have to pull my computation into IO to get more fine-grained control over releasing this memory.

Moreover, "the data structure" actually consists of a large number of these files parameterized by their file names. Together they amount to about a gigabyte.


An attempt at a more handwaving, but possibly easier to understand version of the question

Say I have some data on disk consisting of n^2 elements. A pure Haskell function needs on the order of n of the elements, but which of them depends in a complicated way on the values. I do not want to load the entire file into memory, because it is huge. One solution is to throw my function into the IO monad and read out elements as they are needed, but I call this "giving up". mmap lets us treat on-disk data as if it were in memory, essentially doing lazy IO with help from the OS's virtual memory system. This is nice, but since determining which elements of the data are needed requires accessing a lot of the file, mmap seems to keep way too much of the file in memory. In practice, I find that with mmap, merely reading the data that determines which elements I actually need loads the entire file into memory.

What options do I have?

There is 1 answer

sclv

I would suggest that you write an interface that is entirely in IO, built around an abstract type that contains both a Handle and information about the overall structure of your data (perhaps all the m_i's, if you can fit them). This is complemented by IO operations that read out precise bits of the data by seeking in the handle.

I would then simply wrap this interface in a bunch of unsafePerformIO calls! This is effectively what mmap does behind the scenes, in a sense. You are just doing so in a more explicitly managed way.

Assuming you aren't worried about anyone "swapping out" the file behind your back, you get an interface that you can reason about purely, while it actually does IO where necessary to give you the explicit control over memory that you need.
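A rough sketch of what that could look like, under several assumptions: the file is little-endian, the (offset, m_i) pairs for each row have already been collected by some earlier scan of the headers, and the file never changes while the handle is open. All names here (`Jagged`, `openJagged`, `readElemIO`, `elemAt`) are made up for illustration, not an existing library API.

```haskell
import qualified Data.ByteString as BS
import Control.Concurrent.MVar (MVar, newMVar, withMVar)
import Data.Array (Array, bounds, listArray, (!))
import Data.Bits (shiftL, (.|.))
import Data.Ix (inRange)
import GHC.Float (castWord64ToDouble)
import System.IO
import System.IO.Unsafe (unsafePerformIO)

-- Hypothetical abstract type: the open handle plus, for each row i,
-- the byte offset of x_(i,1) and the row length m_i.
data Jagged = Jagged (MVar Handle) (Array Int (Integer, Int))

-- ASSUMPTION: the caller supplies the row index, built beforehand.
openJagged :: FilePath -> [(Integer, Int)] -> IO Jagged
openJagged path rows = do
  h  <- openBinaryFile path ReadMode
  hv <- newMVar h
  pure (Jagged hv (listArray (1, length rows) rows))

-- ASSUMPTION: doubles are stored little-endian.
decodeDouble :: BS.ByteString -> Double
decodeDouble =
  castWord64ToDouble . BS.foldr (\b acc -> (acc `shiftL` 8) .|. fromIntegral b) 0

-- The honest IO operation: one seek and one 8-byte read per element.
-- The MVar keeps each seek/read pair atomic if several threads force values.
readElemIO :: Jagged -> Int -> Int -> IO (Maybe Double)
readElemIO (Jagged hv rows) i j
  | not (inRange (bounds rows) i) = pure Nothing
  | j < 1 || j > m                = pure Nothing
  | otherwise = withMVar hv $ \h -> do
      hSeek h AbsoluteSeek (off + 8 * fromIntegral (j - 1))
      bs <- BS.hGet h 8
      pure (if BS.length bs == 8 then Just (decodeDouble bs) else Nothing)
  where
    (off, m) = rows ! i   -- lazy: only forced once i is known to be in range

-- The pure-looking wrapper. This is only defensible because the file is
-- promised never to change and the handle is never closed behind our back.
{-# NOINLINE elemAt #-}
elemAt :: Jagged -> Int -> Int -> Maybe Double
elemAt jag i j = unsafePerformIO (readElemIO jag i j)
```

Pure code can then call `elemAt jag i j` wherever it needs x_(i,j). Nothing is held in memory beyond the row index and the handle's own buffer, at the cost of a system call per element, so batching reads per row may be worth considering if access patterns allow it.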