Index million-row square matrix for fast access


I have some very large matrices (on the order of a million rows) that I cannot keep in memory, and I need to access subsamples of these matrices in decent time (less than a minute). I started looking at HDF5 and blaze in combination with numpy and pandas, but I found this a bit complicated, and I am not sure it is the best solution.

Are there other solutions?

thanks

EDIT

Here are some more details about the kind of data I am dealing with.

  • The matrices are usually sparse (less than 10%, or less than 25%, of cells are non-zero)
  • The matrices are symmetric

And what I would need to do is:

  • Access for reading only
  • Extract rectangular sub-matrices (mostly along the diagonal, but also outside)

There are 2 answers

0
cromod

Did you try PyTables? It can be very useful for very large matrices. Take a look at this SO post.
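As a rough illustration of the PyTables route, here is a minimal sketch, assuming the matrix is stored as a compressed, chunked `CArray` and filled in strips of rows so the whole thing never has to sit in memory at write time. The helper names (`write_carray`, `read_submatrix`), the node name `matrix`, and the chunk and compression settings are assumptions made for the example, not tuned values.

```python
import os
import tempfile

import numpy as np
import tables


def write_carray(path, dense, block_rows=100, chunkshape=(64, 64)):
    """Store a 2-D array as a zlib-compressed, chunked CArray.

    `dense` stands in for whatever produces the rows; in real use each
    strip would be computed or loaded separately instead of sliced from
    an in-memory array.
    """
    n_rows, n_cols = dense.shape
    filters = tables.Filters(complevel=5, complib="zlib")
    with tables.open_file(path, mode="w") as h5:
        carr = h5.create_carray(
            h5.root,
            "matrix",  # hypothetical node name
            atom=tables.Float32Atom(),
            shape=(n_rows, n_cols),
            chunkshape=chunkshape,
            filters=filters,
        )
        # Write one strip of rows at a time.
        for start in range(0, n_rows, block_rows):
            stop = min(start + block_rows, n_rows)
            carr[start:stop, :] = dense[start:stop, :]


def read_submatrix(path, r0, r1, c0, c1):
    """Read only the rectangle [r0:r1, c0:c1] from disk as a NumPy array."""
    with tables.open_file(path, mode="r") as h5:
        return h5.root.matrix[r0:r1, c0:c1]


if __name__ == "__main__":
    demo = np.random.default_rng(0).random((500, 500)).astype(np.float32)
    demo_path = os.path.join(tempfile.mkdtemp(), "demo_pytables.h5")
    write_carray(demo_path, demo)
    print(read_submatrix(demo_path, 10, 60, 300, 380).shape)
```

Slicing the node only touches the chunks that overlap the requested rectangle, so the read cost should scale with the size of the sub-matrix rather than with the full matrix.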

0
Eelco Hoogendoorn

Your question is lacking a bit in context, but HDF5 compressed block storage is probably as efficient as a sparse storage format for the relatively dense matrices you describe. In memory, you can always convert your views to sparse matrices if it pays off. That seems like an effective and simple solution, and as far as I know there are no sparse matrix formats that can easily be read partially from disk.
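To make this concrete, here is a minimal sketch using h5py and SciPy, assuming a chunked, gzip-compressed dataset and a conversion of each block to CSR after reading. The dataset name `matrix`, the helper names, and the chunk size are assumptions for illustration. The demo matrix is small and built in memory only so the example is self-contained; a real million-row matrix would have to be written block by block.

```python
import os
import tempfile

import h5py
import numpy as np
from scipy import sparse


def write_demo_matrix(path, n=1000, density=0.05, chunk=128, seed=0):
    """Write a small symmetric, mostly-zero n x n matrix to chunked HDF5.

    Returns the dense array so that reads can be checked against it.
    """
    rng = np.random.default_rng(seed)
    dense = np.zeros((n, n), dtype=np.float32)
    mask = rng.random((n, n)) < density
    dense[mask] = rng.random(int(mask.sum()))
    # Symmetrize: keep the upper triangle and mirror it below the diagonal.
    upper = np.triu(dense)
    dense = upper + upper.T - np.diag(np.diag(upper))
    with h5py.File(path, "w") as f:
        # Square chunks suit rectangular reads both on and off the diagonal.
        f.create_dataset(
            "matrix",  # hypothetical dataset name
            data=dense,
            chunks=(chunk, chunk),
            compression="gzip",
        )
    return dense


def read_block(path, rows, cols):
    """Read rows[0]:rows[1], cols[0]:cols[1] from disk and return it as CSR."""
    with h5py.File(path, "r") as f:
        block = f["matrix"][rows[0]:rows[1], cols[0]:cols[1]]
    return sparse.csr_matrix(block)


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "demo_h5py.h5")
    write_demo_matrix(demo_path)
    sub = read_block(demo_path, (100, 400), (600, 900))
    print(sub.shape, sub.nnz)
```

Because the compression is applied per chunk, runs of zeros should take little space on disk, and a read only decompresses the chunks that overlap the requested rectangle.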