I was expecting scipy's sparse matrices to use a lot less memory than the na(t)ive list of lists representation, but experiments have proven me wrong. In the snippet below, I'm building a random binary matrix with around 75% zeroes, and then comparing the memory usage with each available representation in scipy.sparse (simplified IPython session):
# %load_ext memory_profiler, from scipy import sparse, etc.
# ...
%memit M = random_binary_matrix(5000, 5000) # contains ints
peak memory: 250.36 MiB, increment: 191.77 MiB
In : sum(line.count(0) for line in M) / (len(M) * len(M[0]))
Out: 0.75004468
%memit X_1 = sparse.bsr_matrix(M)
peak memory: 640.49 MiB, increment: 353.76 MiB
%memit X_2 = sparse.coo_matrix(M)
peak memory: 640.71 MiB, increment: 286.09 MiB
%memit X_3 = sparse.csc_matrix(M)
peak memory: 807.51 MiB, increment: 357.53 MiB
%memit X_4 = sparse.csr_matrix(M)
peak memory: 840.04 MiB, increment: 270.91 MiB
%memit X_5 = sparse.dia_matrix(M)
peak memory: 1075.20 MiB, increment: 386.87 MiB
%memit X_6 = sparse.dok_matrix(M)
peak memory: 3059.86 MiB, increment: 1990.62 MiB
%memit X_7 = sparse.lil_matrix(M)
peak memory: 2774.67 MiB, increment: 385.39 MiB
Am I doing something wrong? Am I missing something (including the point of these alternative representations)?
... or is memory_profiler, or my lack of comprehension thereof, to blame? In particular, the relationship between "peak memory" and "increment" seems dubious at times: initialising X_2 supposedly increments memory usage by 286.09 MiB, yet the peak memory usage is barely above what it was prior to executing that line.
If it matters: I'm running Debian 12, Python 3.11.2, IPython 8.5.0, scipy 1.10.1, memory_profiler 0.61.0
Creating a sparse matrix from a dense matrix results in a fully populated sparse matrix (including explicit zeros). This is because scipy doesn't know the tolerance that you want to use for excluding values close to zero. Arguably, it should come with a factory method that does this for you, but we can always build one manually.
The sparse matrix classes accept many different combinations of constructor arguments. It seems that you only want to exclude entries that are exactly zero. For this, something like this should work fine:
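A minimal sketch of that idea, assuming the input is a 2-D list of lists or array of numbers: find the coordinates of the nonzero entries with NumPy and pass only those to the `(data, (row, col))` form of the constructor. The helper name `sparse_from_nonzeros` is made up for this example and is not part of scipy.

```python
import numpy as np
from scipy import sparse


def sparse_from_nonzeros(dense):
    """Build a CSR matrix storing only the entries of `dense` that are not exactly 0.

    Hypothetical helper for illustration; not a scipy function.
    """
    dense = np.asarray(dense)
    rows, cols = np.nonzero(dense)   # coordinates of the nonzero entries
    values = dense[rows, cols]       # the corresponding values
    return sparse.csr_matrix((values, (rows, cols)), shape=dense.shape)


M = [[0, 1, 0],
     [0, 0, 0],
     [1, 0, 1]]
X = sparse_from_nonzeros(M)
print(X.nnz)  # 3 stored entries
```

Note that `np.asarray` still builds a dense array from the list of lists first, so that temporary shows up in the peak memory.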
With other tolerances you could use this:
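Again only a sketch under the same assumptions: the same construction, but keeping an entry only if its absolute value exceeds a threshold. The helper name `sparse_with_tolerance` and the parameter `tol` are invented for this example.

```python
import numpy as np
from scipy import sparse


def sparse_with_tolerance(dense, tol):
    """Build a CSR matrix keeping only entries with |value| > tol.

    Hypothetical helper for illustration; not a scipy function.
    """
    dense = np.asarray(dense)
    mask = np.abs(dense) > tol       # True where the entry is worth storing
    rows, cols = np.nonzero(mask)
    return sparse.csr_matrix((dense[rows, cols], (rows, cols)), shape=dense.shape)


A = np.array([[1e-12, 2.0],
              [0.0, -3.0]])
Y = sparse_with_tolerance(A, tol=1e-9)
print(Y.nnz)  # 2 stored entries: 2.0 and -3.0
```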
You can also create a fully populated sparse matrix first and then call eliminate_zeros() on it, but this will result in temporarily higher memory usage.
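A small sketch of that last route. The explicitly stored zeros are created on purpose here through the `(data, (row, col))` constructor, and `nnz` shows how many entries are stored before and after. The helper `csr_nbytes` is made up for this example; it adds up the three arrays a CSR matrix holds, which gives a more direct figure for the matrix's own footprint than a process-level profiler.

```python
import numpy as np
from scipy import sparse

# Four stored entries, two of which are explicit zeros.
data = np.array([1, 0, 0, 1])
rows = np.array([0, 0, 1, 1])
cols = np.array([0, 1, 0, 1])
Z = sparse.csr_matrix((data, (rows, cols)), shape=(2, 2))


def csr_nbytes(m):
    """Bytes used by the arrays backing a CSR matrix (hypothetical helper)."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes


nnz_before = Z.nnz     # 4: nnz counts stored entries, including explicit zeros
Z.eliminate_zeros()    # drops the stored zeros in place
nnz_after = Z.nnz      # 2

print(nnz_before, nnz_after, csr_nbytes(Z))
```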