Applying scipy.sparse.linalg.svds throws a Memory Error?

I am trying to decompose a sparse matrix (40,000 × 1,400,000) with scipy.sparse.linalg.svds on a 64-bit machine with 140 GB of RAM, as follows:

import scipy.sparse.linalg

k = 5000
tfidf_mtx = tfidf_m.tocsr()
u_45,s_45,vT_45 = scipy.sparse.linalg.svds(tfidf_mtx, k=k)

It works when k ranges from 1000 to 4500, but with k = 5000 it throws a MemoryError. The full traceback is given below:

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-6-31a69ce54e2c> in <module>()
      4 k = 4000
      5 tfidf_mtx = tfidf_m.tocsr()
----> 6 get_ipython().magic(u'time u_50,s_50,vT_50 =linalg.svds(tfidf_mtx, k=k))
      7 # print len(s),s
      8 

/usr/lib/python2.7/dist-packages/IPython/core/interactiveshell.pyc in magic(self, arg_s)
   2163         magic_name, _, magic_arg_s = arg_s.partition(' ')
   2164         magic_name = magic_name.lstrip(prefilter.ESC_MAGIC)
-> 2165         return self.run_line_magic(magic_name, magic_arg_s)
   2166 
   2167     #-------------------------------------------------------------------------

/usr/lib/python2.7/dist-packages/IPython/core/interactiveshell.pyc in run_line_magic(self, magic_name, line)
   2084                 kwargs['local_ns'] = sys._getframe(stack_depth).f_locals
   2085             with self.builtin_trap:
-> 2086                 result = fn(*args,**kwargs)
   2087             return result
   2088 

/usr/lib/python2.7/dist-packages/IPython/core/magics/execution.pyc in time(self, line, cell, local_ns)

/usr/lib/python2.7/dist-packages/IPython/core/magic.pyc in <lambda>(f, *a, **k)
    189     # but it's overkill for just that one bit of state.
    190     def magic_deco(arg):
--> 191         call = lambda f, *a, **k: f(*a, **k)
    192 
    193         if callable(arg):

/usr/lib/python2.7/dist-packages/IPython/core/magics/execution.pyc in time(self, line, cell, local_ns)
   1043         else:
   1044             st = clock2()
-> 1045             exec code in glob, local_ns
   1046             end = clock2()
   1047             out = None

<timed exec> in <module>()

/usr/local/lib/python2.7/dist-packages/scipy/sparse/linalg/eigen/arpack/arpack.pyc in svds(A, k, ncv, tol, which, v0, maxiter, return_singular_vectors)
   1751         else:
   1752             ularge = eigvec[:, above_cutoff]
-> 1753             vhlarge = _herm(X_matmat(ularge) / slarge)
   1754 
   1755         u = _augmented_orthonormal_cols(ularge, nsmall)

/usr/local/lib/python2.7/dist-packages/scipy/sparse/base.pyc in dot(self, other)
    244 
    245         """
--> 246         return self * other
    247 
    248     def __eq__(self, other):

/usr/local/lib/python2.7/dist-packages/scipy/sparse/base.pyc in __mul__(self, other)
    298                 return self._mul_vector(other.ravel()).reshape(M, 1)
    299             elif other.ndim == 2 and other.shape[0] == N:
--> 300                 return self._mul_multivector(other)
    301 
    302         if isscalarlike(other):

/usr/local/lib/python2.7/dist-packages/scipy/sparse/compressed.pyc in _mul_multivector(self, other)
    463 
    464         result = np.zeros((M,n_vecs), dtype=upcast_char(self.dtype.char,
--> 465                                                         other.dtype.char))
    466 
    467         # csr_matvecs or csc_matvecs

MemoryError: 

When k is 3000 and 4500, the ratio of the sum of the squared singular values to the sum of the squares of all matrix entries is 0.7033 and 0.8230, respectively. I have searched the web for a long time without success. Please help, or suggest some ideas for how to achieve this.
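
For reference, the ratio quoted above can be sketched as follows. This is only an illustration on a toy matrix: the function name `energy_ratio` is hypothetical, and with the real data the singular values would be the `s` returned by svds, while the sum of squared entries could be obtained with something like `tfidf_mtx.multiply(tfidf_mtx).sum()`.

```python
def energy_ratio(singular_values, squared_entries_sum):
    """Sum of squared singular values divided by the sum of squared matrix entries."""
    return sum(s * s for s in singular_values) / squared_entries_sum

# Toy check: A = diag(3, 4) has singular values 4 and 3,
# and the sum of its squared entries is 3**2 + 4**2 = 25.
total = 3.0 ** 2 + 4.0 ** 2

print(energy_ratio([4.0], total))       # keeping only the largest singular value
print(energy_ratio([4.0, 3.0], total))  # keeping both singular values
```

A ratio close to 1 means the truncated decomposition captures almost all of the matrix's squared Frobenius norm.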

Answer by hpaulj (best answer):

So the result being allocated is a dense (M, k) array. On an ordinary, older machine:

In [368]: np.ones((40000,1000))
....
In [369]: np.ones((40000,4000))
...
In [370]: np.ones((40000,5000))
 ...
--> 190     a = empty(shape, dtype, order)
    191     multiarray.copyto(a, 1, casting='unsafe')
    192     return a
MemoryError: 

Now it may just be a coincidence that I hit the memory error at the same size as your code. But if you make the problem big enough, you will hit a memory error at some point.

Your stacktrace shows the error occurs while multiplying a sparse matrix and a dense 2d array (other), and the result will be dense as well.
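
To get a feel for the sizes involved, here is a rough back-of-the-envelope sketch. It only does the arithmetic for dense arrays and assumes 8-byte float64 elements; the helper name `dense_nbytes` is made up for illustration, and how many temporary copies svds actually creates is not something the traceback shows.

```python
def dense_nbytes(rows, cols, itemsize=8):
    """Bytes needed for a dense rows x cols array (itemsize=8 assumes float64)."""
    return rows * cols * itemsize

M, N = 40000, 1400000  # matrix shape from the question

for k in (1000, 4500, 5000):
    short_side = dense_nbytes(M, k)  # a dense (M, k) array
    long_side = dense_nbytes(N, k)   # a dense (N, k) array
    print("k=%d: (M, k) needs %.1f GB, (N, k) needs %.1f GB"
          % (k, short_side / 1e9, long_side / 1e9))
```

For k = 5000, a single dense (1,400,000, k) array of float64 values already takes 56 GB, so a couple of temporaries of that shape could plausibly exhaust 140 GB.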