How much matrix size the function Spectral clustering of Scikit learn can handle?

3.3k views Asked by At

I am using function 'Spectral clustering' of scikit learn. I am able to perform clustering for 8100 by 8100 matrix but this function throws error for 10000 by 10000 matrix.

Has anyone used this function for large matrix?

edit: I got following error message:

    Not enough memory to perform factorization.
    Traceback (most recent call last):
    File "combined_code_img.py", line 287, in <module>
    labels=spectral.fit_predict(Affinity)
    File "/root/anaconda/lib/python2.7/site-packages/sklearn/base.py",        
    line 410, in fit_predict
    self.fit(X)
    File "/root/anaconda/lib/python2.7/site-packages/sklearn/cluster/spectral.py", line 463, in fit
    assign_labels=self.assign_labels)
    File "/root/anaconda/lib/python2.7/site-packages/sklearn/cluster/spectral.py", line 258, in spectral_clustering
    eigen_tol=eigen_tol, drop_first=False)
    File "/root/anaconda/lib/python2.7/site-packages/sklearn/manifold/spectral_embedding_.py", line 265, in spectral_embedding
tol=eigen_tol, v0=v0)
    File "/root/anaconda/lib/python2.7/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 1560, in eigsh
symmetric=True, tol=tol)
    File "/root/anaconda/lib/python2.7/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 1046, in get_OPinv_matvec
    return SpLuInv(A.tocsc()).matvec
    File "/root/anaconda/lib/python2.7/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 907, in __init__
    self.M_lu = splu(M)
    File "/root/anaconda/lib/python2.7/site-packages/scipy/sparse/linalg/dsolve/linsolve.py", line 261, in splu
ilu=False, options=_options)
    MemoryError

My machine is having 16 GB ram.

1

There are 1 answers

1
rth On

Spectral clustering algorithm has ~ O(n³) time complexity, and a fairly bad space complexity, since you are running out for memory with 16 GB RAM to process a ~0.8 GB dataset (10000x10000 array, assuming 64bit floats). It is therefore not suitable for large datasets.

Instead you should use a clustering algorithm that scales better. See the below benchmarks taken from the HDBSCAN documentation, clustering scaling

For instance, MiniBatchKMeans, DBSCAN from scikit-learn or HDBSCAN would scale better.