Spectral clustering with sklearn and a big affinity matrix


I am trying to use the spectral clustering method provided by scikit-learn to cluster the rows of my dataset (only 16000 of them). My issue arises after I precompute the affinity matrix (a 16000x16000 float matrix), which takes roughly 3 GB (I have at most 8 GB available): calling the method on that matrix with the arpack solver requires far more memory than that, and the CPU spends so much time swapping pages in and out that the computation slows to a crawl. I have also tried calling the garbage collector before the method, without success:

import gc
gc.collect()  # force a collection pass before calling the clustering method

How can I get a precise picture of what is going on? Is this a known issue? Are there any alternatives in Python to perform spectral clustering out of the box?

I can post a minimal working example if needed.

UPDATE: I was wrong about the lobpcg solver: at first it seems to use all my memory, but then it stabilizes around 5 GB and the process continues. However, another issue arises. The computation seems to run into numerical inaccuracies that eventually produce NaNs, or an error like the one below (the exact error changes as the affinity matrix changes slightly, see below):

File "/usr/local/lib/python3.4/dist-packages/sklearn/cluster/spectral.py", line 255, in spectral_clustering
eigen_tol=eigen_tol, drop_first=False)
File "/usr/local/lib/python3.4/dist-packages/sklearn/manifold/spectral_embedding_.py", line 303, in spectral_embedding
largest=False, maxiter=2000)
File "/usr/local/lib/python3.4/dist-packages/scipy/sparse/linalg/eigen/lobpcg/lobpcg.py", line 449, in lobpcg
assert np.allclose(gramA.T, gramA)
AssertionError
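
Since the failing assertion is a symmetry check on gramA, one mitigation I am considering (purely a guess on my part; it may not help if the asymmetry comes from round-off inside lobpcg itself) is forcing the input matrix to be exactly symmetric before the call:

# rule out floating-point asymmetry in the input matrix
similarity_matrix = 0.5 * (similarity_matrix + similarity_matrix.T)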

My affinity metric is a Gaussian kernel, exp(-||x_1 - x_2||^2 / (2 * sigma^2)), and the results vary a lot when I try different values for sigma. I can understand that some inaccuracies may arise when many entries are close to zero, but that is not the case when I use large values for sigma (= 5.0), which pushes the entries closer to 1.0 instead.
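
To illustrate the scale of the kernel entries (the squared distances below are made up, just for the example):

import numpy as np

sq_dists = np.array([0.5, 2.0, 10.0, 50.0])  # hypothetical squared distances
for sigma in (0.5, 1.0, 5.0):
    print(sigma, np.exp(-sq_dists / (2 * sigma ** 2)))
# small sigma pushes entries towards 0; sigma = 5.0 keeps most of them near 1.0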

Now I suspect that I am missing something, or doing something wrong in the process, so I am updating this question with a new aim.

For reference, this is how I compute the affinity matrix:

import numpy as np
from scipy.spatial.distance import pdist, squareform

# pdist returns condensed squared Euclidean distances; squareform expands
# them into the full symmetric 16000x16000 matrix
pairwise_dists = squareform(pdist(data_slice, 'sqeuclidean'))
similarity_matrix = np.exp(-pairwise_dists / (2 * self._sigma ** 2))

where data_slice is a numpy array whose rows are my instances, and self._sigma holds the Gaussian sigma parameter.
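
For completeness, a sketch of how the matrix is then handed to scikit-learn, assuming the precomputed matrix goes in via affinity='precomputed' (the n_clusters value here is a placeholder):

from sklearn.cluster import SpectralClustering

sc = SpectralClustering(n_clusters=8, affinity='precomputed',
                        eigen_solver='arpack')
labels = sc.fit_predict(similarity_matrix)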

1 Answer

Answer by Has QUIT--Anony-Mousse:

Spectral clustering computes eigenvectors of the affinity matrix (more precisely, of a graph Laplacian derived from it).

This matrix has size O(n^2), and thus pretty much any implementation will need O(n^2) memory.

16000x16000x4 bytes (assuming float32 storage and no overhead) is about 1 GB. It probably needs a working copy (operations such as np.exp return a new array, and the matrix may well be stored in double precision), plus some overhead, which is why you end up using 3 GB...
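
A quick back-of-the-envelope check of that arithmetic (pure Python, no library assumptions):

n = 16000
print(n * n * 4 / 1024**3)  # float32: ~0.95 GB for the bare matrix
print(n * n * 8 / 1024**3)  # float64: ~1.91 GB; one working copy plus
                            # overhead makes 3 GB entirely plausible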

This algorithm is simply not applicable to large data, just like any other algorithm requiring O(n^2) memory. Choose a different algorithm, perhaps one that can be accelerated using index structures; or reduce your data set size, for example by sampling.
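
A minimal sketch of the sampling idea (the function and parameter names here are mine, and the label-propagation step is only one of several reasonable choices): cluster a random subsample with the precomputed-affinity API, then give every remaining row the label of its nearest sampled row.

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import SpectralClustering
from sklearn.neighbors import NearestNeighbors

def cluster_by_sampling(data, n_clusters, sample_size, sigma, seed=0):
    # cluster a random subsample, then propagate labels to the rest
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(data), size=sample_size, replace=False)
    sample = data[idx]

    # affinity only on the sample: sample_size^2 memory instead of n^2
    affinity = np.exp(-squareform(pdist(sample, 'sqeuclidean'))
                      / (2 * sigma ** 2))
    sample_labels = SpectralClustering(
        n_clusters=n_clusters, affinity='precomputed').fit_predict(affinity)

    # each remaining row inherits the label of its nearest sampled row
    nearest = (NearestNeighbors(n_neighbors=1).fit(sample)
               .kneighbors(data, return_distance=False).ravel())
    return sample_labels[nearest]

With sample_size around 2000, the affinity matrix shrinks from 16000^2 to 2000^2 entries, a factor of 64 less memory.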