Non-negative matrix factorization of sparse input


Could anyone recommend a set of tools for applying standard NMF to sparse input data (a matrix of size 50k x 50k)? Thanks!

Fred Foo

scikit-learn has an implementation of NMF for sparse matrices. You will need the bleeding-edge version from GitHub, though, since all released versions (up to and including 0.14) had a scalability problem. A demo follows.

Load some data: the twenty newsgroups text corpus.

>>> from sklearn.datasets import fetch_20newsgroups
>>> from sklearn.decomposition import NMF
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.preprocessing import normalize
>>> data = fetch_20newsgroups().data
>>> X = CountVectorizer(dtype=float).fit_transform(data)
>>> X = normalize(X)
>>> X
<11314x130107 sparse matrix of type '<type 'numpy.float64'>'
    with 1787565 stored elements in Compressed Sparse Column format>

Now fit an NMF model with 10 components.

>>> nmf = NMF(n_components=10, tol=.01)
>>> Xnmf = nmf.fit_transform(X)

I tweaked the tolerance option to make this converge in a few seconds. With the default tolerance, it takes quite a bit longer. The memory usage for this problem is around 360MB.
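For a square sparse matrix like the one in the question, the same approach can be sketched as a plain script. This is only an illustration under assumptions: the helper name `fit_sparse_nmf` is made up for this example, the input is a synthetic random matrix standing in for the real data, and the size is kept small (2000 x 2000) so it runs quickly. The `init`, `max_iter` and `random_state` arguments are from recent scikit-learn releases and were not part of the demo above.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.decomposition import NMF


def fit_sparse_nmf(X, n_components=10, tol=0.01, max_iter=200, seed=0):
    """Hypothetical helper: fit NMF on a non-negative sparse matrix.

    Returns the dense factors W (n_samples x n_components) and
    H (n_components x n_features), so that X is approximated by W @ H.
    """
    model = NMF(n_components=n_components, tol=tol, max_iter=max_iter,
                init="nndsvd", random_state=seed)
    W = model.fit_transform(X)  # X stays sparse; only W and H are dense
    H = model.components_
    return W, H


# Synthetic stand-in for the real data (an assumption, not the asker's
# matrix): uniform random non-negative entries at 0.1% density.
n = 2000
X = sp.random(n, n, density=0.001, format="csr",
              random_state=0, dtype=np.float64)

W, H = fit_sparse_nmf(X)
print(W.shape, H.shape)
```

Only the two factors are stored densely, so for a 50k x 50k input with 10 components they hold about a million floats in total. The looser `tol` trades accuracy for speed in the same way as in the demo above.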

Disclaimer: I'm a contributor to this implementation, so this is not unbiased advice.