Term document matrix and cosine similarity in Python

Question

Asked by abhinavkulkarni on 07 August 2013 at 20:40

I have the following situation that I want to address in Python (preferably using numpy and scipy):

I have a collection of documents that I want to convert to a sparse term-document matrix.
I want to extract the sparse vector representation of each document (i.e. a row in the matrix) and find the top 10 most similar documents by cosine similarity within a certain subset of documents (documents are labelled with categories, and I want to find similar documents within the same category).

How do I achieve this in Python? I know I can use scipy.sparse.coo_matrix to represent documents as sparse vectors and take the dot product to find the cosine similarity, but how do I convert the entire corpus to a large but sparse term-document matrix (so that I can also extract its rows as scipy.sparse.coo_matrix row vectors)?

Thanks.

There are 2 answers

Gunjan On 20 September 2013 at 11:00

You can refer to this question:

Python: tf-idf-cosine: to find document similarity

My answer there shows how to compute the cosine similarity with the scikit-learn package.
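For readers who do not follow the link, here is a minimal sketch of that approach, not the code from the linked answer. It assumes scikit-learn is installed; the sample sentences and the variable names (`docs`, `counts`, `sims`, `ranked`) are made up for illustration. `CountVectorizer` builds a sparse term-document matrix of raw counts, and `cosine_similarity` normalises the rows itself, so no separate normalisation step is needed.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative corpus (an assumption, not data from the question).
docs = ["I'd like an apple",
        "An apple a day keeps the doctor away",
        "Never compare an apple to an orange",
        "I prefer scikit-learn to Orange"]

# Sparse term-document matrix: one row per document, one column per term.
counts = CountVectorizer().fit_transform(docs)

# Cosine similarity of the first document against every document.
# Slicing with 0:1 keeps the row two-dimensional, as the function expects.
sims = cosine_similarity(counts[0:1], counts).ravel()

# Document indices ordered from most to least similar (the query comes first).
ranked = sims.argsort()[::-1]
print(ranked, sims[ranked])
```

The query document always ranks first with similarity 1, so drop it before reporting neighbours.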

**elyase** · Accepted Answer · 2013-08-07T21:38:18+00:00

May I recommend you take a look at scikit-learn? It is a very well-regarded library in the Python community with a simple and consistent API. It also implements a cosine similarity metric. This is an example, taken from here, of how you could do it in three lines of code:

>>> from sklearn.feature_extraction.text import TfidfVectorizer

>>> vect = TfidfVectorizer(min_df=1)
>>> tfidf = vect.fit_transform(["I'd like an apple",
...                             "An apple a day keeps the doctor away",
...                             "Never compare an apple to an orange",
...                             "I prefer scikit-learn to Orange"])
>>> (tfidf @ tfidf.T).toarray()  # rows are L2-normalised, so this is the cosine similarity
array([[ 1.        ,  0.25082859,  0.39482963,  0.        ],
       [ 0.25082859,  1.        ,  0.22057609,  0.        ],
       [ 0.39482963,  0.22057609,  1.        ,  0.26264139],
       [ 0.        ,  0.        ,  0.26264139,  1.        ]])
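The question also asks for the top 10 similar documents within the same category. The following is a rough sketch of one way to do that on top of the matrix above, assuming scikit-learn and numpy are installed. The helper `top_k_in_category` and the `labels` list are hypothetical names introduced here for illustration; they are not part of scikit-learn.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def top_k_in_category(tfidf, labels, query_idx, k=10):
    """Return (index, similarity) pairs for the k documents most similar to
    document query_idx, restricted to documents that share its label."""
    labels = np.asarray(labels)
    # Candidate rows: same category as the query, excluding the query itself.
    candidates = np.flatnonzero(labels == labels[query_idx])
    candidates = candidates[candidates != query_idx]
    if candidates.size == 0:
        return []
    # TfidfVectorizer L2-normalises rows by default, so the sparse dot
    # product of two rows is their cosine similarity.
    sims = (tfidf[candidates] @ tfidf[[query_idx]].T).toarray().ravel()
    order = np.argsort(-sims)[:k]
    return [(int(candidates[i]), float(sims[i])) for i in order]


docs = ["I'd like an apple",
        "An apple a day keeps the doctor away",
        "Never compare an apple to an orange",
        "I prefer scikit-learn to Orange"]
labels = ["fruit", "fruit", "fruit", "software"]  # made-up categories

tfidf = TfidfVectorizer().fit_transform(docs)  # sparse, one row per document
result = top_k_in_category(tfidf, labels, query_idx=0, k=10)
print(result)
```

Only the rows of the query's category are multiplied, so the full document-by-document similarity matrix is never built. A single row can be pulled out with `tfidf[[i]]`, and `.tocoo()` converts it to COO format if that representation is needed.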
