I have the following situation that I want to address using Python (preferably using numpy and scipy):
- Collection of documents that I want to convert to a sparse term document matrix.
- Extract the sparse vector representation of each document (i.e. a row in the matrix) and find the top 10 most similar documents using cosine similarity within a certain subset of documents (documents are labelled with categories, and I want to find similar documents within the same category).
How do I achieve this in Python? I know I can use scipy.sparse.coo_matrix to represent documents as sparse vectors and take the dot product to find cosine similarity, but how do I convert the entire corpus to a large but sparse term document matrix (so that I can also extract its rows as scipy.sparse.coo_matrix row vectors)?
Thanks.
May I recommend you take a look at scikit-learn? It is a very well-regarded library in the Python community with a simple and consistent API. It also implements a cosine similarity metric. This is an example taken from here of how you could do it in 3 lines of code:
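What follows is a minimal sketch of that approach rather than the linked example itself. It assumes TF-IDF weighting via TfidfVectorizer (CountVectorizer would give raw counts instead); the toy corpus, the labels array, and the top_similar helper are hypothetical names made up for illustration. fit_transform returns a scipy.sparse CSR matrix with one row per document, and cosine_similarity accepts sparse input directly.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus and category labels, for illustration only.
docs = [
    "the cat sat on the mat",
    "a cat and a dog played on the mat",
    "stock markets fell sharply today",
    "investors sold stock as markets fell",
    "the dog chased the cat",
]
labels = np.array(["pets", "pets", "finance", "finance", "pets"])

# Sparse document-term matrix (scipy.sparse CSR), one row per document.
X = TfidfVectorizer().fit_transform(docs)


def top_similar(query_idx, k=10):
    """Return up to k (doc_index, similarity) pairs from the query's category."""
    # Candidates: same category as the query, excluding the query itself.
    candidates = np.flatnonzero(labels == labels[query_idx])
    candidates = candidates[candidates != query_idx]
    if candidates.size == 0:
        return []
    # Index with a list so the query stays a 2-D (1 x n_terms) sparse row.
    sims = cosine_similarity(X[[query_idx]], X[candidates]).ravel()
    # Sort candidates by descending similarity and keep the first k.
    order = np.argsort(-sims)[:k]
    return [(int(candidates[i]), float(sims[i])) for i in order]


result = top_similar(0)
```

If you need a single row in COO format, something like X[[i]].tocoo() should do it, although CSR is usually the more convenient format for row slicing. For a large corpus, computing similarities only against the same-category rows, as above, avoids building the full document-by-document similarity matrix.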