Python: how to turn list of word counts into format suitable for CountVectorizer

I have ~100,000 lists of strings of the form:
['the: 652', 'of: 216', 'in: 168', 'to: 159', 'is: 145'] etc.
which essentially makes up my corpus. Each list contains the words from a document and their word counts.

How can I put this corpus into a form that I can feed into CountVectorizer?

Is there a quicker way than turning each list into a string containing 'the' 652 times, 'of' 216 times, etc.?

Answer by Elisha (best answer)

Assuming that what you're trying to achieve is a vectorized corpus in sparse matrix format, along with a trained vectorizer, you can simulate the vectorization process without repeating the data:

from scipy.sparse import lil_matrix
from sklearn.feature_extraction.text import CountVectorizer

corpus = [['the: 652', 'of: 216', 'in: 168', 'to: 159', 'is: 145'],
          ['king: 20', 'of: 16', 'the: 400', 'jungle: 110']]


# Prepare a vocabulary for the vectorizer
vocabulary = {item.split(':')[0] for document in corpus for item in document}
indexed_vocabulary = {term: index for index, term in enumerate(vocabulary)}
vectorizer = CountVectorizer(vocabulary=indexed_vocabulary)

# Vectorize the corpus using the coordinates known to the vectorizer
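X = lil_matrix((len(corpus), len(vocabulary)))
for i, document in enumerate(corpus):
    # Parse each 'word: count' item into a (column, count) pair;
    # LIL rows keep their column indices sorted, so sort the pairs first
    pairs = sorted((vectorizer.vocabulary[item.split(':')[0]], int(item.split(':')[1]))
                   for item in document)
    X.rows[i] = [column for column, _ in pairs]
    X.data[i] = [count for _, count in pairs]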

# Convert the matrix to csr format to be compatible with vectorizer.transform output
X = X.tocsr()

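In this example, X.toarray() gives the following (the column order follows the vocabulary indices, which come from iterating over a set and can differ between runs):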

[[ 168.  216.    0.  159.  652.  145.    0.]
 [   0.   16.  110.    0.  400.    0.   20.]]
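If you want a reproducible column order and integer counts, a variation on the same idea is to sort the vocabulary and build the CSR matrix from (row, column, value) triplets. This is only a sketch, not part of the code above: parse_item and counts_to_csr are hypothetical helper names, and it assumes every item has the form 'word: count' with the count after the last colon.

```python
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import CountVectorizer

corpus = [['the: 652', 'of: 216', 'in: 168', 'to: 159', 'is: 145'],
          ['king: 20', 'of: 16', 'the: 400', 'jungle: 110']]


def parse_item(item):
    """Split 'word: count' on the last colon (hypothetical helper)."""
    word, _, count = item.rpartition(':')
    return word.strip(), int(count)


def counts_to_csr(documents):
    """Build (matrix, vocabulary) from lists of 'word: count' strings."""
    parsed = [[parse_item(item) for item in document] for document in documents]
    # Sorting the terms makes the column order reproducible between runs.
    terms = sorted({word for document in parsed for word, _ in document})
    vocabulary = {term: index for index, term in enumerate(terms)}

    row_ids, col_ids, values = [], [], []
    for row, document in enumerate(parsed):
        for word, count in document:
            row_ids.append(row)
            col_ids.append(vocabulary[word])
            values.append(count)

    shape = (len(parsed), len(vocabulary))
    # Duplicate (row, column) entries, if any, are summed during construction.
    matrix = csr_matrix((values, (row_ids, col_ids)), shape=shape, dtype=int)
    return matrix, vocabulary


X_sorted, sorted_vocabulary = counts_to_csr(corpus)
vectorizer_sorted = CountVectorizer(vocabulary=sorted_vocabulary)
print(sorted(sorted_vocabulary, key=sorted_vocabulary.get))
print(X_sorted.toarray())
```

With the two sample documents the columns come out alphabetically (in, is, jungle, king, of, the, to). Keep in mind that CountVectorizer lowercases its input and, with the default token_pattern, ignores single-character tokens, so vocabulary entries such as 'a' or capitalised words would never be matched by a later transform call unless you change those settings.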

The same vectorizer can then be used to vectorize further documents:

vectorizer.transform(['jungle kid is programming', 'the jungle machine learning jungle'])

Calling .toarray() on the result yields:

[[0 0 1 0 0 1 0]
 [0 0 2 0 1 0 0]]
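To convince yourself that placing the counts directly gives the same matrix as the slow "repeat every word" route from the question, you can compare the two on a small sample. The sketch below is an illustration under assumptions: the sample data and variable names are made up, the counts are kept small so the expanded strings stay short, and the matrix is built from coordinate triplets rather than through lil_matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import CountVectorizer

sample = [['king: 3', 'of: 2', 'the: 4'],
          ['jungle: 2', 'the: 1']]

# Parse every 'word: count' item once.
parsed = [[(item.rpartition(':')[0].strip(), int(item.rpartition(':')[2]))
           for item in document] for document in sample]
terms = sorted({word for document in parsed for word, _ in document})
vocabulary = {term: index for index, term in enumerate(terms)}
vectorizer = CountVectorizer(vocabulary=vocabulary)

# Shortcut: place the counts directly at (document, term) coordinates.
triplets = [(row, vocabulary[word], count)
            for row, document in enumerate(parsed) for word, count in document]
rows, cols, counts = zip(*triplets)
X_direct = csr_matrix((counts, (rows, cols)), shape=(len(sample), len(terms)))

# Slow reference: repeat each word `count` times and let the vectorizer count.
expanded = [' '.join(' '.join([word] * count) for word, count in document)
            for document in parsed]
X_reference = vectorizer.transform(expanded)

matches = np.array_equal(X_direct.toarray(), X_reference.toarray())
print(matches)
```

If this prints False on a sample of your real data, the usual suspects are words that the vectorizer's tokenizer would split, lowercase, or drop.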