I have ~100,000 lists of strings of the form:
['the: 652', 'of: 216', 'in: 168', 'to: 159', 'is: 145']
and so on, which together make up my corpus. Each list holds the words from one document along with their counts.
How can I put this corpus into a form that I can feed into CountVectorizer?
Is there a quicker way than turning each list into a string containing 'the' 652 times, 'of' 216 times, etc.?
Assuming that what you're trying to achieve is a vectorized corpus in sparse matrix format, along with a trained vectorizer, you can simulate the vectorization process without repeating the data:
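A minimal sketch of that idea, not necessarily the exact code originally posted: parse each 'word: count' entry, assign every new word the next free column index, and build a SciPy CSR matrix directly from the counts. The two-document `corpus` below is made up for illustration, and passing the resulting mapping through CountVectorizer's `vocabulary` parameter is one way (assumed here) to get a vectorizer that agrees with the matrix columns.

```python
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus in the question's 'word: count' format.
corpus = [
    ['the: 652', 'of: 216', 'in: 168', 'to: 159', 'is: 145'],
    ['the: 10', 'cat: 3', 'in: 2'],
]

vocabulary = {}                      # word -> column index, in first-seen order
data, indices, indptr = [], [], [0]  # raw CSR components

for doc in corpus:
    for entry in doc:
        # Split on the last ': ' so a word containing a colon is kept whole.
        word, count = entry.rsplit(': ', 1)
        indices.append(vocabulary.setdefault(word, len(vocabulary)))
        data.append(int(count))
    indptr.append(len(indices))      # marks the end of this document's row

# One row per document, one column per distinct word.
X = csr_matrix((data, indices, indptr),
               shape=(len(corpus), len(vocabulary)), dtype=int)

# A vectorizer with a fixed vocabulary needs no call to fit().
vectorizer = CountVectorizer(vocabulary=vocabulary)

print(vocabulary)
print(X.toarray())
```

Note that the columns follow the order in which words are first seen, rather than the alphabetical order that `fit_transform` would produce on raw text.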
In this example, the output will be:
The resulting vectorizer can then be used to vectorize further documents:
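As a self-contained sketch of this step, assuming the vocabulary was already collected from the count lists (the mapping and the two raw documents below are invented for illustration), a CountVectorizer given a fixed `vocabulary` can transform new text without being fitted. Words outside the vocabulary are simply ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical vocabulary recovered from the 'word: count' lists.
vocabulary = {'the': 0, 'of': 1, 'in': 2, 'to': 3, 'is': 4}
vectorizer = CountVectorizer(vocabulary=vocabulary)

# New documents arrive as ordinary raw text.
new_docs = ['the cat is in the hat', 'to be of use']
X_new = vectorizer.transform(new_docs)  # sparse matrix, columns match vocabulary

print(X_new.toarray().tolist())
# [[2, 0, 1, 0, 1], [0, 1, 0, 1, 0]]
```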
Which yields: