I have a huge corpus, and I'm interested only in the appearance of a handful of terms that I know up front. Is there a way to create a term document matrix from the corpus using the tm package, where only the terms I specify up front are used and included?
I know I can subset the resulting TermDocumentMatrix of the corpus, but I want to avoid building the full term document matrix in the first place because of memory constraints.
You can modify a corpus to keep only the terms you want by building a custom transformation function. See the vignette for the tm package and the help for the content_transformer function for more information.
(FYI, the second line of the code is adapted from this SO answer.)
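As a rough illustration of the idea, here is a minimal sketch of such a transformation. It is not the original code from this answer: the helper name keepOnlyWords, the sample documents, and the keep vector are all made up for the example, and it assumes the terms are plain words with no regex metacharacters.

```r
library(tm)

# Hypothetical helper (name chosen for this example): drops everything
# from a document except the words listed in `words`.
keepOnlyWords <- content_transformer(function(x, words) {
  # Build one alternation pattern, e.g. "\\b(quick|dog)\\b".
  # Assumes `words` contains no regex metacharacters.
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  # Pull out every match per element and glue the matches back together.
  matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
  vapply(matches, paste, character(1), collapse = " ")
})

# Made-up sample data, only for illustration
docs   <- c("the quick brown fox", "a lazy dog and a quick cat")
keep   <- c("quick", "dog")

corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, keepOnlyWords, keep)

# The matrix is now built from the reduced documents only
tdm <- TermDocumentMatrix(corpus)
inspect(tdm)
```

Because the documents are reduced before TermDocumentMatrix is called, the matrix should only ever contain the terms in keep, so the full vocabulary is never tabulated.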
Here's the result of running the transformation function:
Here's the original text I used to create the corpus: