How to select only a subset of corpus terms for TermDocumentMatrix creation in tm

Question

4.6k views · Asked by Ricky on 19 November 2014 at 03:12

I have a huge corpus, and I'm interested only in the occurrences of a handful of terms that I know up front. Is there a way to create a term document matrix from the corpus with the tm package so that only the terms I specify up front are used and included?

I know I can subset the resulting TermDocumentMatrix of the corpus, but I want to avoid building the full term document matrix in the first place because of memory constraints.

There are 2 answers

Vezir On 19 February 2016 at 09:44

Another way to filter a corpus: first set a metadata attribute, say language, on the documents you care about. Loop over the corpus elements with an index i, check whatever condition you want, and assign the attribute; then filter on that attribute.

# Inside a loop over i: tag the documents you want to keep
corpusz[[i]]$meta["language"] <- 'tur'

# Build a logical index from the tags and subset the corpus
idx <- meta(corpusz, "language") == 'tur'
filtered <- corpusz[idx]

Now filtered contains only the corpus elements we want.
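A self-contained version of these steps might look like the following. It is only a sketch: the three sample texts and the rule for which documents get tagged are invented for illustration, and it uses the meta() accessor and its replacement form instead of touching $meta directly. Note that this filters whole documents, not individual terms.

```r
library(tm)

# Hypothetical sample corpus (not from the original answer)
corpusz <- VCorpus(VectorSource(c("first text", "second text", "third text")))

# Illustrative rule: tag documents 1 and 3; the others keep the default "en"
for (i in c(1, 3)) {
  meta(corpusz[[i]], "language") <- "tur"
}

# Collect the document-level ("local") tags and keep the matching documents
lang <- unlist(meta(corpusz, "language", type = "local"))
filtered <- corpusz[lang == "tur"]
length(filtered)
```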

**eipi10** · Accepted Answer · 2014-11-19T04:40:52+00:00

You can modify a corpus to keep only the terms you want by building a custom transformation function. See the Vignette for the tm package and the help for the content_transformer function for more information:

library(tm)

# Create a corpus from the text listed below
corp = VCorpus(VectorSource(doc))

# Custom function to keep only the terms in "pattern" and remove everything else
(f <- content_transformer(function(x, pattern) 
  regmatches(x, gregexpr(pattern, x, perl=TRUE, ignore.case=TRUE))))

(FYI, the second line of code just above is adapted from this SO answer.)

# The pattern we'll search for
keep = "sleep|dream|die"

# Run the transformation function using the pattern above
tm_map(corp, f, keep)[[1]]

Here's the result of running the transformation function:

<<PlainTextDocument (metadata: 7)>>
  c("die", "sleep", "sleep", "die", "sleep", "sleep", "Dream")
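To get from this transformation to a term document matrix, one possible variation is to paste the matches back into a single string per document before calling TermDocumentMatrix. This is a sketch, not part of the original answer: the two sample strings and the names keep_terms and reduced are made up here. Keep in mind that the pattern matches substrings, so a word such as "died" would also contribute "die".

```r
library(tm)

# Hypothetical two-document corpus for illustration
docs <- c("To die, to sleep; to sleep, perchance to Dream",
          "Devoutly to be wished. To die, to sleep")
corp <- VCorpus(VectorSource(docs))

# Keep only the matches and collapse them into one space-separated string
keep_terms <- content_transformer(function(x, pattern) {
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE, ignore.case = TRUE))
  vapply(hits, paste, character(1), collapse = " ")
})

reduced <- tm_map(corp, keep_terms, "sleep|dream|die")

# The matrix is built from the reduced documents only
tdm <- TermDocumentMatrix(reduced)
as.matrix(tdm)
```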

Here's the original text I used to create the corpus:

doc = "To be, or not to be, that is the question—
Whether 'tis Nobler in the mind to suffer
The Slings and Arrows of outrageous Fortune,
Or to take Arms against a Sea of troubles,
And by opposing, end them? To die, to sleep—
No more; and by a sleep, to say we end
The Heart-ache, and the thousand Natural shocks
That Flesh is heir to? 'Tis a consummation
Devoutly to be wished. To die, to sleep,
To sleep, perchance to Dream; Aye, there's the rub"
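
For the original goal of restricting the matrix itself, another option worth trying is the dictionary entry of the control list that TermDocumentMatrix passes on to termFreq; according to the tm documentation, only the listed terms are tabulated. The sketch below is not part of the answer above: the sample strings and the names docs, terms, and tdm are assumptions for illustration, and the memory saving on a large corpus has not been measured here.

```r
library(tm)

# Hypothetical two-document corpus for illustration
docs <- c("To die, to sleep, to sleep, perchance to Dream",
          "The thousand Natural shocks that Flesh is heir to")
corp <- VCorpus(VectorSource(docs))

# Terms known up front (lower case, since tokens are lower-cased below)
terms <- c("sleep", "dream", "die")

# Only the dictionary terms become rows of the matrix
tdm <- TermDocumentMatrix(corp, control = list(
  tolower = TRUE,
  removePunctuation = TRUE,
  dictionary = terms
))
as.matrix(tdm)
```

The documents are still tokenized one by one, but the resulting matrix has one row per dictionary term instead of one row per distinct word in the corpus.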
