TermDocumentMatrix as.matrix uses large amounts of memory


I'm currently using the tm package to extract terms to cluster on for duplicate detection in a decently sized database of 25k items (30 MB). This runs fine on my desktop, but when I try to run it on my server it seems to take an ungodly amount of time. On closer inspection I found that I had blown through 4 GB of swap running the line apply(posts.TmDoc, 1, sum) to calculate the frequencies of the terms. Furthermore, even running as.matrix produces an object of around 3 GB on my desktop; see https://i.stack.imgur.com/yCqVf.jpg

Is this necessary just to generate a frequency count for 18k terms on 25k items? Is there any other way to generate the frequency count without coercing the TermDocumentMatrix to a matrix or a vector?
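As an aside (not from the answer below): tm stores a TermDocumentMatrix as a sparse simple_triplet_matrix from the slam package, so the per-term totals can likely be computed directly on the sparse structure with slam::row_sums instead of apply. A minimal sketch, reusing the variable names from the code listing further down:

library(slam)

# Per-term frequencies computed on the sparse representation,
# avoiding as.matrix() and the dense 18k x 25k allocation
term_freqs = row_sums(posts.TmDoc)
clustterms = names(which(term_freqs >= min_term_occurance & term_freqs < max_term_occurance))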

I cannot remove terms based on sparseness, as that's how the actual algorithm is implemented: it looks for terms that are common to at least 2 but not more than 50 items and groups on them, calculating a similarity value for each group (a rough sketch of that grouping step follows the code listing below).

Here is the code in context for reference

library(tm)

min_word_length = 5
max_word_length = Inf
max_term_occurance = 50
min_term_occurance = 2


# Get All The Posts
Posts = db.getAllPosts()
posts.corpus = Corpus(VectorSource(Posts[,"provider_title"]))

# remove things we don't want
posts.corpus = tm_map(posts.corpus,content_transformer(tolower))
posts.corpus = tm_map(posts.corpus, removePunctuation)
posts.corpus = tm_map(posts.corpus, removeNumbers)
posts.corpus = tm_map(posts.corpus, removeWords, stopwords('english'))

# keep only words of at least min_word_length (5) characters
posts.TmDoc = TermDocumentMatrix(posts.corpus, control=list(wordLengths=c(min_word_length, max_word_length)))

# get the terms that occur at least twice but fewer than 50 times
clustterms = names(which(apply(posts.TmDoc, 1, sum) >= min_term_occurance  & apply(posts.TmDoc, 1, sum) < max_term_occurance))
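For context only, here is a rough sketch of the grouping step described above (hypothetical: the actual similarity calculation is not shown in this question). It reads the sparse triplet slots of the TermDocumentMatrix directly, so it also avoids as.matrix:

# Hypothetical sketch, not the actual algorithm: for each clustering term,
# collect the indices of the documents (columns) that contain it.
group_documents = function(tdm, terms) {
  term_rows = match(terms, Terms(tdm))
  setNames(lapply(term_rows, function(i) tdm$j[tdm$i == i]), terms)
}

post_groups = group_documents(posts.TmDoc, clustterms)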

There is 1 answer

Answered by Matt Bucci

Because I never actually need the frequency counts, I can use the findFreqTerms function instead:

setdiff(findFreqTerms(posts.TmDoc, 2), findFreqTerms(posts.TmDoc, 50))

is the same as

names(which(apply(posts.TmDoc, 1, sum) >= min_term_occurance  & apply(posts.TmDoc, 1, sum) < max_term_occurance))

but runs nearly instantaneously.
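A small follow-up (not part of the original answer): findFreqTerms also accepts an upper bound, so the setdiff can likely be collapsed into a single call:

# lowfreq and highfreq are both inclusive, so 2..49 matches ">= 2 & < 50"
clustterms = findFreqTerms(posts.TmDoc, min_term_occurance, max_term_occurance - 1)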