LDA with tm package in R using bigrams

Question

Asked by dulla on 11 June 2015 at 06:24

I have a CSV file in which every row is a document, and I need to run LDA on it. I have the following code:

library(tm)
library(SnowballC)
library(topicmodels)
library(RWeka)

X = read.csv('doc.csv',sep=",",quote="\"",stringsAsFactors=FALSE)

corpus <- Corpus(VectorSource(X))
corpus <- tm_map(tm_map(tm_map(corpus, stripWhitespace), tolower), stemDocument)
corpus <- tm_map(corpus, PlainTextDocument)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm <- DocumentTermMatrix(corpus, control = list(tokenize=BigramTokenizer,weighting=weightTfIdf))

At this point, inspecting the dtm object gives:

<<DocumentTermMatrix (documents: 52, terms: 477)>>
Non-/sparse entries: 492/24312
Sparsity           : 98%
Maximal term length: 20
Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

Next I run LDA on it:

rowTotals <- apply(dtm , 1, sum) 
dtm.new   <- dtm[rowTotals> 0, ]
g = LDA(dtm.new,10,method = 'VEM',control=NULL,model=NULL)

I get the following error:

Error in LDA(dtm.new, 10, method = "VEM", control = NULL, model = NULL) : 
  The DocumentTermMatrix needs to have a term frequency weighting

The document-term matrix was clearly weighted. What am I doing wrong?

Kindly help.

There is 1 answer

**peterd** · Accepted Answer · 2015-06-11T08:57:20+00:00

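The document-term matrix needs a term frequency weighting. LDA works on raw term counts, so build the matrix with weightTf (the default weighting) instead of weightTfIdf: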

dtm <- DocumentTermMatrix(corpus,
                          control = list(tokenize = BigramTokenizer,
                                         weighting = weightTf))
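
To see why the weighting matters, here is a minimal base-R sketch on a made-up count matrix. Raw term frequencies are whole-number counts, whereas tf-idf rescales them into fractional values that a count-based model such as LDA cannot read as word occurrences. The toy data, the helper name is_count_matrix, and the simplified tf-idf formula are assumptions for illustration only; this is not the code tm or topicmodels runs internally.

```r
# Toy document-term count matrix: 3 documents x 3 bigram terms (made-up data).
counts <- matrix(
  c(2, 0, 1,
    0, 3, 1,
    1, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Docs  = c("doc1", "doc2", "doc3"),
    Terms = c("topic model", "text mine", "term matrix")
  )
)

# Simplified tf-idf (an assumption, for illustration): term frequency
# normalized by document length, times log2(n_docs / document frequency).
n_docs   <- nrow(counts)
doc_freq <- colSums(counts > 0)          # number of documents containing each term
idf      <- log2(n_docs / doc_freq)
tf_norm  <- counts / rowSums(counts)     # each row divided by its own total
tfidf    <- sweep(tf_norm, 2, idf, "*")  # scale each column by its idf

# Hypothetical helper: TRUE when every entry is a whole number.
is_count_matrix <- function(m) all(m == round(m))

is_count_matrix(counts)  # TRUE:  raw counts, usable as LDA input
is_count_matrix(tfidf)   # FALSE: fractional weights, not counts
```

With tm itself, the weighting a matrix carries can, as far as I know, be inspected with attr(dtm, "weighting"); it should report term frequency before the matrix is passed to LDA.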
