As may or may not be evident from the question, I'm pretty new to R and I could do with a bit of help on this.

When creating topic models, I've experimented with two approaches: LDA from the topicmodels package (code (A) below) and the lda package with LDAvis (code (B) below). Approach (A) lets me find the posterior probability of each topic occurring in each document in my corpus, which I have used to run regressions with variables from other datasets. Approach (B) generates 'better', more coherent topics than (A), but I haven't been able to work out how to get the per-document posterior topic probabilities from it, or whether that's even possible.

All advice greatly appreciated.

Thank you!

(A)

library(topicmodels)
set.seed(1)
# fit a 23-topic LDA model; the control seed makes the fit reproducible
P5LDA4 <- LDA(P592dfm, k = 23, control = list(seed = 1))
# top 30 terms per topic
terms(P5LDA4, k = 30)

# posterior probability of each topic for each document (rows = documents)
postTopics <- data.frame(posterior(P5LDA4)$topics)
postTopics
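
For reference, a minimal sketch of how these posterior shares can feed into a regression; 'covars' and 'outcome' are hypothetical stand-ins for the variables from the other datasets:

# the columns of postTopics come out as X1..X23 (one per topic)
postTopics$docid <- rownames(postTopics)
# 'covars' is a hypothetical data frame with one row per document (docid, outcome, ...)
merged <- merge(postTopics, covars, by = "docid")
summary(lm(outcome ~ X1 + X2, data = merged))  # topic 1 and 2 shares as regressors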

(B)

# MCMC and model tuning parameters:
K <- 23       # number of topics
G <- 5000     # number of Gibbs sampling iterations
alpha <- 0.02 # document-topic prior
eta <- 0.02   # topic-term prior
# convert the quanteda dfm to the format the lda package expects
library(quanteda)
dfmlda <- convert(newdfm, to = "lda")
# fit the model by collapsed Gibbs sampling
library(lda)
set.seed(1)
t1 <- Sys.time()
fit <- lda.collapsed.gibbs.sampler(documents = dfmlda$documents, K = K,
                                   vocab = dfmlda$vocab,
                                   num.iterations = G, alpha = alpha,
                                   eta = eta, initial = NULL, burnin = 0,
                                   compute.log.likelihood = TRUE)
t2 <- Sys.time()
t2 - t1
#Time difference of 3.13337 mins
save(fit, file = "./fit.RData")
load("./fit.RData")
library(LDAvis)
set.seed(1)
# phi: K x W topic-term distributions; theta: D x K document-topic distributions
json <- createJSON(phi = t(apply(t(fit$topics) + eta, 2, function(x) x/sum(x))),
                   theta = t(apply(fit$document_sums + alpha, 2, function(x) x/sum(x))),
                   doc.length = ntoken(newdfm),
                   vocab = features(newdfm),  # featnames(newdfm) in newer quanteda
                   term.frequency = colSums(newdfm))
serVis(json, out.dir = "./visColl", open.browser = TRUE)

1 Answer

Answer from Franzi:

In your code (B), you already compute the posteriors when you build the JSON:

  1. theta, a D×K matrix, is the posterior of the document-topic distributions: row d gives the topic probabilities for document d, which is exactly the per-document posterior you are after (see the sketch below).
  2. phi, a K×W matrix, is the posterior of the topic-term distributions.
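
A minimal sketch of pulling theta out on its own for use in regressions, mirroring postTopics from code (A); the rownames assignment assumes the document list returned by convert() kept the document names, so adjust it to however your documents are identified:

# per-document posterior topic shares: rows = documents, columns = topics
theta <- t(apply(fit$document_sums + alpha, 2, function(x) x/sum(x)))
postTopicsB <- data.frame(theta)
# assumption: dfmlda$documents retains document names after conversion
rownames(postTopicsB) <- names(dfmlda$documents)
rowSums(postTopicsB)  # each row sums to 1
head(postTopicsB)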

Hope that helps!