When training a topic model with the topicmodels package using Gibbs sampling, the scores returned by posterior() differ from the scores stored in the trained model itself. When I use VEM instead of Gibbs, the scores are exactly the same. Why is that? Since I use the same data for training as for the posterior score calculation, I would expect Gibbs sampling to produce identical scores as well.

I use data from the package to show what I mean.

library(topicmodels)

data("AssociatedPress", package = "topicmodels")
# fit a 2-topic model with the default VEM method
lda <- LDA(AssociatedPress[1:20,], control = list(alpha = 0.1), k = 2)
# calculate posterior scores with the exact same data
scoresVEM <- as.data.frame( posterior(lda, AssociatedPress[1:20,])$topics)
# retrieve scores from the model itself
scoresVEM2 <- as.data.frame(lda@gamma)

# refit the same data with Gibbs sampling
lda <- LDA(AssociatedPress[1:20,], method = "Gibbs", control = list(nstart = 5, burnin = 2000, best = TRUE, seed = 1:5), k = 2)
# calculate posterior scores with the exact same data
scoresGibbs <- as.data.frame( posterior(lda, AssociatedPress[1:20,])$topics)
# retrieve scores from the model itself
scoresGibbs2 <- as.data.frame(lda@gamma)
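To quantify the discrepancy rather than eyeball it, one way (a sketch; the use of `all.equal` with `check.attributes = FALSE` to ignore the differing column names is my choice) is:

```r
# largest absolute difference between stored and re-computed Gibbs scores
max(abs(as.matrix(scoresGibbs) - as.matrix(scoresGibbs2)))

# numeric comparison ignoring column-name differences ("1"/"2" vs "V1"/"V2")
all.equal(as.matrix(scoresGibbs), as.matrix(scoresGibbs2),
          check.attributes = FALSE)
```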

Let's compare the data frames:

head(scoresVEM)
             1            2
1 9.998838e-01 0.0001162225
2 9.998872e-01 0.0001127924
3 9.998801e-01 0.0001198678
4 9.998523e-01 0.0001476551
5 4.628080e-04 0.9995371920
6 7.261089e-05 0.9999273891

head(scoresVEM2)
            V1           V2
1 9.998838e-01 0.0001162225
2 9.998872e-01 0.0001127924
3 9.998801e-01 0.0001198678
4 9.998523e-01 0.0001476551
5 4.628080e-04 0.9995371920
6 7.261089e-05 0.9999273891

So for VEM the model scores and the posterior scores are identical. Now the same comparison for Gibbs:

head(scoresGibbs)
          1         2
1 0.1629393 0.8370607
2 0.2990654 0.7009346
3 0.1639344 0.8360656
4 0.5214008 0.4785992
5 0.5086207 0.4913793
6 0.1889597 0.8110403

head(scoresGibbs2)
         V1        V2
1 0.1693291 0.8306709
2 0.3021807 0.6978193
3 0.1901639 0.8098361
4 0.5408560 0.4591440
5 0.5517241 0.4482759
6 0.2080679 0.7919321

The scores are slightly but consistently different. Why is that? Many thanks in advance!
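One diagnostic that might narrow things down (a sketch, not an answer): if posterior() performs new sampling for a Gibbs-trained model, then even two calls on the exact same data might disagree with each other, which would point at randomness in the scoring step rather than in the fit.

```r
# call posterior() twice on the same documents and the same fitted model
p1 <- posterior(lda, AssociatedPress[1:20,])$topics
p2 <- posterior(lda, AssociatedPress[1:20,])$topics

# if this is FALSE, the scoring itself is stochastic
identical(p1, p2)
```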
