Using online LDA to predict on test data


I am using online LDA for a topic modeling task. My code is based on the original online LDA paper (Hoffman, Blei and Bach, "Online Learning for Latent Dirichlet Allocation", NIPS, 2010), and the code is available at https://github.com/blei-lab/onlineldavb.

My training set has ~167,000 documents. The code writes lambda files as output, which I use to print the topics (printtopics.py from https://github.com/wellecks/online_lda_python). But I am not sure how to use the trained model to find topics for new test data (similar to model.get_document_topics in gensim). How can I do this?


There are 2 answers

0
Atendra Gautam On

Apply the same preprocessing steps to the test data (tokenization etc.), then use your training vocabulary to transform the test data into a gensim corpus.

Once you have the test corpus, use the trained LDA model to find the document-topic distribution. Hope this helps.
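The preprocessing step above can be sketched in plain Python, without gensim. This is only an illustration under stated assumptions: the function name docs_to_ids_and_counts and the lowercase, letters-only tokenizer are hypothetical, and the tokenizer must match whatever was used at training time. Words that are not in the training vocabulary are dropped, since the trained model has no parameters for them. If your copy of onlineldavb.py has a parse_doc_list helper, it likely serves the same purpose.

```python
import re
from collections import Counter


def docs_to_ids_and_counts(docs, vocab):
    """Map raw test documents onto the training vocabulary.

    docs  : list of strings (the test documents)
    vocab : list of words, in the same order used during training
    Returns two parallel lists: word ids per document and their counts.
    """
    word_to_id = {word: i for i, word in enumerate(vocab)}
    wordids, wordcts = [], []
    for doc in docs:
        # Assumed tokenizer: lowercase, keep runs of ASCII letters only.
        tokens = re.findall(r"[a-z]+", doc.lower())
        # Count only in-vocabulary tokens; unseen words are skipped.
        counts = Counter(word_to_id[t] for t in tokens if t in word_to_id)
        ids = sorted(counts)
        wordids.append(ids)
        wordcts.append([counts[i] for i in ids])
    return wordids, wordcts


if __name__ == "__main__":
    vocab = ["topic", "model", "data"]
    ids, cts = docs_to_ids_and_counts(["Topic model, topic DATA unseen"], vocab)
    print(ids, cts)
```

The two returned lists are the sparse bag-of-words form that a variational E-step typically consumes: one list of distinct word ids and one list of matching counts per document.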

0
Dan D. On

The code you already have is enough to do this. What you have is lambda (the word-topic matrix); what you want to compute is gamma (the document-topic matrix).

All you need to do is call OnlineLDA.do_e_step on the test documents; the gamma values it returns are the topic vectors. Performance might be improved by stripping the sstats computation out of it, as those are only needed to update lambda. The result would be a function that only infers the topic vectors for the model.

You don't need to update the model, because you aren't training it; updating is what update_lambda does after it calls do_e_step.
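A minimal sketch of such an inference-only function is given here, assuming NumPy and SciPy are installed. It follows the standard variational E-step for LDA with lambda held fixed and skips the sufficient statistics entirely. The name infer_gamma, the default iteration limit, and the tolerance are assumptions for illustration, not part of the original code. lam is assumed to have shape (K topics, W vocabulary words), and alpha should be the same value used in training.

```python
import numpy as np
from scipy.special import digamma


def dirichlet_expectation(a):
    """E[log x] for x ~ Dirichlet(a); a is a vector or a matrix of rows."""
    if a.ndim == 1:
        return digamma(a) - digamma(a.sum())
    return digamma(a) - digamma(a.sum(axis=1))[:, np.newaxis]


def infer_gamma(lam, wordids, wordcts, alpha, max_iter=100, tol=1e-3, seed=0):
    """Infer document-topic parameters (gamma) with lambda held fixed."""
    rng = np.random.default_rng(seed)
    n_topics = lam.shape[0]
    exp_elog_beta = np.exp(dirichlet_expectation(lam))  # shape (K, W)
    gamma = rng.gamma(100.0, 1.0 / 100.0, (len(wordids), n_topics))
    for d, (ids, cts) in enumerate(zip(wordids, wordcts)):
        ids = np.asarray(ids, dtype=int)
        cts = np.asarray(cts, dtype=float)
        gamma_d = gamma[d]
        beta_d = exp_elog_beta[:, ids]  # topics x words in this document
        for _ in range(max_iter):
            last = gamma_d
            exp_elog_theta = np.exp(dirichlet_expectation(gamma_d))
            # Normalizer of phi for each word; epsilon avoids division by 0.
            phinorm = exp_elog_theta @ beta_d + 1e-100
            gamma_d = alpha + exp_elog_theta * ((cts / phinorm) @ beta_d.T)
            if np.mean(np.abs(gamma_d - last)) < tol:
                break
        gamma[d] = gamma_d
    return gamma


if __name__ == "__main__":
    # Toy lambda: topic 0 favors words 0 and 1, topic 1 favors words 2 and 3.
    lam = np.array([[50.0, 50.0, 1.0, 1.0],
                    [1.0, 1.0, 50.0, 50.0]])
    gamma = infer_gamma(lam, [[0, 1], [2, 3]], [[3, 2], [4, 1]], alpha=0.5)
    theta = gamma / gamma.sum(axis=1, keepdims=True)  # topic proportions
    print(theta)
```

Normalizing each row of gamma gives per-document topic proportions, which is roughly what get_document_topics reports in gensim. If the lambda files are plain-text matrices, np.loadtxt should be able to load one as lam, but check the shape against your vocabulary size before use.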