Hierarchical LDA eats up all available memory and never finishes


I am waiting for my membership on the mailing list to be confirmed, so I thought I would ask here in the meantime to maybe speed things up a bit.

I am writing my master's thesis on topic modeling and use the Mallet implementations of LDA and HLDA.

I work on a corpus of over 4 million documents. LDA (ParallelTopicModel) handles the dataset without problems, but HLDA never gets further than about 5-6 iterations before filling up all the available memory (I even ran the program with 90 GB of heap). On smaller datasets (10-20k documents) it works fine.
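For completeness, I raise the heap with the standard JVM flag; the jar paths and main class below are placeholders for my actual setup:

java -Xmx90g -cp mallet.jar:mallet-deps.jar my.thesis.TrainHLDA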

That's how I train the model:

import cc.mallet.topics.HierarchicalLDA;
import cc.mallet.util.Randoms;

// trainInstances and testInstances are prebuilt InstanceLists
HierarchicalLDA hierarchicalLDAModel = new HierarchicalLDA();
hierarchicalLDAModel.initialize(trainInstances, testInstances, numLevels, new Randoms());
hierarchicalLDAModel.estimate(numIterations);

I'd gladly provide any other information you might need for troubleshooting; just comment and let me know.

Thank you very much in advance!


1 Answer

Answer by David Mimno

hLDA is a non-parametric model, which means that the number of parameters grows with the size of the data. There is currently no way to impose a maximum number of topics. The most effective way to limit the number of topics is to increase the topic-word smoothing parameter eta (NOT the CRP parameters). If this parameter is small, the model will prefer to create a new topic rather than add a low-probability word to an existing topic; making eta larger makes existing topics more tolerant of new words, so fewer topics get created.
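If I remember the current Mallet source correctly, HierarchicalLDA exposes setters for its hyperparameters, so you can raise eta before training. A minimal sketch; the value 1.0 is only a starting point to experiment with, not a recommendation:

HierarchicalLDA hlda = new HierarchicalLDA();
// Raising eta above its small default discourages the sampler from
// spawning new topics for low-probability words.
hlda.setEta(1.0);
hlda.initialize(trainInstances, testInstances, numLevels, new Randoms());
hlda.estimate(numIterations);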