Determine the document ID on Mahout LDA Output


I've successfully run Mahout LDA and displayed the output using the command mahout ldatopics.

For example, if my topics are science and sports, the output looks like this:

topic 0 basketball, play, baseball
topic 1 research, study, philosophy

My question now is: how can I identify each individual article's group or cluster? Is there an ID number or some other form of tracking, so that every new article I add is grouped into a specific cluster/topic?

If I already have the cluster, what's the next step?

Thanks


1 answer

Kevin

I've been looking through the source code, and I can't find any mention of a theta matrix for calculating the probability of topics given a document. There is also no input for an alpha value to estimate the topics per document, and the LDAState class has a logProbWordGivenTopic(int, int) method but nothing like getProbTopicGivenDocument(). So I can only assume that the Mahout implementation of LDA doesn't deal with discovering the topic distribution for specific documents. I'd love to be wrong, though, if someone else knows better.
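If the per-topic word probabilities can be exported, one possible workaround is to estimate the topic distribution of a document outside Mahout. The following is only a rough sketch, not Mahout's API: the class TopicInference, its methods, and the tiny six-word vocabulary are all hypothetical. It assumes you already have log p(word | topic) as a plain array (how those values are read out of Mahout is not shown) and it treats alpha as a simple additive pseudo-count. It runs a few EM iterations with the topics held fixed, which is one common way to "fold in" a new document.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Mahout.
class TopicInference {

    // Estimates p(topic | document) for one document while keeping the
    // topic-word probabilities fixed.
    //   logProbWordGivenTopic[k][w] : log p(word w | topic k), assumed given
    //   wordCounts[w]               : occurrences of word w in the document
    //   alpha                       : additive pseudo-count per topic (assumed > 0)
    static double[] inferTopicDistribution(double[][] logProbWordGivenTopic,
                                           int[] wordCounts,
                                           double alpha,
                                           int iterations) {
        int numTopics = logProbWordGivenTopic.length;
        double[] theta = new double[numTopics];
        Arrays.fill(theta, 1.0 / numTopics); // start from a uniform guess

        for (int iter = 0; iter < iterations; iter++) {
            double[] expected = new double[numTopics];
            Arrays.fill(expected, alpha);

            for (int w = 0; w < wordCounts.length; w++) {
                if (wordCounts[w] == 0) continue;
                // E-step: responsibility of each topic for word w
                double[] resp = new double[numTopics];
                double norm = 0.0;
                for (int k = 0; k < numTopics; k++) {
                    resp[k] = theta[k] * Math.exp(logProbWordGivenTopic[k][w]);
                    norm += resp[k];
                }
                if (norm == 0.0) continue; // word unknown to every topic
                for (int k = 0; k < numTopics; k++) {
                    expected[k] += wordCounts[w] * resp[k] / norm;
                }
            }

            // M-step: renormalize expected counts into a distribution
            double total = 0.0;
            for (double e : expected) total += e;
            for (int k = 0; k < numTopics; k++) theta[k] = expected[k] / total;
        }
        return theta;
    }

    // Index of the topic with the highest estimated probability.
    static int mostLikelyTopic(double[] theta) {
        int best = 0;
        for (int k = 1; k < theta.length; k++) {
            if (theta[k] > theta[best]) best = k;
        }
        return best;
    }

    // Converts a matrix of probabilities to log probabilities.
    static double[][] toLog(double[][] probs) {
        double[][] logs = new double[probs.length][];
        for (int k = 0; k < probs.length; k++) {
            logs[k] = new double[probs[k].length];
            for (int w = 0; w < probs[k].length; w++) {
                logs[k][w] = Math.log(probs[k][w]);
            }
        }
        return logs;
    }

    public static void main(String[] args) {
        // Made-up vocabulary order: basketball, play, baseball, research, study, philosophy
        double[][] probWordGivenTopic = {
            {0.30, 0.30, 0.30, 0.03, 0.03, 0.04}, // topic 0 (sports-like)
            {0.03, 0.03, 0.04, 0.30, 0.30, 0.30}  // topic 1 (science-like)
        };
        int[] newArticle = {2, 1, 1, 0, 0, 0};    // word counts of a new article
        double[] theta = inferTopicDistribution(toLog(probWordGivenTopic), newArticle, 0.1, 50);
        System.out.println("p(topic | doc) = " + Arrays.toString(theta));
        System.out.println("assigned topic = " + mostLikelyTopic(theta));
    }
}
```

With something like this, a new article would be assigned to whichever topic gets the largest estimated probability, and the document ID is simply whatever key you use to look up its word counts.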