I've successfully run Mahout LDA and displayed the output using the mahout ldatopics command.
For example, if my topics are science and sports, the output looks like this:
topic 0: basketball, play, baseball
topic 1: research, study, philosophy
My question now is: how can I identify which group or cluster an individual article belongs to? Is there an ID number or some other kind of tracking, so that every new article I add is assigned to a specific cluster/topic?
If I already have the cluster, what's the next step?
Thanks
I've been looking through the source code and I can't find any mention of a theta matrix for calculating the probability of topics given a document. There's also no input for an alpha value to estimate the topics per document, and the LDAState class has a logProbWordGivenTopic(int, int) method but nothing like getProbTopicGivenDocument().
Given that, I can only assume the Mahout implementation of LDA doesn't deal with discovering the topic distribution for specific documents. I'd love to be wrong, though, if someone else knows better.
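If the per-document step really is missing, one possible workaround is to estimate the topic distribution yourself from the word-given-topic probabilities that the model does expose. The following is only a rough sketch in plain Python, not Mahout code: the function name infer_topic_distribution, the alpha default and the toy probabilities are all made up for illustration, and it uses a simple EM "fold-in" with the topic-word probabilities held fixed rather than whatever inference Mahout runs internally.

```python
import math


def infer_topic_distribution(doc_word_ids, log_prob_word_given_topic,
                             alpha=0.1, iterations=50):
    """Estimate p(topic | document) with the topic-word probabilities held fixed.

    Hypothetical helper, not part of Mahout.
    log_prob_word_given_topic[k][w] is log p(word w | topic k).
    alpha is an assumed symmetric smoothing value added to every topic.
    """
    num_topics = len(log_prob_word_given_topic)
    theta = [1.0 / num_topics] * num_topics  # start from a uniform mixture

    for _ in range(iterations):
        counts = [alpha] * num_topics
        for w in doc_word_ids:
            # E-step: share this word occurrence among the topics
            weights = [theta[k] * math.exp(log_prob_word_given_topic[k][w])
                       for k in range(num_topics)]
            total = sum(weights)
            if total == 0.0:
                continue  # word has zero probability under every topic
            for k in range(num_topics):
                counts[k] += weights[k] / total
        # M-step: renormalise the expected counts into a distribution
        norm = sum(counts)
        theta = [c / norm for c in counts]
    return theta


# Toy model matching the example topics in the question (numbers are invented).
vocab = ["basketball", "play", "baseball", "research", "study", "philosophy"]
word_id = {word: i for i, word in enumerate(vocab)}
prob_word_given_topic = [
    [0.4, 0.3, 0.28, 0.01, 0.005, 0.005],   # topic 0: sports
    [0.005, 0.01, 0.005, 0.4, 0.3, 0.28],   # topic 1: science
]
log_probs = [[math.log(p) for p in row] for row in prob_word_given_topic]

article = ["basketball", "play", "play", "study"]
theta = infer_topic_distribution([word_id[w] for w in article], log_probs)
best_topic = max(range(len(theta)), key=lambda k: theta[k])
print(best_topic, theta)
```

A new article would then be assigned to the topic with the largest entry in theta, or you could keep the whole vector if an article can belong to several topics at once. In practice the log probabilities would have to come from the trained model, for example through something like logProbWordGivenTopic, instead of the hard-coded table used here.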