Why is the term frequency displayed in my pyLDAvis visualization changing?


I am currently building an LDA model using bigrams as tokens. I have a basic cleaning function that lowercases the text and removes stopwords and punctuation, and a second function that takes the tokenized column of documents and converts each document to bigrams.
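For readers without the screenshots, the two preprocessing steps might look roughly like the sketch below. The function names, the tiny stopword set, and the `a_b` join format are all assumptions for illustration; the actual functions are not shown in this question, and the real bigrams appear to carry a trailing underscore (`stored_procedure_`).

```python
import string

# Hypothetical, deliberately tiny stopword list (the real one is not shown).
STOPWORDS = {"a", "an", "the", "is", "to", "of", "and"}


def clean(text):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    table = str.maketrans("", "", string.punctuation)
    tokens = text.lower().translate(table).split()
    return [t for t in tokens if t not in STOPWORDS]


def to_bigrams(tokens):
    """Join each pair of adjacent tokens with an underscore."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


tokens = clean("The stored procedure is slow.")
bigrams = to_bigrams(tokens)
```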

I am using Gensim for this, as the built-in corpus / dictionary methods make LDA pretty easy. Once I get the documents cleaned and converted to bigrams, I create the dictionary / corpus and use the built-in methods to view occurrences of a certain bigram.
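As a cross-check that does not depend on Gensim at all, the raw count can also be taken straight from the list of bigram documents. This is a minimal sketch with a hypothetical helper name and toy data; with Gensim, the same number should come from summing the counts that `dictionary.doc2bow` returns for the id in `dictionary.token2id`.

```python
from collections import Counter


def corpus_term_counts(bigram_docs):
    """Count how many times each bigram occurs across all documents."""
    counts = Counter()
    for doc in bigram_docs:
        counts.update(doc)
    return counts


# Toy data standing in for the real column of documents.
docs = [
    ["stored_procedure_", "procedure_slow_"],
    ["stored_procedure_", "slow_query_"],
]
counts = corpus_term_counts(docs)
raw_count = counts["stored_procedure_"]
```

This number depends only on the documents, so it is the reference value that any displayed term frequency can be compared against.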

I have another function that takes as input the dictionary and corpus objects, the column of documents as a list of lists of bigrams, and an integer range. For each number in the range (the number of topics), the function creates and saves an LDA model and computes the corresponding coherence score. I use the returned num_topics / coh_scores lists to plot the coherence score against the number of topics; the goal is to find an optimal number of topics for the input documents.
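The shape of that sweep might be sketched as follows. The trainer and scorer are passed in as callables so the loop runs without Gensim installed; `sweep_topic_counts`, `train_model`, `score_model`, and `seed` are hypothetical names, not the actual function from this question. The seed is included because LDA training is randomized, so two runs on identical data can yield different models unless it is fixed.

```python
def sweep_topic_counts(train_model, score_model, topic_counts, seed=42):
    """Train one model per topic count and collect a coherence score for each.

    With Gensim, train_model(k, seed) could wrap something like
        LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, random_state=seed)
    and score_model(model) could wrap
        CoherenceModel(model=model, texts=texts, dictionary=dictionary,
                       coherence="c_v").get_coherence()
    (both assumed here; the real function is not shown).
    """
    num_topics, coh_scores, models = [], [], {}
    for k in topic_counts:
        model = train_model(k, seed)  # same seed for every run
        models[k] = model
        num_topics.append(k)
        coh_scores.append(score_model(model))
    return num_topics, coh_scores, models


# Stand-in trainer and scorer so the sketch runs on its own.
def fake_train(k, seed):
    return {"num_topics": k, "seed": seed}


def fake_score(model):
    return 1.0 / model["num_topics"]


num_topics, coh_scores, models = sweep_topic_counts(fake_train, fake_score, range(2, 5))
```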

The problem is that each time I run this function on the exact same data, or use pyLDAvis to view the topics of one of the saved models, the term frequencies change. For example, for one of my bigrams, 'stored_procedure_', the built-in Gensim methods confirm every time that the number of occurrences is 98. However, the term frequency shown in the pyLDAvis visualization (the blue bar on the right-hand side, which is supposed to represent the total number of times the term occurs in the corpus) changes, which doesn't make sense because the corpus is the same and never changes. The term frequency also changes when I visualize different models saved by my function; i.e. model6 (the model created with 6 topics) shows a different term frequency than model8 (the model with 8 topics). This doesn't add up to me, as I am using the same corpus.
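One plausible explanation, stated here as an assumption to verify against the pyLDAvis source rather than a confirmed diagnosis: if the displayed bar is reconstructed from the fitted model (each topic's term probability multiplied by the estimated number of tokens in that topic, summed over topics) instead of being read from the raw corpus counts, then it would vary with the model even when the corpus is fixed. The sketch below uses made-up probabilities and topic sizes purely to illustrate that effect.

```python
def estimated_term_frequency(term, topic_term_probs, topic_token_totals):
    """Sum over topics of P(term | topic) * estimated tokens in that topic."""
    return sum(
        probs.get(term, 0.0) * total
        for probs, total in zip(topic_term_probs, topic_token_totals)
    )


TERM = "stored_procedure_"

# Two invented models of the same 1000-token corpus (all numbers are made up).
model_a_probs = [{TERM: 0.15}, {TERM: 0.02}]
model_a_totals = [600.0, 400.0]

model_b_probs = [{TERM: 0.22}, {TERM: 0.05}, {TERM: 0.01}]
model_b_totals = [400.0, 300.0, 300.0]

freq_a = estimated_term_frequency(TERM, model_a_probs, model_a_totals)
freq_b = estimated_term_frequency(TERM, model_b_probs, model_b_totals)
print(freq_a, freq_b)
```

Under this assumption, the raw count stays fixed while the model-derived estimate moves with the number of topics and with the random initialization of each training run.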

Please help. Why does the term frequency change? Screenshots below.

output of cleaning function

output of bigram function

Vis for model with 6 topics (note term frequency on right of 'stored_procedure_')

Vis for model with 10 topics (note differing term frequency for 'stored_procedure_')
