I have some texts, and I'm using sklearn's LatentDirichletAllocation algorithm to extract the topics from them. I have already converted the texts into sequences using Keras, and I'm doing this:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation()
X_topics = lda.fit_transform(X)
print(X)
# array([[0, 988, 233, 21, 42, 5436, ...],
#        [0, 43, 6526, 21, 566, 762, 12, ...]])
print(X_topics)
# array([[1.24143852e-05, 1.23983890e-05, 1.24238815e-05, 2.08399432e-01,
#         7.91563331e-01],
#        [5.64976371e-01, 1.33304549e-05, 5.60003133e-03, 1.06638803e-01,
#         3.22771464e-01]])
My question is: what exactly is returned from fit_transform? I know it should be the main topics detected from the texts, but I cannot map those numbers to an index, so I'm not able to see what those sequences mean. I failed to find an explanation of what is actually happening, so any suggestion will be much appreciated.
First, a general explanation: think of LDiA as a clustering algorithm that is going to determine, by default, 10 centroids based on the frequencies of words in the texts, and it's going to put greater weight on some of those words than on others by virtue of their proximity to the centroid. Each centroid represents a 'topic' in this context, where the topic is unnamed but can be loosely described by the words that are most dominant in forming each cluster.
So generally what you're doing with LDA is:

1. fitting the model to a corpus to discover a set of topics and the words that most strongly characterize each one, or
2. transforming new texts with the fitted model to score them against each of those topics.
For the second scenario, your expectation is that LDiA will output the "score" of the new text for each of the 10 clusters/topics. The index of the highest score is the index of the cluster/topic to which that new text belongs.
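To make that concrete with the X_topics output from your question (five topics in that printout), here is a minimal sketch of how to read those rows:

import numpy as np

# The two rows printed in the question: one row per document, one column per topic
X_topics = np.array([
    [1.24143852e-05, 1.23983890e-05, 1.24238815e-05, 2.08399432e-01, 7.91563331e-01],
    [5.64976371e-01, 1.33304549e-05, 5.60003133e-03, 1.06638803e-01, 3.22771464e-01],
])

print(X_topics.sum(axis=1))     # each row sums to ~1.0: a probability distribution over topics
print(X_topics.argmax(axis=1))  # [4 0]: document 0 is mostly topic 4, document 1 mostly topic 0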
I prefer gensim.models.LdaMulticore, but since you've used sklearn.decomposition.LatentDirichletAllocation, I'll use that.
Here's some sample code (drawn from here) that runs through this process:
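Since the linked code isn't reproduced here, below is a minimal sketch of the same process; the 20 newsgroups corpus, the parameter values, and the variable names are my own assumptions:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# A small text corpus to fit on (headers/footers/quotes stripped so topics come from body text)
docs = fetch_20newsgroups(remove=('headers', 'footers', 'quotes')).data[:2000]

# LDiA expects raw term counts (a bag-of-words matrix), not Keras integer sequences
vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
X_counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=10, random_state=0)  # 10 topics, the sklearn default
X_topics = lda.fit_transform(X_counts)  # shape (n_documents, 10): per-document topic weights

# lda.components_ has shape (10, vocabulary_size): per-topic word weights.
# The highest-weighted words give a rough 'description' of each unnamed topic.
words = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print(f"Topic {topic_idx}:", ' '.join(words[i] for i in topic.argsort()[:-11:-1]))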
Now let's try a new text:
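Continuing the sketch above, a hypothetical new document (the example sentence is made up to match the education example discussed next) is scored against the fitted topics like this:

# Reuse the fitted vocabulary and model on an unseen text
new_counts = vectorizer.transform(["The students studied hard for their university exams."])
scores = lda.transform(new_counts)  # shape (1, 10); the row sums to ~1

print(scores.round(3))
print("Best-matching topic:", scores.argmax(axis=1)[0])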
Seems sensible - the sample text is about education, and the dominant words for its best-matching topic are about education.
The pictures below are from another dataset (ham vs. spam SMS messages, so only two possible topics), which I reduced to 3 dimensions with PCA; in case a picture helps, these two views (the same data from different angles) might give a general sense of what's going on with LDiA. (The graphs are from Linear Discriminant Analysis vs. LDiA, but the representation is still relevant.)
While LDiA is an unsupervised method, to actually use it in a business context you'll likely want to intervene manually, at minimum to give the topics names that are meaningful in your context - e.g. assigning a subject area to stories on a news aggregation site by choosing amongst ['Business', 'Sports', 'Entertainment', ...], as sketched below.
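One lightweight way to attach those hand-chosen names is a plain dict keyed by topic index; the labels here are purely hypothetical and would be picked after inspecting each topic's dominant words:

# Hypothetical labels assigned after manually inspecting each topic's top words
topic_names = {0: 'Business', 1: 'Sports', 2: 'Entertainment', 3: 'Education'}

def label_document(topic_scores):
    """Return a human-readable name for the highest-scoring topic."""
    idx = max(range(len(topic_scores)), key=lambda i: topic_scores[i])
    return topic_names.get(idx, f"Topic {idx}")

print(label_document([0.02, 0.05, 0.13, 0.80]))  # -> 'Education'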
For further study, perhaps run through something like this: https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24