How to find most similar terms/words of a document in doc2vec?


I have applied Doc2vec to convert documents into vectors. After that, I used the vectors in clustering and figured out the 5 nearest/most similar documents to the centroid of each cluster. Now I need to find the most dominant or important terms of these documents so that I can figure out the characteristics of each cluster. My question is: is there any way to figure out the most dominant or similar terms/words of a document in Doc2vec? I am using Python's gensim package for the Doc2vec implementation.


There are 2 answers

TrnKh (BEST ANSWER)

To find the most dominant words of your clusters, you can use either of these two classic approaches. I personally found the second one very efficient and effective for this purpose.

  • Latent Dirichlet Allocation (LDA): a topic-modelling algorithm that will give you a set of topics given a collection of documents. You can treat the set of similar documents in each cluster as one document and apply LDA to generate the topics and see the topic distributions across documents (see the first sketch after this list).

  • TF-IDF: TF-IDF calculates the importance of a word to a document given a collection of documents. Therefore, to find the most important keywords/n-grams, you can calculate TF-IDF for every word that appears in the documents. The words with the highest TF-IDF are then your keywords. So:

    • calculate IDF for every single word that appears in the documents, based on the number of documents that contain that word
    • concatenate the text of the similar documents (I'd call it a super-document) and then calculate TF for each word that appears in this super-document
    • calculate TF*IDF for every word... and then TA DAAA... you have your keywords associated with each cluster (see the second sketch after this list).

    Take a look at Section 5.1 here for more details on the use of TF-IDF.
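A minimal LDA sketch with gensim, assuming each cluster's documents have already been tokenized (the corpus below is a toy stand-in for your data):

from gensim import corpora
from gensim.models import LdaModel

# toy stand-ins for the documents in one cluster
cluster_docs = [["cats", "purr", "and", "sleep"],
                ["dogs", "bark", "and", "fetch"],
                ["cats", "and", "dogs", "are", "pets"]]

dictionary = corpora.Dictionary(cluster_docs)                    # word <-> id mapping
bow_corpus = [dictionary.doc2bow(doc) for doc in cluster_docs]   # bag-of-words vectors

lda = LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.show_topics(num_topics=2, num_words=5, formatted=False):
    print(topic_id, [word for word, prob in words])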
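And a TF-IDF sketch using scikit-learn's TfidfVectorizer, treating each cluster's concatenated text as one super-document. Note this computes IDF across the super-documents themselves, a common shortcut that approximates the recipe above (the texts are toy stand-ins):

from sklearn.feature_extraction.text import TfidfVectorizer

# one concatenated "super-document" per cluster (toy stand-ins)
super_docs = ["cats purr cats sleep cats and dogs are pets",
              "dogs bark dogs fetch dogs are loyal"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(super_docs)   # rows: super-docs, columns: words
terms = vectorizer.get_feature_names_out()

for i, row in enumerate(tfidf.toarray()):
    top = row.argsort()[::-1][:5]              # indices of the 5 highest TF-IDF scores
    print("cluster", i, ":", [terms[j] for j in top])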

gojomo

@TrnKh's answer is good, but there is an additional option made available when using Doc2Vec.

Some gensim Doc2Vec training modes (either the default PV-DM, dm=1, or PV-DBOW with added word-training, dm=0, dbow_words=1) train both doc-vectors and word-vectors into the same coordinate space, and to some extent that means doc-vectors are near related word-vectors, and vice-versa.
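For example, a sketch of training in the second mode (the corpus and parameter values here are purely illustrative):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# toy corpus; each tag becomes a doc-vector key
docs = [TaggedDocument(words=["machine", "learning", "rocks"], tags=["doc0"]),
        TaggedDocument(words=["deep", "nets", "learn", "features"], tags=["doc1"])]

# PV-DBOW plus interleaved skip-gram word training, so word-vectors and
# doc-vectors land in the same coordinate space
model = Doc2Vec(docs, dm=0, dbow_words=1, vector_size=50, min_count=1, epochs=40)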

So you could take an individual document's vector, or the average/centroid vectors you've synthesized, and feed it to the model to find most_similar() words. (To be clear that this is a raw vector, rather than a list of vector-keys, you should use the form of most_similar() that specifies an explicit list of positive examples.)

For example:

docvec = d2v_model.docvecs['doc77145']  # assuming such a doc-tag exists
similar_words = d2v_model.most_similar(positive=[docvec])
print(similar_words)
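If you're on gensim 4.x, the same lookup is spelled slightly differently; a sketch assuming the same model and doc-tag:

docvec = d2v_model.dv['doc77145']  # docvecs was renamed dv in gensim 4.0
similar_words = d2v_model.wv.most_similar(positive=[docvec])  # word lookup moved to .wv
print(similar_words)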