How to find most similar terms/words of a document in doc2vec?

Question

3.3k views Asked by pankaj jha At 05 September 2017 at 05:23

I have applied Doc2vec to convert documents into vectors. After that, I used the vectors in clustering and found the 5 nearest/most similar documents to the centroid of each cluster. Now I need to find the most dominant or important terms of these documents so that I can figure out the characteristics of each cluster. My question is: is there any way to find the most dominant or similar terms/words of a document in Doc2vec? I am using Python's gensim package for the Doc2vec implementation.

There are 2 answers

gojomo On 05 September 2017 at 17:41

@TrnKh's answer is good, but there is an additional option made available when using Doc2Vec.

Some gensim Doc2Vec training modes, either the default PV-DM (dm=1) or PV-DBOW with added word-training (dm=0, dbow_words=1), train both doc-vectors and word-vectors into the same coordinate space. To some extent, that means doc-vectors are near related word-vectors, and vice versa.

So you could take an individual document's vector, or an average/centroid vector you've synthesized, and feed it to the model to find most_similar() words. (To make clear that you are passing a raw vector, rather than a list of vector-keys, use the form of most_similar() that specifies an explicit list of positive examples.)

For example:

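# assuming such a doc-tag exists; `docvecs` is named `dv` in gensim 4.x
docvec = d2v_model.docvecs['doc77145']
# query the word-vectors (`wv`) with the raw doc-vector as the only positive example
similar_words = d2v_model.wv.most_similar(positive=[docvec])
print(similar_words)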
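
To illustrate what such a lookup amounts to, here is a minimal pure-Python sketch (not gensim's implementation): it averages a few doc-vectors into a centroid and ranks words by cosine similarity to it. The toy 3-dimensional vectors and the names `doc_vectors`, `word_vectors`, `centroid`, `cosine`, and `most_similar_words` are made up for illustration; with a real model the vectors would come from a Doc2Vec model trained in one of the modes above.

```python
import math

# Hypothetical toy vectors; in practice these come from a trained model
# in which doc-vectors and word-vectors share one coordinate space.
doc_vectors = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.8, 0.2, 0.1],
}
word_vectors = {
    "football": [1.0, 0.0, 0.0],
    "election": [0.0, 1.0, 0.0],
    "recipe": [0.0, 0.0, 1.0],
}


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar_words(vector, word_vectors, topn=2):
    """Rank words by cosine similarity to a raw vector, best first."""
    scored = [(word, cosine(vector, wv)) for word, wv in word_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Average the doc-vectors of one cluster, then look up the nearest words.
cluster_centroid = centroid(list(doc_vectors.values()))
top_words = most_similar_words(cluster_centroid, word_vectors)
print(top_words)
```

How meaningful the nearest words are depends on the training mode and the data, so treat the result as a hint about a cluster rather than a definitive label.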

**TrnKh** · Accepted Answer · 2017-09-05T16:14:15+00:00

To find the most dominant words of your clusters, you can use either of these two classic approaches. I personally found the second one very efficient and effective for this purpose.

Latent Dirichlet Allocation (LDA): A topic-modelling algorithm that gives you a set of topics for a collection of documents. You can treat the set of similar documents in each cluster as one document, apply LDA to generate the topics, and see the topic distributions across documents.
TF-IDF: TF-IDF calculates the importance of a word to a document given a collection of documents. Therefore, to find the most important keywords/n-grams, you can calculate TF-IDF for every word that appears in the documents. The words with the highest TF-IDF are then your keywords. So:
- calculate IDF for every single word that appears in the documents, based on the number of documents that contain that word
- concatenate the text of the similar documents (I'd call it a super-document) and then calculate TF for each word that appears in this super-document
- calculate TF*IDF for every word... and then TA DAAA... you have your keywords associated with each cluster.
Take a look at Section 5.1 here for more details on the use of TF-IDF.
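
The TF-IDF steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the toy tokenized corpus, the `clusters` layout, the function name `cluster_keywords`, and the specific weighting (relative term frequency times `log(N / df)`) are choices made here, not something prescribed by the answer. Other TF-IDF variants would work too.

```python
import math
from collections import Counter

# Hypothetical toy corpus: tokenized documents grouped by cluster id.
clusters = {
    0: [["goal", "goal", "match", "team"], ["goal", "team", "coach"]],
    1: [["vote", "party", "team"], ["vote", "vote", "election"]],
}


def cluster_keywords(clusters, topn=3):
    """Return the topn (word, score) pairs per cluster, ranked by TF*IDF."""
    all_docs = [doc for docs in clusters.values() for doc in docs]
    n_docs = len(all_docs)

    # Step 1: IDF from the number of documents that contain each word.
    doc_freq = Counter(word for doc in all_docs for word in set(doc))
    idf = {word: math.log(n_docs / df) for word, df in doc_freq.items()}

    keywords = {}
    for cluster_id, docs in clusters.items():
        # Step 2: concatenate the cluster's documents into a super-document
        # and compute the relative term frequency of each word in it.
        super_doc = [word for doc in docs for word in doc]
        tf = Counter(super_doc)

        # Step 3: TF*IDF per word; ties are broken alphabetically.
        scores = {w: (count / len(super_doc)) * idf[w] for w, count in tf.items()}
        ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
        keywords[cluster_id] = ranked[:topn]
    return keywords


keywords = cluster_keywords(clusters)
for cluster_id, ranked in keywords.items():
    print(cluster_id, [word for word, _ in ranked])
```

In this toy data, a word such as "team" that appears in both clusters gets a low IDF and so drops out of the top keywords, which is the behaviour you want when characterising clusters.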
