How to interpret cluster results after using Doc2Vec?

Question

980 views Asked by pankaj jha At 28 August 2017 at 11:31

I am using doc2vec to convert the top 100 tweets of my followers into vector representations (say v1, ..., v100). I then run K-Means clustering on those vectors.

model = Doc2Vec(documents=t, size=100, alpha=.035, window=10, workers=4, min_count=2)

I can see that cluster 0 is dominated by some values (say v10, v12, v23, ...). My question is: what do v10, v12, etc. represent? Can I deduce that these specific columns correspond to specific keywords of the documents?

There are 3 answers

**Has QUIT--Anony-Mousse** · Answer 1 · 2017-08-28T18:28:51+00:00

Don't interpret the individual variables. Because of the way these embeddings are trained, the dimensions should only be analyzed together.

For a start:

- Find the document vectors most similar to your centroid to see typical cluster members.
- Find the term vectors in the embedding most similar to your centroid to get typical words that describe the cluster.
- Note the distances to see how good your fit is.
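The steps above could be sketched roughly as follows. This is only an illustration in plain Python: the toy vectors, the names `tweet_1`, `football`, and so on, and the helper functions `cosine`, `centroid`, and `nearest` are all made up for the example. In practice the document and word vectors would come from your trained Doc2Vec model, and whether the word vectors are comparable to the document vectors depends on how the model was trained.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def nearest(target, named_vectors, topn=3):
    """Return the topn (name, similarity) pairs closest to target."""
    scored = [(name, cosine(target, vec)) for name, vec in named_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy data standing in for real Doc2Vec output.
doc_vectors = {
    "tweet_1": [0.9, 0.1, 0.0],
    "tweet_2": [0.8, 0.2, 0.1],
    "tweet_3": [0.0, 0.1, 0.9],
}
word_vectors = {
    "football": [1.0, 0.0, 0.0],
    "election": [0.0, 0.0, 1.0],
}

# Suppose K-Means put tweet_1 and tweet_2 into the same cluster.
cluster_members = ["tweet_1", "tweet_2"]
center = centroid([doc_vectors[name] for name in cluster_members])

# Typical members, descriptive words, and their similarities to the centroid.
typical_docs = nearest(center, doc_vectors, topn=2)
typical_words = nearest(center, word_vectors, topn=1)
print(typical_docs)
print(typical_words)
```

The similarity scores returned alongside each name play the role of the "distances" mentioned above: if even the closest documents score low against the centroid, the cluster is probably not very coherent.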

**Devaraj Phukan** · Answer 2 · 2017-08-28T12:28:21+00:00

The clusters themselves do not mean anything specific. You can have as many clusters as you want, and all the clustering algorithm does is try to distribute your vectors among them. If you know the tweets and how many topics you want them separated into, try to clean them or add features so that the clustering algorithm can use those to separate them into the clusters of your choice.

Also, if you meant topic modeling, that is different from clustering, and you should look that up as well.

**Gambit1614** · Answer 3 · 2017-08-28T12:34:09+00:00

These values represent the coordinates of the individual tweets (or documents) that you want to cluster. I am assuming that v1 to v100 are the vectors for tweets 1 to 100; otherwise this won't make sense. So if cluster 0 contains v1, v5 and v6, this means that tweets 1, 5 and 6, whose vector representations are v1, v5 and v6 respectively, belong to cluster 0.
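To make that reading concrete, here is a minimal sketch that groups document indices by their cluster label. The `labels` list is invented for the example; it stands in for a per-document label array such as the `labels_` attribute of a fitted scikit-learn `KMeans`, and `members_by_cluster` is a hypothetical helper name.

```python
from collections import defaultdict

def members_by_cluster(labels):
    """Map each cluster label to the indices of the documents assigned to it."""
    groups = defaultdict(list)
    for doc_index, label in enumerate(labels):
        groups[label].append(doc_index)
    return dict(groups)

# Hypothetical labels: position i holds the cluster assigned to document i.
labels = [0, 1, 1, 0, 2, 0]
clusters = members_by_cluster(labels)
print(clusters)
```

Each index in a group points back to one tweet, so you can read the tweets in a cluster side by side instead of inspecting vector components.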
