How can we check whether t-SNE results are real when we cluster data?


I am applying t-SNE for dimensionality reduction. I have several features that I reduce to 2 features. Afterwards, I use k-means to cluster the data. Finally, I use seaborn to plot the clustering results.

To import TSNE I use:

from sklearn.manifold import TSNE

To apply t-SNE I use:

features_tsne_32 = TSNE(n_components=2).fit_transform(standarized_data)

After that I use k-means:

kmeans = KMeans(n_clusters=6, **kmeans_kwargs)
kmeans.fit(features_tsne_32)
km_tsne_32 = kmeans.predict(features_tsne_32)

Finally, I produce the plot using:

import seaborn as sns

# plot data with seaborn
facet = sns.lmplot(data=df, x='km_tsne_32_c1', y='km_tsne_32_c2', hue='km_tsne_32',
                   fit_reg=False, legend=True, legend_out=True)
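
(For reference, df above is assumed to be assembled from the t-SNE output and the k-means labels roughly as follows; this is a sketch, since the original construction isn't shown and only the column names come from the plot call.)

import pandas as pd

# assumed construction of df: the two t-SNE components plus the k-means labels
df = pd.DataFrame(features_tsne_32, columns=['km_tsne_32_c1', 'km_tsne_32_c2'])
df['km_tsne_32'] = km_tsne_32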

I have this plot:

[image: scatter plot of the two t-SNE features colored by the six k-means clusters, forming a single near-circular blob split into wedge-shaped regions]

This plot seems too perfect and globular. Is something wrong with the procedure I follow to plot this data, in the code described above?


There are 3 answers

kyriakosSt (best answer)

Your problem is not specific to t-SNE but applies to any unsupervised learning algorithm: how do you evaluate its results?

I would say that the only proper way to do this is with some prior or expert knowledge of the data: something like labels, other metadata, or even user feedback.
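
For example, with scikit-learn, even a handful of known labels lets you score the clustering externally, and the silhouette score gives a label-free sanity check (a sketch; true_labels is an assumed variable, not from the question):

from sklearn.metrics import adjusted_rand_score, silhouette_score

# external validation: agreement between clusters and known labels (1.0 = perfect)
print(adjusted_rand_score(true_labels, km_tsne_32))

# internal validation, no labels needed: cluster cohesion vs. separation, in [-1, 1]
print(silhouette_score(features_tsne_32, km_tsne_32))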


That being said, regarding your specific plot:

  1. The fact that you get a continuous "pie" rather than some discrete structure like "islands" or "spaghetti" from t-SNE likely indicates that the projection is not very well learned. Usually t-SNE is supposed to create semi-distinct groups of similar data points. This shape looks like an over-regularized model (like a VAE with a high KL-divergence coefficient).
  2. k-means produces exactly the partitioning one would expect: the cluster assignment of k-means implicitly defines a Voronoi diagram over the feature space, with the cluster centroids as the cell seeds (as sketched below). A good initialization spreads the initial centroids out over the feature space, and since your projected space is symmetric, the centroids will probably be symmetric as well.
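
A minimal sketch of that Voronoi view, assuming matplotlib and scipy are available (variable names follow the question's code):

import matplotlib.pyplot as plt
from scipy.spatial import Voronoi, voronoi_plot_2d

# the fitted k-means centroids are the seeds of the Voronoi cells
vor = Voronoi(kmeans.cluster_centers_)

fig, ax = plt.subplots()
voronoi_plot_2d(vor, ax=ax, show_vertices=False)
ax.scatter(features_tsne_32[:, 0], features_tsne_32[:, 1], c=km_tsne_32, s=5, alpha=0.3)
plt.show()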

So k-means is fine, but you probably need to tweak the parameters of t-SNE.
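
As a starting point, one might sweep a few perplexity values and keep the run whose embedding shows distinct islands rather than one disc (a sketch; the values chosen are arbitrary):

from sklearn.manifold import TSNE

# try several perplexities and compare the resulting structure visually
embeddings = {}
for perplexity in (5, 30, 50, 100):
    embeddings[perplexity] = TSNE(
        n_components=2, perplexity=perplexity, random_state=0
    ).fit_transform(standarized_data)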

Gulzar

Is something wrong with the procedure I follow?

Yes.

Using t-SNE projects the data onto another space, over which you have no real control.
The projection is supposed to keep close points close, and far points far.

You then use k-means on the projected space to determine the groups.
This part loses any grouping information you previously had [citation needed, need to see what the data was beforehand]!

It would make much more sense to color the groups according to some prior labeled data, not according to k-means,
-OR-
to run k-means on the original space for grouping, and then to color the projected space according to that grouping.

What you did is, in fact, meaningless, as it loses all prior information: labels and spatial structure.


To conclude:

  1. If you have labels, use them.
  2. If you don't, use a more sophisticated clustering approach, starting with k-means on the original space; as you can see, k-means on the projected space is not enough. See the sketch after this list.
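
A sketch of that second option, clustering in the original feature space and using t-SNE only for display (k-means stands in for the clustering step; any clusterer would do):

import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# cluster the original standardized features, not the projection
labels_orig = KMeans(n_clusters=6, random_state=0).fit_predict(standarized_data)

# project to 2-D purely for visualization, colored by the original-space clusters
proj = TSNE(n_components=2, random_state=0).fit_transform(standarized_data)
sns.scatterplot(x=proj[:, 0], y=proj[:, 1], hue=labels_orig)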
James LI

Check the perplexity of the t-SNE algorithm: t-SNE often produces disc-like blobs when the perplexity is too small. Also, test with the DBSCAN clustering algorithm, which often works better than k-means.
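
A sketch of that suggestion (eps and min_samples below are placeholders that need tuning for your data):

from sklearn.cluster import DBSCAN

# density-based clustering; the label -1 marks points treated as noise
labels_db = DBSCAN(eps=0.5, min_samples=10).fit_predict(standarized_data)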