Different silhouette scores for the same data and number of clusters

I would like to choose an optimal number of clusters for my dataset using the silhouette score. My data set contains information about 2,000+ brands, including the number of customers who purchased each brand, the brand's sales, and the number of goods the brand sells in each category.

Since my data set is quite sparse, I've used MaxAbsScaler and TruncatedSVD before clustering.
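For readers following along, the preprocessing described above might look roughly like the sketch below. The data here is a random sparse matrix standing in for the brand table, and the sizes, `n_components=10` and `n_clusters=5` are illustrative assumptions, not values from the question.

```python
import numpy as np
from scipy import sparse
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

# Synthetic stand-in for the sparse brand-by-feature matrix (assumption).
X = sparse.random(300, 50, density=0.05, format="csr", random_state=0)

reducer = make_pipeline(
    MaxAbsScaler(),  # scales each column by its max absolute value, keeps sparsity
    TruncatedSVD(n_components=10, random_state=0),  # accepts sparse input directly
)
X_reduced = reducer.fit_transform(X)

# Cluster the dense, reduced representation.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_reduced)
print(X_reduced.shape, np.unique(labels).size)
```

Both steps work on sparse input without densifying it, which is presumably why they were chosen for this kind of data.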

The clustering method I use is k-means, since it's the one I'm most familiar with (I would appreciate suggestions on other clustering methods).

When I set the number of clusters to 80 and run k-means, I get a different silhouette score each time. Is it because k-means gives different clusters each time? Sometimes the silhouette score for 80 clusters is lower than the score for 200 clusters, and sometimes it's the opposite. So I'm confused about how to choose a reasonable number of clusters.

Besides, my silhouette scores fall in a narrow range, from 0.15 to 0.2, and don't change much as I increase the number of clusters.

Here are the results I got from computing the silhouette score:

For n_clusters=80, The Silhouette Coefficient is 0.17329035592930178
For n_clusters=100, The Silhouette Coefficient is 0.16970208098407866
For n_clusters=200, The Silhouette Coefficient is 0.1961679920561574
For n_clusters=300, The Silhouette Coefficient is 0.19367019831221857
For n_clusters=400, The Silhouette Coefficient is 0.19818865972762675
For n_clusters=500, The Silhouette Coefficient is 0.19551544844885604
For n_clusters=600, The Silhouette Coefficient is 0.19611760638136203
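The question does not show the code behind this output, so the following is only a guess at a loop that would print lines in the same form. It uses `make_blobs` toy data and a small range of k as assumptions, with `random_state` fixed so the numbers do not change between runs.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data standing in for the reduced brand matrix (assumption).
X, _ = make_blobs(n_samples=500, centers=4, n_features=5, random_state=0)

scores = {}
for k in (2, 3, 4, 5, 6):
    # Fixing random_state makes each k give the same clustering on every run.
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"For n_clusters={k}, The Silhouette Coefficient is {scores[k]}")
```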

I would much appreciate your suggestions! Thanks in advance!

There are 2 answers

Has QUIT--Anony-Mousse On

Yes, k-means is randomized, so it doesn't always give the same result.

If the results differ a lot between runs, that usually means this k is NOT good.
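One way to check how much the score depends on the random initialization is to rerun k-means with several seeds and look at the spread. This is a minimal sketch on toy data; the helper name `silhouette_spread`, the seed range and the two k values are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data (assumption); replace with your own reduced matrix.
X, _ = make_blobs(n_samples=500, centers=4, n_features=5, random_state=0)

def silhouette_spread(X, k, seeds=range(10)):
    """Return (mean, std) of the silhouette score over several k-means seeds."""
    scores = [
        silhouette_score(
            X, KMeans(n_clusters=k, n_init=1, random_state=s).fit_predict(X)
        )
        for s in seeds  # n_init=1 so each seed is a single random start
    ]
    return float(np.mean(scores)), float(np.std(scores))

for k in (4, 20):
    mean, std = silhouette_spread(X, k)
    print(f"k={k}: mean={mean:.3f}, std={std:.3f}")
```

If the standard deviation for a given k is comparable to the differences between k values, a single run cannot rank those k values.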

But don't blindly rely on silhouette. It's not reliable enough to find the "best" k, largely because there usually is no best k at all.

Look at the data, and use your understanding to choose a good clustering instead. Don't expect anything good to come out automatically.

Mr K. On

I think you are using sklearn, so setting the random_state parameter to a fixed number should give you reproducible results across different executions of k-means for the same k. You can set it to 0, 42, or whatever you want; just keep the same number across runs of your code and the results will be the same.
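A quick way to convince yourself, assuming sklearn's `KMeans` and some toy data from `make_blobs` (the `run` helper is just for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data (assumption); any fixed array works the same way.
X, _ = make_blobs(n_samples=300, centers=5, random_state=1)

def run(seed):
    # Same data + same random_state -> same initialization and same labels.
    return KMeans(n_clusters=5, n_init=10, random_state=seed).fit_predict(X)

same = np.array_equal(run(42), run(42))
print(same)
```

Note that this only makes the result repeatable. It does not make that particular clustering any better than the ones other seeds would give.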