Silhouette value increasing as the number of clusters increases


I have a matrix in which the rows are brands and the columns are the features of each brand.

First, I calculate the affinity matrix with scikit-learn, and then apply spectral clustering to the affinity matrix to do the clustering.

When I calculate the silhouette value for each number of clusters, the silhouette value keeps increasing as the number of clusters increases. Eventually, when the number of clusters gets large enough, the silhouette calculation returns NaN.

# coding: utf-8
import pandas as pd

import sklearn.cluster as sk
from sklearn.cluster import SpectralClustering
from sklearn.metrics import silhouette_score


# Load the brand/feature data (rows are brands, columns are features).
# pd.DataFrame.from_csv is deprecated; pd.read_csv is the current API.
data_event = pd.read_csv(r'\Data\data_of_events.csv', header=0, index_col=0,
                         parse_dates=True)

feature_columns = ['Furniture', 'Food & Drinks', 'Technology', 'Architecture',
                   'Show', 'Fashion', 'Travel', 'Art', 'Graphics', 'Product Design']
# .as_matrix() is deprecated; use .to_numpy() to get the feature matrix.
data_event_matrix = data_event[feature_columns].to_numpy()

# Compute the affinity matrix.
data_event_affinitymatrix = SpectralClustering().fit(data_event_matrix).affinity_matrix_

# Clustering: sweep the number of clusters and score each labeling.
for n_clusters in range(2, 100, 2):
    print(n_clusters)
    labels = sk.spectral_clustering(data_event_affinitymatrix, n_clusters=n_clusters,
                                    n_components=None, eigen_solver=None,
                                    random_state=None, n_init=10, eigen_tol=0.0,
                                    assign_labels='kmeans')

    silhouette_avg = silhouette_score(data_event_affinitymatrix, labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score of event clustering is :", silhouette_avg)

There is 1 answer

Answered by Gambit1614

If your intention is to find the optimal number of clusters, you can try the Elbow method. Multiple variations of this method exist, but the main idea is the same: for different values of K (the number of clusters), compute a cost function that is appropriate for your application (for example, the sum of squared distances of all points in a cluster to its centroid) over a range of K, say 1 to 8; any other error/cost/variance function works too. If your metric is such a distance function, you will notice that beyond a certain number of clusters the differences along the y-axis become negligible. Plot the number of clusters along the x-axis and your metric along the y-axis, and choose the value of K at the point where the curve bends abruptly.
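For instance, here is a minimal sketch of the idea, assuming k-means and scikit-learn's inertia_ attribute (the within-cluster sum of squared distances) as the cost function; the random X is just a stand-in for your own brand/feature matrix:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(200, 10)  # stand-in for your brand/feature matrix

# Run k-means for each K and record the cost (within-cluster sum of squares).
k_values = range(1, 9)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in k_values]

# Plot K on the x-axis and the cost on the y-axis; the 'elbow' where the
# curve bends sharply is the candidate number of clusters.
plt.plot(list(k_values), inertias, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('Sum of squared distances')
plt.show()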

You can see in this example of the Elbow method that the optimal value of 'K' is 4.
(Image source: Wikipedia.)

Another measure that you can use to validate your clusters is the V-measure score. It is a symmetric measure, defined as the harmonic mean of homogeneity and completeness. Here is an example in scikit-learn for your reference.

EDIT: V-measure is basically used to compare two different cluster assignments to each other.
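A minimal sketch, with hypothetical toy labelings standing in for two of your clustering runs:

from sklearn.metrics import v_measure_score

# Two label assignments for the same six points; the partitions are
# identical up to renaming of the cluster labels.
labels_a = [0, 0, 1, 1, 2, 2]
labels_b = [1, 1, 0, 0, 2, 2]

# V-measure is symmetric: swapping the arguments gives the same score.
print(v_measure_score(labels_a, labels_b))  # 1.0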

Finally, if you are interested, you can take a look at the Normalized Mutual Information score to validate your results as well.
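It has the same call signature; here is a minimal sketch reusing the hypothetical labelings from above:

from sklearn.metrics import normalized_mutual_info_score

# NMI is also symmetric and invariant to permutations of the label values.
labels_a = [0, 0, 1, 1, 2, 2]
labels_b = [1, 1, 0, 0, 2, 2]

print(normalized_mutual_info_score(labels_a, labels_b))  # 1.0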


Update: I recently came across this Self-Tuning Spectral Clustering paper. You can give it a try.