Varying cluster labels in DBSCAN


I am using DBSCAN from sklearn in Python to cluster some data points, with a precomputed distance matrix as input.

import sklearn.cluster as cl
C = cl.DBSCAN(eps=2, metric='precomputed', min_samples=2)
db = C.fit(Dist_Matrix)

Dist_Matrix is the precomputed distance matrix. Each time I run my code, I get different cluster labels for the data points, and the number of clusters also varies. For example, in the first run the labels are

[ 2  3  3  0  3  0  2  2  2  4  2 -1  0  0  0  1  4  0  1  0  1  3  0  3  0
0  1 -1  0  3  1  3  0  0  2  0  2  0 -1  0  0  3  0  0  0  1  0  1  0  0]

In another run, they are

[ 0  2  2  1  2  1  0  0  0  3  0 -1  1  1  1  0  3  1  0  1  0  2  1  2  1
1  0 -1  1  2  0  2  1  1  0  1  0  1 -1  1  1  2  1  1  1  0  1  0  1  1]

How can I resolve this? Please help


There are 2 answers

Has QUIT--Anony-Mousse On

Clustering will usually not assign the same labels from run to run, because the label itself is meaningless. The only valuable information is which objects go together.
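To compare two runs by "which objects go together" rather than by label value, one option is to check whether the labels of one run map one-to-one onto the labels of the other. The following is a minimal sketch in plain Python; the function name `same_grouping` and the toy label lists are made up for illustration, and it assumes -1 marks noise as in sklearn's DBSCAN.

```python
def same_grouping(labels_a, labels_b, noise=-1):
    """Return True if both labelings put the same objects together."""
    if len(labels_a) != len(labels_b):
        return False
    a_to_b, b_to_a = {}, {}
    for a, b in zip(labels_a, labels_b):
        # A point must be noise in both runs or in neither.
        if (a == noise) != (b == noise):
            return False
        # Each label in one run must correspond to exactly one label in the other.
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True


# Toy example (not the data from the question): same groups, permuted labels.
run_1 = [2, 3, 3, 0, -1]
run_2 = [0, 2, 2, 1, -1]
print(same_grouping(run_1, run_2))  # True
```

If this returns False for two real runs, the runs differ in more than label order, for example because two clusters were merged in one of them.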

As for sklearn, if you use an old version, it will (unnecessarily) randomly shuffle the data. So it's not surprising you get a random permutation of the labels.

Usually, if you require stable labels, you are doing something wrong!

But if you really know you need that, implement some simple logic: sort clusters by their smallest object, and relabel them accordingly. That is, the first object's cluster is cluster 0, the second object's cluster (unless it is the same) is cluster 1, and so forth.
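The relabeling described above could be sketched roughly as follows. This is plain Python; the function name `relabel_by_first_appearance` and the toy label lists are illustrative only, and it assumes that -1 marks noise (as in sklearn's DBSCAN) and should be left unchanged.

```python
def relabel_by_first_appearance(labels, noise=-1):
    """Renumber clusters 0, 1, 2, ... in the order they first appear."""
    mapping = {}   # old label -> new label
    result = []
    for label in labels:
        if label == noise:
            # Keep noise points as they are.
            result.append(noise)
            continue
        if label not in mapping:
            # First time this cluster is seen: give it the next free number.
            mapping[label] = len(mapping)
        result.append(mapping[label])
    return result


# Toy example: two runs with the same groups but permuted labels.
run_1 = [2, 3, 3, 0, -1, 0]
run_2 = [0, 2, 2, 1, -1, 1]
print(relabel_by_first_appearance(run_1))  # [0, 1, 1, 2, -1, 2]
print(relabel_by_first_appearance(run_2))  # [0, 1, 1, 2, -1, 2]
```

With the code from the question, this would be called as `relabel_by_first_appearance(db.labels_.tolist())`. The labels become stable only as long as the objects are passed in the same order and the grouping itself is unchanged.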

Dammio On

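You can use a custom function to normalize the cluster labels so that they start at 0. Note that this only shifts the labels (the noise label -1 becomes 0); it does not reorder them.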

import numpy as np

def normalize_cluster_labels(labels):
    # Shift all labels so that the smallest one (-1 for noise) becomes 0.
    labels = np.asarray(labels)
    min_value = labels.min()
    if min_value < 0:
        labels = labels - min_value
    return labels