Looking for ways of clustering data according to distance

I have a pandas dataframe defined as

               Alejandro Ana  Beatriz    Jose    Juan     Luz   Maria   Ruben                                                          
Alejandro        0.0     0.0   1000.0     0.0  1037.0  1014.0   100.0     0.0
Ana              0.0     0.0     15.0     0.0   100.0     0.0    16.0  1100.0
Beatriz       1000.0    15.0      0.0   100.0  1000.0  1100.0    15.0     0.0
Jose             0.0     0.0    100.0     0.0     0.0   100.0  1000.0    14.0
Juan          1037.0   100.0   1000.0     0.0     0.0  1014.0     0.0   100.0
Luz           1014.0     0.0   1100.0   100.0  1014.0     0.0     0.0     0.0
Maria          100.0    16.0     15.0  1000.0     0.0     0.0     0.0     0.0
Ruben            0.0  1100.0      0.0    14.0   100.0     0.0     0.0     0.0

This dataframe contains compatibility measurements. I want to group these people into groups of two or three people, for example [Alejandro, Ana, Beatriz], [Jose, Juan, Luz], [Maria, Ruben].

To do that, I have to maximize the compatibility inside each group.

Can someone please point me towards a methodology?

Accepted answer (Muhammed Yunus):

It looks like you are starting off with a distance matrix rather than the original sample values. AgglomerativeClustering can work directly with the distance matrix to group the samples into however many clusters you specify. Other algorithms that accept a precomputed distance matrix include DBSCAN, HDBSCAN, and OPTICS (SpectralClustering also accepts a precomputed matrix, but it expects affinities/similarities rather than distances).

In the code below, I ran AgglomerativeClustering on the data to get the cluster assigned to each name. Then, for visualisation, I represented the original distance matrix in 2D, and coloured the points by their assigned cluster.

[figure: 2D MDS projection of the distance matrix, points coloured by assigned cluster]

The data:

import pandas as pd

#Data: distance matrix
data = {
    'Alejandro': [0.0, 0.0, 1000.0, 0.0, 1037.0, 1014.0, 100.0, 0.0],
    'Ana': [0.0, 0.0, 15.0, 0.0, 100.0, 0.0, 16.0, 1100.0],
    'Beatriz': [1000.0, 15.0, 0.0, 100.0, 1000.0, 1100.0, 15.0, 0.0],
    'Jose': [0.0, 0.0, 100.0, 0.0, 0.0, 100.0, 1000.0, 14.0],
    'Juan': [1037.0, 100.0, 1000.0, 0.0, 0.0, 1014.0, 0.0, 100.0],
    'Luz': [1014.0, 0.0, 1100.0, 100.0, 1014.0, 0.0, 0.0, 0.0],
    'Maria': [100.0, 16.0, 15.0, 1000.0, 0.0, 0.0, 0.0, 0.0],
    'Ruben': [0.0, 1100.0, 0.0, 14.0, 100.0, 0.0, 0.0, 0.0]
}

df = pd.DataFrame(
    data,
    index=['Alejandro', 'Ana', 'Beatriz', 'Jose', 'Juan', 'Luz', 'Maria', 'Ruben']
)

df

Perform the clustering:

#
#Cluster based on distances
#
from sklearn.cluster import AgglomerativeClustering

n_clusters = 3
clusterer = AgglomerativeClustering(
    n_clusters=n_clusters, metric='precomputed', linkage='average'
)
label_foreach_name = clusterer.fit(df).labels_

#View results as a table
cluster_info = pd.DataFrame(
    {'name': df.columns,
     'cluster': label_foreach_name}
).set_index('cluster').sort_index()

display(
    cluster_info
)
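
As an aside, the other precomputed-metric algorithms mentioned above can consume the same matrix. A quick sketch with DBSCAN (the eps=200 threshold is just a guess here and would need tuning):

#Cluster the same distance matrix with DBSCAN
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=200, min_samples=2, metric='precomputed')
dbscan_labels = dbscan.fit(df).labels_  #-1 marks samples treated as noise
print(dbscan_labels)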

Visualise in a 2D coordinate space:

#
# Visualisation
# Represent in 2D, coloured by cluster
#

#Invert distance matrix to 2D coordinate space
from sklearn.manifold import MDS

mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
coords_2d = mds.fit_transform(df)

#Plot points, coloured by assigned cluster
import matplotlib.pyplot as plt
cluster_cmap = plt.get_cmap('Set1', n_clusters)

f, ax = plt.subplots(figsize=(4, 4))
ax.scatter(coords_2d[:, 0], coords_2d[:, 1],
           c=cluster_cmap(label_foreach_name),
           marker='s', s=100)
ax.set_xlabel('x')
ax.set_ylabel('y')
f.suptitle('Inversion of the distance matrix to 2D', fontsize=11)
ax.set_title('Samples coloured by cluster', fontsize=10)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)

#Add text annotations
for row_num, name in enumerate(df.columns):
    ax.annotate(name, xy=coords_2d[row_num], xytext=(10, -5), textcoords='offset pixels')

#Add legend
for colour, label in zip(cluster_cmap(range(n_clusters)),
                         [f'cluster {i}' for i in range(n_clusters)]):
    ax.scatter([], [], marker='s', s=50, c=colour, label=label)
f.legend(bbox_to_anchor=(1.3, 1))

Updated solution: the OP requires that cluster sizes be limited to 2 or 3, i.e. bounded between user-defined values. Initially I tried HDBSCAN, as it accepts a min and max specification for cluster sizes, but it failed on this small dataset (more info at the bottom).

My attempt below runs lots of KMeans trials to find a suitable clustering. It stops when it finds a clustering that contains no bad clusters (a bad cluster is one whose size doesn't meet the user-defined spec). Note that KMeans cannot take a precomputed distance matrix; here each row of the matrix is treated as a feature vector, which serves as a rough embedding of the samples. A downside of this approach is that the quality of the clustering might be poor or variable, as we are relying on random initialisations of KMeans.

[figure: cluster assignments from the updated solution]

from sklearn.cluster import KMeans
import numpy as np

#Specify the constraints
min_cluster_size = 2
max_cluster_size = 3
max_tries = 400

#Use constraints to determine acceptable cluster sizes
acceptable_cluster_sizes = np.arange(min_cluster_size, max_cluster_size + 1)

nclusters_to_try = np.arange(2, len(df) + 1)
# nclusters_to_try = np.arange(len(df) // min_cluster_size, 2, step=-1)

rand_state = np.random.RandomState(0)

labels = None
for n_clusters in nclusters_to_try:
    if labels is not None:
        break
    
    for attempt in range(max_tries):
        clusterer = KMeans(
            n_clusters=n_clusters,
            init='random',
            n_init=1,
            max_iter=300,
            random_state=rand_state
        ).fit(df)
        
        clusters, counts = np.unique(clusterer.labels_, return_counts=True)
        n_good_clusters = np.isin(counts, acceptable_cluster_sizes).sum()
        n_bad_clusters = n_clusters - n_good_clusters
        
        print(
            f'n_clusters: {n_clusters:>2d}',
            f'  attempt {attempt + 1:>4d}/{max_tries:>4d}',
            f'{n_bad_clusters:>2d} bad clusters',
            end='\r' if attempt + 1 < max_tries else '\n'
        )
        
        if n_bad_clusters == 0:
            labels = clusterer.labels_
            break

if labels is None:
    raise TimeoutError(f'No solution found when max_tries={max_tries}')

#View results as a table
cluster_info = pd.DataFrame(
    {'name': df.columns,
     'cluster': labels}
).set_index('cluster').sort_index()

display(
    cluster_info
)
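
Since we rely on random initialisations, the accepted clustering can vary between runs. One way to compare candidate clusterings is to score each size-valid labelling by its mean intra-cluster distance and keep the lowest. A quick sketch of such a scoring helper (mean_intra_cluster_distance is just an illustrative name), using df and labels from above:

#Score a clustering: average pairwise distance within clusters (lower is better)
import numpy as np

def mean_intra_cluster_distance(dist_matrix, labels):
    total, n_pairs = 0.0, 0
    for cluster in np.unique(labels):
        members = np.where(labels == cluster)[0]
        for pos, i in enumerate(members):
            for j in members[pos + 1:]:
                total += dist_matrix[i, j]
                n_pairs += 1
    return total / n_pairs if n_pairs else 0.0

print(f'mean intra-cluster distance: {mean_intra_cluster_distance(df.values, labels):.1f}')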


Failed, but worth trying on more data: Initially I tried HDBSCAN, as it takes both a min_cluster_size= and a max_cluster_size= parameter. However, it flagged all the samples as anomalies. You might have better luck with a larger dataset:

#Using HDBSCAN
from sklearn.cluster import HDBSCAN

min_cluster_size = 2
max_cluster_size = 3

clusterer = HDBSCAN(
    min_cluster_size=min_cluster_size,
    max_cluster_size=max_cluster_size,
    metric='precomputed',
)

label_foreach_name = clusterer.fit(df).labels_

#View results as a table
cluster_info = pd.DataFrame(
    {'name': df.columns,
     'cluster': label_foreach_name}
).set_index('cluster').sort_index()

cluster_info
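
Note that with metric='precomputed' HDBSCAN expects the square distance matrix itself, and a label of -1 marks a sample treated as noise; on this tiny dataset every sample comes back as -1, which is why the approach fails here.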