Spatial Clustering in Pandas DataFrame: Ensuring Diversity within Clusters


I have a pandas DataFrame. The columns latitude, longitude and floor give the spatial coordinates of people.

My data

import pandas as pd

data = {
    "latitude": [49.5659508, 49.568089, 49.5686342, 49.5687609, 49.5695834, 49.5706579, 49.5711228, 49.5716422, 49.5717749, 49.5619579, 49.5619579, 49.5628938, 49.5628938, 49.5630028, 49.5633175, 49.56397639999999, 49.566359, 49.56643220000001, 49.56643220000001, 49.5672061, 49.567729, 49.5677449, 49.5679685, 49.5679685, 49.5688543, 49.5690616, 49.5713705],
    "longitude": [10.9873409, 10.9894035, 10.9896749, 10.9887881, 10.9851579, 10.9853273, 10.9912959, 10.9910182, 10.9867083, 10.9995758, 10.9995758, 11.000319, 11.000319, 10.9990996, 10.9993819, 11.004145, 11.0003023, 10.9999593, 10.9999593, 10.9935709, 11.0011213, 10.9954016, 10.9982288, 10.9982288, 10.9975928, 10.9931367, 10.9939141],
    "floor": [1, 2, 3, 1, 4, 2, 1, 2, 3, 6, 6, 2, 2, 3, 2, 2, 4, 2, 2, 3, 2, 2, 2, 1, 1, 3, 2],
}

df = pd.DataFrame(data)

I want to group people who live close to each other into clusters. Each cluster must contain exactly 9 people; this is a hard requirement. To achieve this, I use a function called get_even_clusters(), which builds on the KMeans algorithm.

Clustering my data

import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment

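def get_even_clusters(X, cluster_size):
    # Number of clusters needed so that each holds at most cluster_size points
    n_clusters = int(np.ceil(len(X) / cluster_size))
    kmeans = KMeans(n_clusters=n_clusters)
    kmeans.fit(X)
    centers = kmeans.cluster_centers_
    # Repeat each center cluster_size times so every cluster offers cluster_size "slots"
    centers = centers.reshape(-1, 1, X.shape[-1]).repeat(cluster_size, 1).reshape(-1, X.shape[-1])
    # Distance from every point to every slot
    distance_matrix = cdist(X, centers)
    # Assign each point to one slot at minimum total distance, then map slot index to cluster index
    clusters = linear_sum_assignment(distance_matrix)[1] // cluster_size
    return clusters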
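
To show why this yields equal-sized clusters, here is a minimal sketch of the same slot idea on toy data. The points and the fixed centers are made up for illustration (no KMeans fit is involved), and np.repeat is used as what I believe is an equivalent shorthand for the reshape/repeat chain above.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment

# Toy data (assumed for illustration): four 1-D points and two fixed centers.
points = np.array([[0.0], [0.1], [0.2], [5.0]])
centers = np.array([[0.0], [5.0]])
cluster_size = 2

# Repeat each center cluster_size times:
# slots 0-1 belong to center 0, slots 2-3 belong to center 1.
slots = np.repeat(centers, cluster_size, axis=0)

# cost[i, j] is the distance from point i to slot j.
cost = cdist(points, slots)

# Every point is matched to a distinct slot, so no center can
# receive more than cluster_size points.
row_ind, col_ind = linear_sum_assignment(cost)

# Integer division maps a slot index back to its cluster index.
labels = col_ind // cluster_size

print(labels)               # cluster label per point
print(np.bincount(labels))  # number of points per cluster
```

Three of the four points lie near the first center, but only two slots exist there, so the assignment has to push one of them to the second cluster. Note that nothing in the cost matrix knows about identical rows, which is why duplicates tend to land in the same cluster.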

I can easily apply this function and get the clusters:

# Combine latitude, longitude and floor into a single array
X = np.column_stack((df['latitude'], df['longitude'], df['floor']))

# Apply clustering
clusters = get_even_clusters(X, cluster_size=9)

# Add clusters to the dataframe
df['cluster'] = clusters

df.sort_values(by=["cluster", "latitude", "longitude", "floor"])

My problem:

How can I prevent people who share the same latitude, longitude and floor values from being placed in the same cluster? Ideally, they should be distributed across different clusters.
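
To make the problem concrete, the following sketch lists the rows that share identical coordinates with another row in the same cluster. The helper name colocated_duplicates and the toy DataFrame are assumptions for illustration only; it is a diagnostic, not a fix for the clustering itself.

```python
import pandas as pd

def colocated_duplicates(df, keys, cluster_col="cluster"):
    """Return rows whose key values also occur on another row of the same cluster.

    Hypothetical helper: `keys` are the coordinate columns, `cluster_col`
    is the column that holds the cluster label.
    """
    # keep=False marks every member of a duplicate group, not only the repeats.
    mask = df.duplicated(subset=list(keys) + [cluster_col], keep=False)
    return df[mask]

# Toy data (assumed): rows 0 and 1 are identical and share cluster 0,
# rows 2 and 3 are identical but sit in different clusters.
toy = pd.DataFrame({
    "latitude": [1.0, 1.0, 2.0, 2.0],
    "longitude": [3.0, 3.0, 4.0, 4.0],
    "floor": [1, 1, 2, 2],
    "cluster": [0, 0, 0, 1],
})

offenders = colocated_duplicates(toy, ["latitude", "longitude", "floor"])
print(offenders)
```

On my real data, the same call with df in place of toy should return an empty DataFrame once the duplicates are spread across clusters.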
