Clustering data using scipy and a distance matrix in Python


I am working in Python. I am using a binary dataframe containing a set of 0/1 values for different users at different times.

I can perform hierarchical clustering directly from the dataframe as

    from scipy.cluster.hierarchy import linkage, dendrogram
    import matplotlib.pyplot as plt

    metodo = 'average'
    clusters = linkage(user_df, method=metodo, metric='hamming')

    # Create a dendrogram
    plt.figure(figsize=(10, 7))
    dendrogram(clusters, labels=user_df.index, leaf_rotation=90)
    plt.title('Hierarchical Clustering Dendrogram')
    plt.xlabel('User')
    plt.ylabel('Distance')

    # Save the figure
    plt.savefig(f'dendrogram_{metodo}_entero.png')
    plt.show()

However, I want to separate the calculation of the distance matrix from the clustering. To do that, I calculated the distance matrix and passed it as an argument to the clustering function.

    from scipy.spatial.distance import pdist, squareform
    import pandas as pd

    dist_matrix = pdist(user_df.values, metric='hamming')

    # Convert the distance matrix to a square form
    dist_matrix_square = squareform(dist_matrix)

    # Create a DataFrame from the distance matrix
    dist_df = pd.DataFrame(dist_matrix_square, index=user_df.index, columns=user_df.index)

    clusters = linkage(dist_df, method=metodo)

Unfortunately, the two approaches give different results. As far as I know, the first code is the correct one.

So I don't know whether it is possible to compute the distance matrix separately and then pass it as an argument to the clustering.


There are 2 answers

Warren Weckesser (best answer)

pdist returns a NumPy array that is the condensed distance matrix. You can pass this form of the distance matrix directly to linkage. Don't convert it to a Pandas DataFrame: when linkage receives a 2-D input such as your square DataFrame, it treats each row as an observation vector rather than as precomputed distances, which is why your results differed.

So your code could be as simple as:

dist_matrix = pdist(user_df.values, metric='hamming')
clusters = linkage(dist_matrix, method=metodo)
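
As a quick sanity check (a minimal sketch using made-up binary data in place of user_df.values), both routes produce the same linkage matrix, since linkage computes the same condensed distances internally when given raw observations:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

# Small binary dataset standing in for user_df.values (hypothetical data)
X = np.array([[0, 1, 1, 0],
              [0, 1, 0, 0],
              [1, 0, 1, 1],
              [1, 0, 0, 1]])

# Direct call: linkage computes the Hamming distances internally
direct = linkage(X, method='average', metric='hamming')

# Two-step call: pass the condensed distance matrix from pdist
condensed = pdist(X, metric='hamming')
two_step = linkage(condensed, method='average')

print(np.allclose(direct, two_step))  # → True
```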
L Maxime

There are multiple clustering algorithms, so it is normal that different algorithms give different results. You can check out the scikit-learn clustering documentation for an overview.

As for your question, here are three examples (that I have not tested), where distance_matrix is the distance matrix you computed:

Agglomerative Hierarchical Clustering (SciPy):

from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Create a linkage matrix from the distance matrix
linkage_matrix = linkage(distance_matrix, method='ward')

# Obtain cluster assignments
clusters = fcluster(linkage_matrix, t=threshold, criterion='distance')
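
To make the snippet above concrete, here is an untested-style sketch with made-up binary data and a hypothetical threshold of 0.5; it uses method='average' rather than 'ward', since 'ward' assumes Euclidean distances:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical binary data (four users, four time slots)
X = np.array([[0, 1, 1, 0],
              [0, 1, 0, 0],
              [1, 0, 1, 1],
              [1, 0, 0, 1]])

# Condensed Hamming distance matrix, then hierarchical clustering
condensed = pdist(X, metric='hamming')
linkage_matrix = linkage(condensed, method='average')

# Cut the dendrogram at distance 0.5 (hypothetical threshold)
clusters = fcluster(linkage_matrix, t=0.5, criterion='distance')
print(clusters)  # two flat clusters: users {0, 1} and users {2, 3}
```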

K-Means Clustering (scikit-learn):

from sklearn.cluster import KMeans

# Specify the number of clusters (n_clusters)
kmeans = KMeans(n_clusters=num_clusters, random_state=seed)

# Fit the model (caveat: KMeans does not support precomputed distances;
# here each row of the distance matrix is treated as a feature vector,
# which is a heuristic rather than clustering on the distances themselves)
kmeans.fit(distance_matrix)

# Obtain cluster assignments
clusters = kmeans.labels_

DBSCAN (scikit-learn):

from sklearn.cluster import DBSCAN

# Specify epsilon (eps) and minimum samples (min_samples)
dbscan = DBSCAN(eps=epsilon, min_samples=min_samples, metric='precomputed')

# Fit the model to the distance matrix
dbscan.fit(distance_matrix)

# Obtain cluster assignments (Note: -1 indicates noise/outliers)
clusters = dbscan.labels_
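
One caveat worth noting for the DBSCAN variant: with metric='precomputed', fit expects the square (n x n) distance matrix, not the condensed 1-D array that pdist returns. A minimal sketch with made-up binary data and hypothetical eps/min_samples values:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN

# Hypothetical binary data (four users, four time slots)
X = np.array([[0, 1, 1, 0],
              [0, 1, 0, 0],
              [1, 0, 1, 1],
              [1, 0, 0, 1]])

# pdist returns a condensed (1-D) matrix; DBSCAN's 'precomputed'
# metric expects the square form, so convert with squareform first
square = squareform(pdist(X, metric='hamming'))

dbscan = DBSCAN(eps=0.5, min_samples=2, metric='precomputed')
labels = dbscan.fit_predict(square)
print(labels)  # -1 would mark noise/outlier points
```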

Edit:

According to scipy.cluster.hierarchy.linkage documentation, the array can be:

  • A condensed distance matrix. A condensed distance matrix is a flat array containing the upper triangular part of the distance matrix. This is the form that pdist returns.
  • Alternatively, a collection of m observation vectors in n dimensions may be passed as an m by n array.