Why is the height of a dendrogram based on a correlation matrix higher than 1?


I am new to Python. I am removing multicollinear features by computing the correlation matrix between features, following the approach in https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance_multicollinear.html#permutation-importance-with-multicollinear-or-correlated-features. My code is below:

import numpy as np
from scipy.stats import spearmanr
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

corr = spearmanr(data_all[feature_name]).correlation

# Ensure the correlation matrix is symmetric
corr = (corr + corr.T) / 2
np.fill_diagonal(corr, 1)

# Convert the correlation matrix to a distance matrix before performing
# hierarchical clustering using Ward's linkage.
distance_matrix = 1 - np.abs(corr)
y = squareform(distance_matrix)  # condensed form expected by linkage
dist_linkage = linkage(y, method='ward')
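For reference, here is a self-contained sketch of the same workflow on synthetic data (the data, the deliberately duplicated feature, and the cut threshold of 0.5 are my own assumptions, not values from my actual dataset), including the cluster-cutting step from the linked scikit-learn example:

```python
import numpy as np
from collections import defaultdict
from scipy.stats import spearmanr
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic stand-in for data_all[feature_name]: 100 samples, 6 features,
# with features 0 and 1 deliberately near-duplicates of each other.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=100)

corr = spearmanr(X).correlation
corr = (corr + corr.T) / 2          # ensure exact symmetry
np.fill_diagonal(corr, 1)

dist = 1 - np.abs(corr)             # correlation -> distance
y = squareform(dist, checks=False)  # condensed form expected by linkage
Z = linkage(y, method="ward")

# Cut the tree at an arbitrary threshold and keep one feature per cluster.
cluster_ids = fcluster(Z, t=0.5, criterion="distance")
clusters = defaultdict(list)
for idx, cid in enumerate(cluster_ids):
    clusters[cid].append(idx)
selected = [members[0] for members in clusters.values()]
print(selected)  # one representative index per cluster
```

With this setup the two near-duplicate features end up in one cluster, so only one of them survives the selection.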

I got a result as below:

[dendrogram of the clustered features]

The height of the dendrogram is greater than 1, even though the maximum value of y is below 1. Is this result reasonable? If so, how is the merge height defined in hierarchy.linkage?

I double-checked that the maximum value of y is indeed smaller than 1. My understanding was that, whatever the linkage method (single, complete, average, or Ward), the merge distances should also stay below the maximum of the input y. Moreover, the scikit-learn example linked above also shows a dendrogram taller than 1. I am quite confused by these results.
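A minimal check I tried (using ten equally spaced 1-D points as synthetic input, an assumption unrelated to my real data) suggests that Ward merge heights are not capped by the largest input distance, unlike single/complete/average linkage:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

# Ten equally spaced points on [0, 1]; every pairwise distance is <= 1.
x = np.linspace(0, 1, 10).reshape(-1, 1)
y = pdist(x)                   # condensed distances, all within [0, 1]
Z = linkage(y, method="ward")

print(y.max())                 # 1.0
print(Z[:, 2].max())           # exceeds 1.0: Ward heights are not bounded by max(y)
```

So the merge heights in column 2 of the linkage matrix can grow past the largest pairwise distance, even though the heights themselves are still monotonically non-decreasing.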


There are 0 answers