I am working with a set of species counts (counts) from several different sample stations (stations). I have calculated the Bray-Curtis dissimilarity between every possible pair of sample stations using the pw_distances function from scikit-bio. This produces a distance matrix with values bounded between 0 and 1. So far so good.
I want to use that distance matrix to produce a dendrogram showing how the sample stations cluster together. I am doing this using scipy's hierarchy.linkage function to find the linkages for the dendrogram, and then plotting with hierarchy.dendrogram.
Here's my code:
from skbio.diversity.beta import pw_distances
from scipy.cluster import hierarchy
bc_dm = pw_distances(counts, stations, metric = "braycurtis")
# use (1 - bc_dm) to get similarity rather than dissimilarity
sim = 1 - bc_dm.data
Z = hierarchy.linkage(sim, 'ward')
hierarchy.dendrogram(
    Z,
    leaf_rotation=0.,    # rotates the x axis labels
    leaf_font_size=10.,  # font size for the x axis labels
    labels=bc_dm.ids,
    orientation="left"
)
Here is a link to the dendrogram produced by the above code.
As I understand it, the distance on the dendrogram should correspond to the Bray-Curtis similarity (analogous to a distance), but the distance values on my dendrogram reach a maximum of over 30. Is this correct? If not, how can I scale my distances to correspond to the Bray-Curtis similarity between sample stations? If it is correct, what do the distances on the dendrogram really correspond to?
See the links shared in the comments as they address your questions.
One scikit-bio step that isn't covered in those links is that you should call linkage on bc_dm.condensed_form(), rather than on bc_dm or sim. This will get you the input in the format that you need. If you pass a 2D matrix, linkage assumes that it's your counts matrix, and is computing Euclidean distances between your samples based on those data.
Also, be sure to pay attention to the method parameter to scipy.cluster.hierarchy.linkage as that will impact the interpretation of the branch lengths in your dendrogram. The doc string for scipy.cluster.hierarchy.linkage contains details on how these are computed for the different methods.
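To make the condensed-form point concrete, here is a minimal sketch that uses only NumPy and SciPy. The counts array and station names are made up for illustration, and scipy.spatial.distance.pdist stands in for the scikit-bio call (its output plays the role of bc_dm.condensed_form()). The choice of method="average" is also an assumption, picked because its merge heights are mean pairwise dissimilarities and so stay on the Bray-Curtis scale.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist, squareform

# Hypothetical data: 4 stations (rows) x 3 species (columns).
counts = np.array([
    [10, 0, 5],
    [8, 1, 6],
    [0, 12, 3],
    [1, 10, 2],
])
stations = ["A", "B", "C", "D"]

# Condensed (1D) vector of pairwise Bray-Curtis dissimilarities.
# This stands in for bc_dm.condensed_form() from scikit-bio.
condensed = pdist(counts, metric="braycurtis")

# Square (2D) form of the same dissimilarities, like bc_dm.data.
square = squareform(condensed)

# Condensed input: linkage clusters on the Bray-Curtis values directly.
Z_ok = hierarchy.linkage(condensed, method="average")

# Square input: linkage treats each row as an observation vector and
# computes Euclidean distances between rows, so the merge heights are
# no longer Bray-Curtis values. Recent SciPy versions may warn here.
Z_bad = hierarchy.linkage(square, method="average")

# Column 2 of the linkage matrix holds the merge heights.
print("largest input dissimilarity:", condensed.max())
print("merge heights (condensed input):", Z_ok[:, 2])
print("merge heights (square input):   ", Z_bad[:, 2])

# no_plot=True returns the dendrogram layout without needing matplotlib;
# drop it to draw the figure.
tree = hierarchy.dendrogram(
    Z_ok, no_plot=True, labels=stations, orientation="left"
)
print("leaf order:", tree["ivl"])
```

With average linkage on the condensed input, no merge height can exceed the largest pairwise dissimilarity, so the dendrogram axis stays within 0 to 1. Other methods such as 'ward' compute heights differently, so their axis values need not match the raw dissimilarities.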