I have a set of genes which have been aligned and clustered based on DNA sequences, and I have this set of genes in a Newick tree representation (https://en.wikipedia.org/wiki/Newick_format). Does anyone know how to convert this format to the scipy.cluster.hierarchy.linkage matrix format? From the scipy docs for the linkage matrix:
A (n-1) by 4 matrix Z is returned. At the i-th iteration, clusters with indices Z[i, 0] and Z[i, 1] are combined to form cluster n+i. A cluster with an index less than n corresponds to one of the n original observations. The distance between clusters Z[i, 0] and Z[i, 1] is given by Z[i, 2]. The fourth value Z[i, 3] represents the number of original observations in the newly formed cluster.
At least from the scipy docs, their description of how this linkage matrix is structured is rather confusing. What do they mean by "iteration"? Also, how does this representation keep track of which original observations are in which cluster?
I would like to figure out how to do this conversion as the results of other cluster analyses in my project have been done with the scipy representation, and I've been using it for plotting purposes.
I got how the linkage matrix is generated from the tree representation, thanks @cel for clarification. Let's take the example from the Newick wiki page (https://en.wikipedia.org/wiki/Newick_format)
The tree, in string format is:
First, one should compute the distances between all of the leaves. If for example, we wish to compute the distance A and B, the method is to traverse the tree from A to B through the nearest branch. Since in the Newick format, we are given the distance between each leaf and the branch, the distance from A to B is simply
0.1 + 0.2 = 0.3
. For A to D, we would have to do0.1 + (0.5 + 0.4) = 1.0
, since the distance from D to the nearest branch is given as 0.4, and the distance from D's branch to A's is 0.5. Thus the distance matrix looks like this (with indexingA=0
,B=1
,C=2
,D=3
):From here, the linkage matrix is easy to find. Since we already have
n=4
clusters (A
,B
,C
,D
) as original observations, we need to find the additionaln-1
clusters of the tree. Each step simply combines two clusters into a new one, and we take the two clusters that are closest to each other. In this case, A and B are closest together, so the first row of the linkage matrix will look like this:From now on, we treat A & B as one cluster whose distance to the nearest branch is the distance between A & B.
Now we have 3 clusters left,
AB
,C
, andD
. We can update the distance matrix to see which clusters are closest together. LetAB
have index0
in the updated distance matrix.We can now see that C and D are closest to each other, so let's combine them into a new cluster. The second row in the linkage matrix will now be
Now, we only have two clusters left,
AB
andCD
. The distance from these clusters to the root branch is 0.3 and 0.7 respectively, so their distance is 1.0. The final row of the linkage matrix will be:Now, the scipy matrix wouldn't actually have the strings in place as I've shown here, we would have the use the indexing scheme, since we combined A and B first,
AB
would have index 4 andCD
would have index 5. So the actual result we should see in the scipy linkage matrix would be:This is the general way to get from the tree representation to the scipy linkage matrix representation. However, there already exist tools from other python packages to read in trees in Newick format, and from these, we can fairly easily find the distance matrix, and then pass that to scipy's linkage function. Below is a little script that does exactly that for this example.