Newick tree representation to scipy.cluster.hierarchy linkage matrix format

3.2k views Asked by At

I have a set of genes which have been aligned and clustered based on DNA sequences, and I have this set of genes in a Newick tree representation (https://en.wikipedia.org/wiki/Newick_format). Does anyone know how to convert this format to the scipy.cluster.hierarchy.linkage matrix format? From the scipy docs for the linkage matrix:

A (n-1) by 4 matrix Z is returned. At the i-th iteration, clusters with indices Z[i, 0] and Z[i, 1] are combined to form cluster n+i. A cluster with an index less than n corresponds to one of the n original observations. The distance between clusters Z[i, 0] and Z[i, 1] is given by Z[i, 2]. The fourth value Z[i, 3] represents the number of original observations in the newly formed cluster.

At least from the scipy docs, their description of how this linkage matrix is structured is rather confusing. What do they mean by "iteration"? Also, how does this representation keep track of which original observations are in which cluster?

I would like to figure out how to do this conversion as the results of other cluster analyses in my project have been done with the scipy representation, and I've been using it for plotting purposes.

3

There are 3 answers

4
themantalope On BEST ANSWER

I got how the linkage matrix is generated from the tree representation, thanks @cel for clarification. Let's take the example from the Newick wiki page (https://en.wikipedia.org/wiki/Newick_format)

The tree, in string format is:

(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);

First, one should compute the distances between all of the leaves. If for example, we wish to compute the distance A and B, the method is to traverse the tree from A to B through the nearest branch. Since in the Newick format, we are given the distance between each leaf and the branch, the distance from A to B is simply 0.1 + 0.2 = 0.3. For A to D, we would have to do 0.1 + (0.5 + 0.4) = 1.0, since the distance from D to the nearest branch is given as 0.4, and the distance from D's branch to A's is 0.5. Thus the distance matrix looks like this (with indexing A=0, B=1, C=2, D=3):

distance_matrix=
 [[0.0, 0.3, 0.9, 1.0],
  [0.3, 0.0, 1.0, 1.1],
  [0.9, 1.0, 0.0, 0.7],
  [1.0, 1.1, 0.1, 0.0]]

From here, the linkage matrix is easy to find. Since we already have n=4 clusters (A,B,C,D) as original observations, we need to find the additional n-1 clusters of the tree. Each step simply combines two clusters into a new one, and we take the two clusters that are closest to each other. In this case, A and B are closest together, so the first row of the linkage matrix will look like this:

[A,B,0.3,2]

From now on, we treat A & B as one cluster whose distance to the nearest branch is the distance between A & B.

Now we have 3 clusters left, AB, C, and D. We can update the distance matrix to see which clusters are closest together. Let AB have index 0 in the updated distance matrix.

distance_matrix=
[[0.0, 1.1, 1.2],
 [1.1, 0.0, 0.7],
 [1.2, 0.7, 0.0]]

We can now see that C and D are closest to each other, so let's combine them into a new cluster. The second row in the linkage matrix will now be

[C,D,0.7,2]

Now, we only have two clusters left, AB and CD. The distance from these clusters to the root branch is 0.3 and 0.7 respectively, so their distance is 1.0. The final row of the linkage matrix will be:

[AB,CD,1.0,4]

Now, the scipy matrix wouldn't actually have the strings in place as I've shown here, we would have the use the indexing scheme, since we combined A and B first, AB would have index 4 and CD would have index 5. So the actual result we should see in the scipy linkage matrix would be:

[[0,1,0.3,2],
 [2,3,0.7,2],
 [4,5,1.0,4]]

This is the general way to get from the tree representation to the scipy linkage matrix representation. However, there already exist tools from other python packages to read in trees in Newick format, and from these, we can fairly easily find the distance matrix, and then pass that to scipy's linkage function. Below is a little script that does exactly that for this example.

from ete2 import ClusterTree, TreeStyle
import scipy.cluster.hierarchy as sch
import scipy.spatial.distance
import matplotlib.pyplot as plt
import numpy as np
from itertools import combinations


tree = ClusterTree('(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);')
leaves = tree.get_leaf_names()
ts = TreeStyle()
ts.show_leaf_name=True
ts.show_branch_length=True
ts.show_branch_support=True

idx_dict = {'A':0,'B':1,'C':2,'D':3}
idx_labels = [idx_dict.keys()[idx_dict.values().index(i)] for i in range(0, len(idx_dict))]

#just going through the construction in my head, this is what we should get in the end
my_link = [[0,1,0.3,2],
        [2,3,0.7,2],
        [4,5,1.0,4]]

my_link = np.array(my_link)


dmat = np.zeros((4,4))

for l1,l2 in combinations(leaves,2):
    d = tree.get_distance(l1,l2)
    dmat[idx_dict[l1],idx_dict[l2]] = dmat[idx_dict[l2],idx_dict[l1]] = d

print 'Distance:'
print dmat


schlink = sch.linkage(scipy.spatial.distance.squareform(dmat),method='average',metric='euclidean')

print 'Linkage from scipy:'
print schlink

print 'My link:'
print my_link

print 'Did it right?: ', schlink == my_link

dendro = sch.dendrogram(my_link,labels=idx_labels)
plt.show()

tree.show(tree_style=ts)
2
MrTomRod On

I found this solution:

import numpy as np
import pandas as pd
from ete3 import ClusterTree
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage
import logging


def newick_to_linkage(newick: str, label_order: [str] = None) -> (np.ndarray, [str]):
    """
    Convert newick tree into scipy linkage matrix

    :param newick: newick string, e.g. '(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);'
    :param label_order: list of labels, e.g. ['A', 'B', 'C']
    :returns: linkage matrix and list of labels
    """
    # newick string -> cophenetic_matrix
    tree = ClusterTree(newick)
    cophenetic_matrix, newick_labels = tree.cophenetic_matrix()
    cophenetic_matrix = pd.DataFrame(cophenetic_matrix, columns=newick_labels, index=newick_labels)

    if label_order is not None:
        # sanity checks
        missing_labels = set(label_order).difference(set(newick_labels))
        superfluous_labels = set(newick_labels).difference(set(label_order))
        assert len(missing_labels) == 0, f'Some labels are not in the newick string: {missing_labels}'
        if len(superfluous_labels) > 0:
            logging.warning(f'Newick string contains unused labels: {superfluous_labels}')

        # reorder the cophenetic_matrix
        cophenetic_matrix = cophenetic_matrix.reindex(index=label_order, columns=label_order)

    # reduce square distance matrix to condensed distance matrices
    pairwise_distances = squareform(cophenetic_matrix)

    # return linkage matrix and labels
    return linkage(pairwise_distances), list(cophenetic_matrix.columns)

Basic usage:

>>> linkage_matrix, labels = newick_to_linkage(
...     newick='(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);'
... )
>>> print(linkage_matrix)
[[0.  1.  0.3 2. ]
 [2.  3.  0.7 2. ]
 [4.  5.  0.9 4. ]]
>>> print(labels)
['A', 'B', 'C', 'D']

What the cophenetic matrix looks like:

>>> print(cophenetic_matrix)
     A    B    C    D
A  0.0  0.3  0.9  1.0
B  0.3  0.0  1.0  1.1
C  0.9  1.0  0.0  0.7
D  1.0  1.1  0.7  0.0

Advanced usage:

>>> linkage_matrix, labels = newick_to_linkage(
...     newick='(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);',
...     label_order=['C', 'B', 'A']
... )
WARNING:root:Newick string contains unused labels: {'D'}
>>> print(linkage_matrix)
[[1.  2.  0.3 2. ]
 [0.  3.  0.9 3. ]]
>>> print(labels)
['C', 'B', 'A']
0
gbinux On

Here is my solution to this question. The main difference from the above answers is that it works directly from the Newick tree, and circumvents building distance matrix and clustering based on the distance matrix. The idea comes from

  1. linkage matrix definition from scipy link
  2. an example from scikit-learn documentation link

The function

from Bio import Phylo
import numpy as np
import io

def newick_string_to_linkage_matrix(newick_str):
    # definition of linkage matrix: 
    # https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html
    
    # create tree object via Biopython (Bio.Phylo)
    tree = next(Phylo.parse(io.StringIO(newick_str), format="newick"))
    # calculate tree height
    tree_height = max(tree.distance(c) for c in tree.find_clades(terminal=True))
    # add comment with id and span of terminal nodes:
    id_map = {}
    for i, c in enumerate(tree.find_clades(terminal=True)):
        c.comment = (i, 1) 
        id_map = {i: c.name}
    # ancestor list orderred by distance from the root
    anc_lst = []
    for c in tree.find_clades(terminal=False):
        d = tree.distance(c)
        anc_lst.append((c, list(c), d))
    anc_lst.sort(key=lambda x:x[2], reverse=True)
    # running number of node
    nodes = len(list(tree.find_clades(terminal=True)))
    lnk_lst = []
    for anc,children, anc_d in anc_lst:
        n_children = len(children)
        assert n_children>=2
        child1 = children[0]
        for child2 in children[1:]:
            id1, n_leaves1 = child1.comment
            id2, n_leaves2 = child2.comment
            total_leaves = n_leaves1 + n_leaves2
            anc.comment = (nodes, total_leaves)
            distance = tree_height - anc_d
            nodes += 1
            row = [id1, id2, distance, total_leaves]
            lnk_lst.append(row)
            child1 = anc
    return np.array(lnk_lst), id_map

Example

from scipy.cluster.hierarchy import dendrogram
newick_str="((A:1,B:1):5,(C:5,((D:2,E:2):2,(F:3,G:3):1):1):1);"
lnk_mt, _ = newick_string_to_linkage_matrix(newick_str)
_ = dendrogram(lnk_mt)

enter image description here