Dendrogram not arranging by fclusters appropriately

31 views Asked by At

Using the output .pickle file derived from dRep, I extracted the linkage matrix and a dissimilarity db from my ani matrix (produced by dRep). dRrep produces a dendrogram that I would like to reproduce, then circularize, but I cannot reproduce the dendrogram.

For ease of checking whether or not the dengrogram clusters the genomes together based on the fclusters (which are correctly grouped by 95% ani/5% dissimilarity), I edited the genome names in the db such that they begin with the fcluster to which they belong. While I do get the correct fclusters when I display(fcluster) (i.e., it matches the matrix information correctly as well as the drep dendrogram), I do not get a dendrogram that clusters the genomes together at 5% dissimilarity (e.g., by fcluster).

I end up with samples that share even 0% ANI (100% dissimilarity), but usually it's more subtle differences (maybe 90% similiarty). In any event, genomes that should cluster are not, and genomes that shouldn't are.

here is the dendrogram output I am getting. You can see that some fcluster groups that should be together are, but not all, and many are embedded with groups that shouldn't be. I want this tree to group together those samples that share 95% ANI.

#python
import pickle
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
import scipy.cluster.hierarchy as sch
import logging
import os
import sys

import numpy as np
import scipy.cluster
from scipy.spatial import distance as ssd

import drep.d_cluster.utils

pickle_file_path = 'secondary_linkage_cluster_0.pickle'
with open(pickle_file_path, 'rb') as file:
    # Load the linkage matrix (first value)
    linkage_matrix = pickle.load(file)

    # Load the db (second value)
    db = pickle.load(file)

    # Load the dictionary of arguments (third value)
    arguments = pickle.load(file)

#display(db)
db.to_csv('db.csv')

import re

db_clean = db.copy()

# Clean column names
db_clean.columns = [re.sub(r'(metaSpades_|_phage.*|\.phage.*|_scaffolds.*)', '', col) for col in db_clean.columns]

# Clean row names
db_clean.index = [re.sub(r'(metaSpades_|_phage.*|\.phage.*|_scaffolds.*)', '', index) for index in db_clean.index]

csv_data = pd.read_csv('Cdb_metagenome_species_fromAll_clean.csv')
# Iterate through each row in csv_data
for index, row in csv_data.iterrows():
    genome_label = row['genome']
    secondary_cluster_label = row['secondary_cluster']

    # Check if the genome_label is in both rows and columns of db_clean
    if genome_label in db_clean.index and genome_label in db_clean.columns:
        # Update the row and column names in db_clean
        db_clean = db_clean.rename(index={genome_label: f'{secondary_cluster_label}_{genome_label}'})
        db_clean = db_clean.rename(columns={genome_label: f'{secondary_cluster_label}_{genome_label}'})


#display(db_clean)
display(arguments)
{'linkage_method': 'single',
 'linkage_cutoff': 0.050000000000000044,
 'comparison_algorithm': 'ANImf',
 'minimum_coverage': 0.5}

# Use the same linkage_method and linkage_cutoff as used by dRep
linkage_method = arguments['linkage_method']
linkage_cutoff = arguments['linkage_cutoff']

fclust = fcluster(linkage_matrix, linkage_cutoff, criterion='distance')
#fclust = scipy.cluster.hierarchy.fcluster(linkage, linkage_cutoff, criterion='distance')


dissimilarity_threshold=0.14
cluster_threshold = 0.05
# Plot the dendrogram
fig, ax = plt.subplots(figsize=(60, 10))
dendrogram_result=dendrogram(linkage_matrix, labels=db_clean.index, orientation='top', leaf_rotation=90, leaf_font_size=8,color_threshold=dissimilarity_threshold)
plt.ylim([0, dissimilarity_threshold])  # Set x-axis limits
# Add a dotted line at the additional dissimilarity threshold
ax.axhline(y=cluster_threshold, color='black', linestyle='dotted')
0

There are 0 answers