Using the output .pickle file derived from dRep, I extracted the linkage matrix and a dissimilarity db from my ani matrix (produced by dRep). dRrep produces a dendrogram that I would like to reproduce, then circularize, but I cannot reproduce the dendrogram.
For ease of checking whether or not the dengrogram clusters the genomes together based on the fclusters (which are correctly grouped by 95% ani/5% dissimilarity), I edited the genome names in the db such that they begin with the fcluster to which they belong. While I do get the correct fclusters when I display(fcluster) (i.e., it matches the matrix information correctly as well as the drep dendrogram), I do not get a dendrogram that clusters the genomes together at 5% dissimilarity (e.g., by fcluster).
I end up with samples that share even 0% ANI (100% dissimilarity), but usually it's more subtle differences (maybe 90% similiarty). In any event, genomes that should cluster are not, and genomes that shouldn't are.
here is the dendrogram output I am getting. You can see that some fcluster groups that should be together are, but not all, and many are embedded with groups that shouldn't be. I want this tree to group together those samples that share 95% ANI.
#python
import pickle
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
import scipy.cluster.hierarchy as sch
import logging
import os
import sys
import numpy as np
import scipy.cluster
from scipy.spatial import distance as ssd
import drep.d_cluster.utils
pickle_file_path = 'secondary_linkage_cluster_0.pickle'
with open(pickle_file_path, 'rb') as file:
# Load the linkage matrix (first value)
linkage_matrix = pickle.load(file)
# Load the db (second value)
db = pickle.load(file)
# Load the dictionary of arguments (third value)
arguments = pickle.load(file)
#display(db)
db.to_csv('db.csv')
import re
db_clean = db.copy()
# Clean column names
db_clean.columns = [re.sub(r'(metaSpades_|_phage.*|\.phage.*|_scaffolds.*)', '', col) for col in db_clean.columns]
# Clean row names
db_clean.index = [re.sub(r'(metaSpades_|_phage.*|\.phage.*|_scaffolds.*)', '', index) for index in db_clean.index]
csv_data = pd.read_csv('Cdb_metagenome_species_fromAll_clean.csv')
# Iterate through each row in csv_data
for index, row in csv_data.iterrows():
genome_label = row['genome']
secondary_cluster_label = row['secondary_cluster']
# Check if the genome_label is in both rows and columns of db_clean
if genome_label in db_clean.index and genome_label in db_clean.columns:
# Update the row and column names in db_clean
db_clean = db_clean.rename(index={genome_label: f'{secondary_cluster_label}_{genome_label}'})
db_clean = db_clean.rename(columns={genome_label: f'{secondary_cluster_label}_{genome_label}'})
#display(db_clean)
display(arguments)
{'linkage_method': 'single',
'linkage_cutoff': 0.050000000000000044,
'comparison_algorithm': 'ANImf',
'minimum_coverage': 0.5}
# Use the same linkage_method and linkage_cutoff as used by dRep
linkage_method = arguments['linkage_method']
linkage_cutoff = arguments['linkage_cutoff']
fclust = fcluster(linkage_matrix, linkage_cutoff, criterion='distance')
#fclust = scipy.cluster.hierarchy.fcluster(linkage, linkage_cutoff, criterion='distance')
dissimilarity_threshold=0.14
cluster_threshold = 0.05
# Plot the dendrogram
fig, ax = plt.subplots(figsize=(60, 10))
dendrogram_result=dendrogram(linkage_matrix, labels=db_clean.index, orientation='top', leaf_rotation=90, leaf_font_size=8,color_threshold=dissimilarity_threshold)
plt.ylim([0, dissimilarity_threshold]) # Set x-axis limits
# Add a dotted line at the additional dissimilarity threshold
ax.axhline(y=cluster_threshold, color='black', linestyle='dotted')