Is there a way to calculate an overlap percentage/score from the perspective of the smaller chemical structure?


I'm trying to identify the individual components of a hetero-molecule. For example, for A = B + C + D, taking A as the reference, I want to find B from B_list = [B1, B2, B3, ..., Bn]. Identifying B works using A.HasSubstructMatch(Bi) from RDKit for each Bi in B_list, but it returns True only when the match is 100%. Can I identify the B from B_list that matches A at, say, 99%?

I can't find a way to score the items in B_list against A. Fingerprint (FP) similarity scores do not help, because I want a score purely from the perspective of B, and the FP metrics are symmetric: they give the same value whichever molecule is treated as the reference.

I tried an FP metric, i.e. DataStructs.FingerprintSimilarity(A, B), but it does not work as I expected. I computed the FP similarity of each B in B_list against A and used a cut-off to filter the list. However, the B with the best FP score (see below) is not a substructure of A. It is something else.

For reference: A = 'COC1=CC(C2=CN(C)C(=O)C3=CN=CC=C23)=CC(OC)=C1CN1CCN(CCOCCOCC(=O)NCC2=CC=C(S(=O)(=O)NC3=CC=CC4=C3[NH]C=C4Cl)C=C2)CC1'

B_with_best_FP = 'CCOC1=CC(C(C)(C)C)=CC=C1C1=N[C@@](C)(C2=CC=C(Cl)C=C2)[C@@](C)(C2=CC=C(Cl)C=C2)N1C(=O)N1CCN(CCCS(C)(=O)=O)CC1'

My_ideal_B = the B from B_list that has the largest substructure match with A, together with a score. B_with_best_FP is not that one.

B_list = https://drive.google.com/file/d/1BjcY8BLJOE98-OQT_zRSoRr7tWcO72Nt/view?usp=sharing


There is 1 answer

user4959 On

Use graph cliques to compute combined 2D similarities.

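The maximum clique algorithm is applied to the composite graph of the two molecules to find its largest complete subgraph. The composite graph is commonly built as a modular product, in which each node pairs an atom of one molecule with a compatible atom of the other, and two nodes are connected when the pairings agree with each other. The maximum clique then corresponds to the largest common substructure shared by the two molecules.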
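To make the idea concrete, here is a minimal pure-Python sketch, not the answerer's own code. It assumes the composite graph is the modular product of two atom-labeled graphs and finds a maximum clique with Bron-Kerbosch. The function names, the toy ethanol/ether inputs, and the choice to ignore bond orders are all assumptions made for illustration. Note that a clique found this way corresponds to a common induced subgraph that may be disconnected, and that the search is exponential in the worst case, so it is only practical for small molecules.

```python
from itertools import product


def modular_product(labels_a, bonds_a, labels_b, bonds_b):
    """Build the modular product of two atom-labeled graphs.

    Nodes are pairs (i, j) of atoms with identical labels. Two nodes are
    adjacent when they use different atoms in both molecules and the
    bonded / not-bonded relation is the same in both molecules.
    """
    nodes = [(i, j)
             for i, j in product(range(len(labels_a)), range(len(labels_b)))
             if labels_a[i] == labels_b[j]]
    adj = {node: set() for node in nodes}
    for (i, j) in nodes:
        for (k, l) in nodes:
            if i == k or j == l:
                continue  # an atom may be mapped only once
            bonded_a = frozenset((i, k)) in bonds_a
            bonded_b = frozenset((j, l)) in bonds_b
            if bonded_a == bonded_b:
                adj[(i, j)].add((k, l))
    return adj


def max_clique(adj):
    """Return one maximum clique (Bron-Kerbosch with pivoting)."""
    best = set()

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = set(r)
            return
        if len(r) + len(p) <= len(best):
            return  # this branch cannot beat the best clique found so far
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return best


# Toy inputs (hypothetical): heavy atoms only, bond orders ignored.
ethanol = (["C", "C", "O"], {frozenset((0, 1)), frozenset((1, 2))})  # C-C-O
ether = (["C", "O", "C"], {frozenset((0, 1)), frozenset((1, 2))})    # C-O-C

clique = max_clique(modular_product(*ethanol, *ether))
print(len(clique))  # number of atoms in the largest common substructure
```

Each pair (i, j) in the returned clique maps atom i of the first molecule to atom j of the second, so the clique size is the atom count of the common substructure.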

Tanimoto coefficient: The size of the maximum clique can be used to calculate the Tanimoto coefficient, a measure of similarity between the two molecules. If a and b are the sizes of the two molecules and c is the size of the maximum clique, the Tanimoto coefficient is T = c / (a + b - c), i.e. the size of the maximum clique divided by the sum of the sizes of the two molecules minus the size of the maximum clique.
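The Tanimoto coefficient above is symmetric, while the question asks for a score from the perspective of B. One simple option, offered here as an assumption rather than as part of the original answer, is to divide the common-substructure size by the size of B alone, which gives the fraction of B that is found in A. The sketch below computes both scores from atom counts. The helper mcs_overlap is a hypothetical convenience wrapper that obtains the counts from RDKit's rdFMCS.FindMCS, which is a swapped-in maximum common substructure search and not the clique algorithm itself. RDKit is imported lazily so the arithmetic part runs without it.

```python
def overlap_scores(n_common, n_a, n_b):
    """Return (tanimoto, fraction_of_b) from atom counts.

    n_common: atoms in the largest common substructure
    n_a, n_b: atoms in molecule A and in molecule B
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("molecule sizes must be positive")
    if not 0 <= n_common <= min(n_a, n_b):
        raise ValueError("n_common must lie between 0 and min(n_a, n_b)")
    tanimoto = n_common / (n_a + n_b - n_common)  # symmetric in A and B
    fraction_of_b = n_common / n_b                # score from B's perspective
    return tanimoto, fraction_of_b


def mcs_overlap(smiles_a, smiles_b, timeout=10):
    """Hypothetical helper: score B against A via RDKit's FindMCS."""
    from rdkit import Chem            # imported lazily; requires RDKit
    from rdkit.Chem import rdFMCS

    mol_a = Chem.MolFromSmiles(smiles_a)
    mol_b = Chem.MolFromSmiles(smiles_b)
    if mol_a is None or mol_b is None:
        raise ValueError("could not parse SMILES")
    # timeout is in seconds; the search may stop early on large molecules
    result = rdFMCS.FindMCS([mol_a, mol_b], timeout=timeout)
    return overlap_scores(result.numAtoms,
                          mol_a.GetNumHeavyAtoms(),
                          mol_b.GetNumHeavyAtoms())


# Made-up counts for illustration: 20 shared atoms, A has 60, B has 22.
tanimoto, fraction_of_b = overlap_scores(20, 60, 22)
print(round(tanimoto, 3), round(fraction_of_b, 3))
```

With these made-up counts the Tanimoto coefficient is low because A is much larger than B, while the fraction of B covered is high, which is the behavior wanted when ranking the entries of B_list against A. A cut-off such as 0.99 on fraction_of_b would then play the role of the "99% match" in the question.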