Bioisosteric replacement using SMARTS (KNIME and RDKit)

I am trying to create a KNIME workflow that would accept a list of compounds and carry out bioisosteric replacements (we will use the following example here: carboxylic acid to tetrazole) automatically.

NOTE: I am using the following workflow as inspiration: RDKit-bioisosteres (myexperiment.org). It uses a text file as SMARTS input, and I cannot seem to replicate the SMARTS format used there.

For this, I plan to use the RDKit One Component Reaction node, which takes a set of compounds as input together with a SMARTS string that defines the reaction to apply to them.

My issue is the generation of a working SMARTS string describing the reaction.

I would like to input two SDF files (or another format; I am not particularly attached to SDF): one with the group to replace (carboxylic acid) and one with the list of possible bioisosteric replacements (tetrazole). I would then combine these two in KNIME and generate a SMARTS string for the reaction, to be used in the RDKit One Component Reaction node.

NOTE: The input SDF files have the structures written with an attachment point (*COOH for the carboxylic acid for example) which defines where the group to replace is attached. I suspect this is the cause of many of the issues I am experiencing.

So far, I can easily generate the reactions in RXN format using the Reaction Builder node from the Indigo node package. However, converting this reaction into a SMARTS string that is accepted by the RDKit One Component Reaction node has proven tricky.

What I have tried so far:

  1. Converting RXN to SMARTS (Molecule Type Cast node): gives the following error: scanner: BufferScanner::read() error

  2. Converting the Source and Target molecules into SMARTS (Molecule Type Cast node): gives the following error: SMILES loader: unrecognised lowercase symbol: y

    • Showing this as a string in KNIME reveals that the conversion is not carried out and the string is still in SDF format: *filename*.sdf 0 0 0 0 0 0 0 V3000M V30 BEGIN etc.

  3. Converting the Source and Target molecules into RDKit molecules first (RDKit From Molecule node), then from RDKit into SMARTS (RDKit To Molecule node, SMARTS option). This outputs the following SMARTS strings:

    • Carboxylic acid : [#6](-[#8])=[#8]
    • Tetrazole : [#6]1:[#7H]:[#7]:[#7]:[#7]:1

This is as close as I've managed to get. I can then join these two SMARTS strings with >> in between (output: [#6](-[#8])=[#8]>>[#6]1:[#7H]:[#7]:[#7]:[#7]:1) to create a reaction SMARTS string, but this is not accepted as input by the RDKit One Component Reaction node.

Error message in the KNIME console:

ERROR RDKit One Component Reaction 0:40 Creation of Reaction from SMARTS value failed: null
WARN RDKit One Component Reaction 0:40 Invalid Reaction SMARTS: missing

Note that the SMARTS strings that this last option (3.) generates are very different from the ones used in the myexperiment.org example ([*:1][C:2]([OH])=O>>[*:1][C:2]1=NNN=N1). I also seem to have lost the attachment point information through these conversions, which is likely to cause issues in the rest of the workflow.

Therefore, I am looking for a way to generate SMARTS strings like those used in the myexperiment.org example from my own sets of substituents. Obviously, doing this by hand is not an option. I would also like this workflow to use only the open-source nodes available in KNIME and not proprietary nodes (Schrödinger etc.).

Hopefully, someone can help me out with this. If you need my current workflow I am happy to upload that with the source files if required.

Thanks in advance for your help,

Stay safe and healthy!

-Antoine

There are 1 answers

Monomaniac On

What you're describing is template generation, which has been a long-standing area of work in reaction prediction and retrosynthesis within cheminformatics. I'm not particularly familiar with KNIME myself, though I know RDKit extensively. Your last option (3) is closest to what I'd consider a usable workflow. The way I would do this:

  1. Load the substitution pair molecules from SDF into RDKit mol objects.
  2. Export these RDKit mol objects as SMARTS strings with rdkit.Chem.MolToSmarts().
  3. Concatenate these strings into the form before_substructure>>after_substructure to generate a reaction SMARTS string.
  4. Load this reaction SMARTS string into a reaction object with rxn = rdkit.Chem.AllChem.ReactionFromSmarts().
  5. Use the rxn.RunReactants() method to generate your bioisosterically substituted products.
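
The steps above can be sketched in Python roughly as follows. This is a minimal sketch, not a definitive implementation: it skips the SDF loading of steps 1 and 2 and starts from the mapped SMARTS quoted in the question, the helper name substitute and the benzoic acid test input are my own assumptions, and the products are sanitized explicitly because RDKit returns them unsanitized.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Steps 1-2 would normally read the fragments from SDF and export SMARTS;
# here the two sides are the mapped SMARTS quoted in the question.
before_substructure = "[*:1][C:2]([OH])=O"   # carboxylic acid
after_substructure = "[*:1][C:2]1=NNN=N1"    # tetrazole

# Step 3: join the two sides into a reaction SMARTS string.
reaction_smarts = f"{before_substructure}>>{after_substructure}"

# Step 4: build the reaction object.
rxn = AllChem.ReactionFromSmarts(reaction_smarts)


def substitute(smiles):
    """Return the unique product SMILES for one input molecule (hypothetical helper)."""
    mol = Chem.MolFromSmiles(smiles)
    unique_products = set()
    # Step 5: RunReactants takes a tuple of reactants and returns a tuple
    # of product tuples, one per match of the reactant template.
    for product_set in rxn.RunReactants((mol,)):
        for product in product_set:
            Chem.SanitizeMol(product)  # products come back unsanitized
            unique_products.add(Chem.MolToSmiles(product))
    return sorted(unique_products)


products = substitute("OC(=O)c1ccccc1")  # benzoic acid, an arbitrary test input
print(products)
```

Collecting the SMILES in a set removes duplicates that can appear when the template matches the same group in more than one way. Inputs that do not contain the group simply give an empty list.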

The error you quote for the RDKit One Component Reaction node input cuts off just before the important information, unfortunately. Running rdkit.Chem.AllChem.ReactionFromSmarts("[#6](-[#8])=[#8]>>[#6]1:[#7H]:[#7]:[#7]:[#7]:1") produces no errors for me locally, which leads me to believe this is specific to the KNIME node functionality.

Note that the difference between [#6](-[#8])=[#8] and [*:1][C:2]([OH])=O is relatively minimal: the former represents an O-C=O substructure, the latter represents a ~COOH group. Within the square brackets of the latter, the :num refers to an optional 'atom map' number, which allows a one-to-one mapping of reactant and product atoms. For example, [C:1][C:3].[C:2][C:4]>>[C:1][C:3][C:4][C:2] allows you to track which carbon is which during a reaction, for situations where it may matter. The token [*:1] means "any atom" and is equivalent to a wavy line in organic chemistry (and it is mapped to #1).
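
To make the role of the atom maps concrete, here is a small sketch that reads the map numbers back from both reaction SMARTS and compares the size of the products they give. The benzoic acid input and the helper first_product_size are assumptions of mine for illustration. As far as I understand RDKit's behaviour, only mapped atoms carry the rest of the input molecule into the product, so the unmapped template should return just the bare replacement fragment.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mapped = AllChem.ReactionFromSmarts("[*:1][C:2]([OH])=O>>[*:1][C:2]1=NNN=N1")
unmapped = AllChem.ReactionFromSmarts(
    "[#6](-[#8])=[#8]>>[#6]1:[#7H]:[#7]:[#7]:[#7]:1"
)

# Atom map numbers on the reactant template; 0 means "unmapped".
mapped_numbers = [
    atom.GetAtomMapNum() for atom in mapped.GetReactantTemplate(0).GetAtoms()
]
unmapped_numbers = [
    atom.GetAtomMapNum() for atom in unmapped.GetReactantTemplate(0).GetAtoms()
]

benzoic_acid = Chem.MolFromSmiles("OC(=O)c1ccccc1")


def first_product_size(rxn, mol):
    """Atom count of the first product (hydrogens are implicit, so heavy atoms only)."""
    return rxn.RunReactants((mol,))[0][0].GetNumAtoms()


mapped_size = first_product_size(mapped, benzoic_acid)
unmapped_size = first_product_size(unmapped, benzoic_acid)
print(mapped_numbers, unmapped_numbers, mapped_size, unmapped_size)
```

If the sizes differ as expected, that is a practical reason to prefer the mapped form: the product keeps the scaffold only when the attachment atom appears, with the same map number, on both sides of the >>.
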
There are only two situations I can think of where [#6](-[#8])=[#8] and [*:1][C:2]([OH])=O might differ:

  • You have methanoic acid as a potential input for substitution (the former will match; the latter will not while hydrogens are implicit, because [*:1] needs an actual atom in the graph to match).
  • Inputs are over- or under-protonated (COO- != COOH).
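
These differences are easy to check directly with substructure matching. The sketch below is only illustrative: the four test molecules are my own choice, and the ester is included as an extra case because the generic pattern places no constraint on the hydrogen count of the single-bonded oxygen.

```python
from rdkit import Chem

generic = Chem.MolFromSmarts("[#6](-[#8])=[#8]")
specific = Chem.MolFromSmarts("[*:1][C:2]([OH])=O")

# Arbitrary test inputs chosen for illustration.
test_inputs = {
    "acetic acid": "CC(=O)O",
    "formic acid": "OC=O",
    "acetate": "CC(=O)[O-]",
    "methyl acetate": "CC(=O)OC",
}

# For each molecule, record (matches generic, matches specific).
matches = {}
for name, smiles in test_inputs.items():
    mol = Chem.MolFromSmiles(smiles)
    matches[name] = (
        mol.HasSubstructMatch(generic),
        mol.HasSubstructMatch(specific),
    )
    print(name, matches[name])
```

With default SMILES parsing (implicit hydrogens), only acetic acid matches both patterns; formic acid, acetate and methyl acetate match the generic pattern alone.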

Converting these reaction SMARTS to RDKit reaction objects and running them on input molecule objects should yield one or more substituted products for every input that contains the group. Note: Typically, in extensive projects, there will be some SMARTS templates that require some degree of manual intervention - indicating attachment points, specifying explicit hydrogens, etc. If you need any help or have any questions, don't hesitate to drop a comment and I'll do my best to help with specifics.