Docking Scores

Description: Docking is a theoretical evaluation of the affinity (the free energy change of the binding process) between a ligand (a small molecule) and a target (a protein involved in a disease pathway). A docking evaluation usually includes conformational sampling of the ligand and calculation of the free energy change. A molecule with higher affinity usually has a higher potential to possess higher bioactivity.

De novo molecular generation has largely focused on simple heuristic oracles, such as QED and LogP. Those oracles are either too easy to optimize or can produce unrealistic molecules. This is aptly summarized in Coley et al. (2019) [1]: “The current evaluations for generative models do not reflect the complexity of real discovery problems.” Recent work by Cieplinski et al. (2020) [2] makes the same point in its title: "We Should At Least Be Able To Design Molecules That Dock Well." We therefore include a meta oracle for molecular docking. We adopt the Python wrapper from pyscreener [3], which allows easy access to various docking software, including vina, smina, qvina2, psovina and DOCK6. Users can specify the target based on their own interests, and we also provide several typical oracle functions for the leaderboard.

Installation instruction: To use this oracle, first install pyscreener, its dependencies, and the external docking software you want to use. For vina-type software, ADFR Suite and one of the docking programs are required. For DOCK6, sphgen_cpp and chimera are needed in addition to DOCK6. All external software should be accessible from the $PATH variable (one can use export PATH=$PATH:#dir to bin#). Then, by passing the pyscreener directory to the oracle function, one can use the docking oracle. The installation instructions for pyscreener are at https://github.com/coleygroup/pyscreener. Local installation should take approximately 10 minutes. Please open an issue if you run into any problems.
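Before constructing the oracle, it can help to confirm that the external executables are actually visible on $PATH. The following is a minimal sketch using only the Python standard library. The helper name and the executable names passed to it are illustrative assumptions; replace them with the binaries of the software you installed.

```python
import shutil

def find_missing_executables(names):
    """Return the names that cannot be resolved through $PATH."""
    return [name for name in names if shutil.which(name) is None]

# The executable names below are assumptions for illustration only.
missing = find_missing_executables(['vina', 'prepare_receptor'])
if missing:
    print('Not found on PATH:', ', '.join(missing))
else:
    print('All executables found.')
```

If anything is reported as missing, extend $PATH as described above and rerun the check.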

from tdc import Oracle
# 1. One can specify the binding pocket by a docked pdb file
oracle = Oracle(name = 'Docking_Score', software='vina', 
                pyscreener_path = 'path/to/pyscreener', 
                receptors=['examples/docking/5WIU.pdb'], 
                docked_ligand_file='examples/docking/5WIU_with_ligand.pdb',
                buffer=10, path='./my_test/', num_worker=1, ncpu=4)

oracle('c1ccccc1')

# Docking: 100%|██████████| 1/1 [00:02<00:00,  2.69s/ligand]
# {'c1ccccc1': -4.4}

# 2. One can also specify the target and binding pocket with PDB ID and coordinates
oracle2 = Oracle(name = 'Docking_Score', software='vina',
                pyscreener_path = 'path/to/pyscreener', pdbids=['5WIU'], 
                center=(-18.2, 14.4, -16.1), size=(15.4, 13.9, 14.5),
                buffer=10, path='./my_test/', num_worker=1, ncpu=4)

# Note: the binding pocket can also be specified by residue indices; see the pyscreener documentation for details
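To give some intuition for how a search box relates to a docked ligand, the sketch below computes a box center and size from a set of ligand atom coordinates and pads it with a buffer. This is an illustration under our own assumptions: the function name, the convention of adding the buffer on each side, and the coordinates are all hypothetical. It is not pyscreener's implementation, so consult pyscreener for the exact meaning of the buffer argument.

```python
def box_from_coords(coords, buffer=10.0):
    """Axis-aligned box around 3D points, padded by `buffer` on each side (assumed convention)."""
    xs, ys, zs = zip(*coords)
    mins = (min(xs), min(ys), min(zs))
    maxs = (max(xs), max(ys), max(zs))
    center = tuple((lo + hi) / 2 for lo, hi in zip(mins, maxs))
    size = tuple((hi - lo) + 2 * buffer for lo, hi in zip(mins, maxs))
    return center, size

# Hypothetical ligand atom coordinates in angstroms.
ligand_atoms = [(-20.0, 12.0, -18.0), (-16.0, 16.0, -14.0), (-18.0, 14.0, -16.0)]
center, size = box_from_coords(ligand_atoms, buffer=5.0)
print(center)  # (-18.0, 14.0, -16.0)
print(size)    # (14.0, 14.0, 14.0)
```

The resulting tuples have the same shape as the center and size arguments used in the second example above.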

References:

[1] Coley, Connor W., Natalie S. Eyke, and Klavs F. Jensen. “Autonomous discovery in the chemical sciences part II: Outlook.” Angewandte Chemie International Edition 59.52 (2020): 23414-23436.

[2] Cieplinski, Tobiasz, et al. “We should at least be able to Design Molecules that Dock Well.” arXiv preprint arXiv:2006.16955 (2020).

[3] Graff, David E., Eugene I. Shakhnovich, and Connor W. Coley. “Accelerating high-throughput virtual screening through molecular pool-based active learning.” arXiv preprint arXiv:2012.07127 (2020).


ASKCOS

Description: Gao and Coley [1] have demonstrated that surrogate scoring models cannot sufficiently determine how easy a chemical is to obtain. Therefore, in addition to the SA oracle, we provide a score generated by full retrosynthetic pathway analysis. TDC includes interfaces for multiple types of retrosynthetic pathway analysis as oracles and provides flexible access to various results. ASKCOS (https://askcos.mit.edu) is the open-source software framework used in [1] that integrates efforts to generalize known chemistry to new substrates by learning to apply retrosynthetic transformations, to identify suitable reaction conditions, and to evaluate whether reactions are likely to be successful. The data-driven models are trained with the USPTO and Reaxys databases.

Installation instruction: Users can first deploy ASKCOS on their own server following the ASKCOS instructions (https://github.com/connorcoley/ASKCOS), and then access the server with our oracle function. One can also use cloud resources such as Google Cloud Platform, which is recommended by the authors. Note that it may take 5-10 minutes after deployment for the retro transformer workers to start up. One can check their startup status by looking at "server status". The whole deployment process on a Google Cloud virtual machine should take about 20 minutes.

To keep TDC lightweight and to respect the intellectual property of the retrosynthetic analysis software, we use the API access provided by such software and require additional input to the oracle function.

from tdc import Oracle
askcos = Oracle(name = 'ASKCOS')
smiles = 'CCOCCOCC'
host_ip = 'http://xx.xx.xxx.xxx'
askcos(smiles, host_ip, output='plausibility')
# 0.942
askcos(smiles, host_ip, output='num_step')
# 3

'''
You can also specify all the parameters of the retrosynthetic analysis through the function:

askcos(smiles, host_ip, output='plausibility', save_json=False, file_name='tree_builder_result.json', num_trials=5,
           max_depth=9, max_branching=25, expansion_time=60, max_ppg=100, template_count=1000, max_cum_prob=0.999, 
           chemical_property_logic='none', max_chemprop_c=0, max_chemprop_n=0, max_chemprop_o=0, max_chemprop_h=0, 
           chemical_popularity_logic='none', min_chempop_reactants=5, min_chempop_products=5, filter_threshold=0.1, return_first='true')
'''

References:

[1] Gao, Wenhao, and Connor W. Coley. “The synthesizability of molecules proposed by generative models.” Journal of Chemical Information and Modeling (2020).

[2] Coley, Connor W., et al. “A robotic platform for flow synthesis of organic compounds informed by AI planning.” Science 365.6453 (2019): eaax1566.


Molecule.one

Description: The Molecule.one API estimates the synthetic accessibility of a molecule based on a number of factors, including the number of steps in the predicted synthesis plan and the cost of the starting materials. Currently, the API token can be requested from the Molecule.one website and is provided on a one-to-one basis for research use. We are working with Molecule.one on providing more open access in the near future.

Installation instruction:

  • Create an account at this link. Grab the API Token in your profile page.
  • Install molecule.one by pip install git+https://github.com/molecule-one/m1wrapper-python
from tdc import Oracle
m1 = Oracle(name = 'Molecule One Synthesis', api_token = 'XXXXX')
smiles = ['[H][C@@]12OC3=C(O)C=CC4=C3[C@@]11CCN(C)[C@]([H])(C4)[C@]1([H])C=C[C@@H]2O', 
        'CC(=O)NC1=CC=C(O)C=C1']
m1(smiles)
'''
{'[H][C@@]12OC3=C(O)C=CC4=C3[C@@]11CCN(C)[C@]([H])(C4)[C@]1([H])C=C[C@@H]2O': '10.000',
 'CC(=O)NC1=CC=C(O)C=C1': '1.1693'}
'''
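In the example output above, the scores come back as strings. The sketch below converts such a result to floats and sorts the molecules. It works on a mock dictionary with placeholder keys instead of calling the API, and the helper name is our own. Reading a lower score as "easier to synthesize" is an assumption inferred from the example values, not a documented guarantee.

```python
def rank_by_score(result):
    """Convert string scores to floats and sort in ascending order."""
    scores = {smiles: float(value) for smiles, value in result.items()}
    return sorted(scores.items(), key=lambda item: item[1])

# Mock result in the same shape as the output shown above (placeholder keys).
mock_result = {'mol_A': '10.000', 'mol_B': '1.1693'}
for smiles, score in rank_by_score(mock_result):
    print(smiles, score)
```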

References:

[1] Sacha, Mikołaj, Mikołaj Błaż, Piotr Byrski, Paweł Włodarczyk-Pruszyński, and Stanisław Jastrzębski. “Molecule Edit Graph Attention Network: Modeling Chemical Reactions as Sequences of Graph Edits.” arXiv:2006.15426 (2020).

[2] Liu, Cheng-Hao, et al. “RetroGNN: Approximating Retrosynthesis by Graph Neural Networks for De Novo Drug Design.” arXiv:2011.13042 (2020).


IBM RXN Synthetic Accessibility

Description: IBM RXN (https://rxn.res.ibm.com) is an AI platform integrating forward reaction prediction and retrosynthetic analysis. The backend of the IBM RXN retrosynthetic analysis is the Molecular Transformer model [1]. The model was mainly trained with the USPTO and Pistachio databases. To keep TDC lightweight and to respect the intellectual property of the retrosynthetic analysis software, we use the API access provided by such software and require additional input to the oracle function.

  • Create an account at this link. Grab the API Token in your profile page.
  • Install IBM RXN by pip install rxn4chemistry
from tdc import Oracle
oracle = Oracle(name = 'IBM_RXN')
smiles = 'CCOCCOCC'
key = 'apk-c9db......' # You can obtain a key from https://rxn.res.ibm.com
oracle(smiles, key)
# 0.983
oracle(smiles, key, output='result')
# {'retrosynthetic_paths': [{'id': '5fb1c4a98937a9000127a345',
#    'metadata': {},
#    'embed': {},
#    'computedFields': {},
#    'createdOn': 1605485737424,
#    'createdBy': 'system',
#    'modifiedOn': 1605485737424,
#    'modifiedBy': 'system',
#    'moleculeId': '5fb1c2078937a90001279fa0',
#    'retrosynthesisId': '5fb1c4a48937a9000127a336',
#    'sequenceId': '5fb1c4a98937a9000127a340',
#    'projectId': '5fb1c4868937a9000127a320',
#    'smiles': 'CCOCCOCC',
#    'confidence': 0.983,
#     ......
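When output='result' is used, the returned dictionary has to be inspected by hand. The sketch below pulls the highest path confidence out of a dictionary shaped like the truncated output above. It runs on a mock dictionary that uses only keys visible in that output, and the helper name is hypothetical. How this value relates to the scalar returned by default is not specified here.

```python
def best_confidence(result):
    """Return the highest 'confidence' among the retrosynthetic paths, or None."""
    paths = result.get('retrosynthetic_paths', [])
    confidences = [path['confidence'] for path in paths if 'confidence' in path]
    return max(confidences) if confidences else None

# Mock result using only keys that appear in the output shown above.
mock = {'retrosynthetic_paths': [{'smiles': 'CCOCCOCC', 'confidence': 0.983}]}
print(best_confidence(mock))  # 0.983
```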

References:

[1] Schwaller, Philippe, et al. “Molecular transformer: A model for uncertainty-calibrated chemical reaction prediction.” ACS central science 5.9 (2019): 1572-1583.


Glycogen Synthase Kinase 3 Beta (GSK3β)

Description: Glycogen synthase kinase 3 beta, also known as GSK3β, is an enzyme that in humans is encoded by the GSK3B gene. Abnormal regulation and expression of GSK3β is associated with an increased susceptibility towards bipolar disorder. The oracle is a random forest classifier trained on ECFP6 fingerprints using the ExCAPE-DB dataset.

from tdc import Oracle
oracle = Oracle(name = 'GSK3B')
oracle(['CC(C)(C)[C@H]1CCc2c(sc(NC(=O)COc3ccc(Cl)cc3)c2C(N)=O)C1', \
        'CCNC(=O)c1ccc(NC(=O)N2CC[C@H](C)[C@H](O)C2)c(C)c1', \
        'C[C@@H]1CCN(C(=O)CCCc2ccccc2)C[C@@H]1O'])
# [0.03, 0.0, 0.0]

References:

[1] Li, Yibo, Liangren Zhang, and Zhenming Liu. “Multi-objective de novo drug design with conditional graph generative model.” Journal of cheminformatics 10.1 (2018): 33.

[2] Jin, Wengong, Regina Barzilay, and Tommi Jaakkola. “Multi-objective molecule generation using interpretable substructures.” ICML. 2020.

[3] Sun, Jiangming, et al. “ExCAPE-DB: an integrated large scale dataset facilitating Big Data analysis in chemogenomics.” Journal of cheminformatics 9.1 (2017): 17.


c-Jun N-terminal Kinases-3 (JNK3)

Description: c-Jun N-terminal Kinases-3 (JNK3) belongs to the mitogen-activated protein kinase family, and is responsive to stress stimuli, such as cytokines, ultraviolet irradiation, heat shock, and osmotic shock. The oracle is a random forest classifier trained on ECFP6 fingerprints using the ExCAPE-DB dataset.

from tdc import Oracle
oracle = Oracle(name = 'JNK3')
oracle(['CC(C)(C)[C@H]1CCc2c(sc(NC(=O)COc3ccc(Cl)cc3)c2C(N)=O)C1', \
        'CCNC(=O)c1ccc(NC(=O)N2CC[C@H](C)[C@H](O)C2)c(C)c1', \
        'C[C@@H]1CCN(C(=O)CCCc2ccccc2)C[C@@H]1O'])
# [0.01, 0.0, 0.01]

References:

[1] Li, Yibo, Liangren Zhang, and Zhenming Liu. “Multi-objective de novo drug design with conditional graph generative model.” Journal of cheminformatics 10.1 (2018): 33.

[2] Jin, Wengong, Regina Barzilay, and Tommi Jaakkola. “Multi-objective molecule generation using interpretable substructures.” ICML. 2020.

[3] Sun, Jiangming, et al. “ExCAPE-DB: an integrated large scale dataset facilitating Big Data analysis in chemogenomics.” Journal of cheminformatics 9.1 (2017): 17.


Dopamine Receptor D2 (DRD2)

Description: DRD2 stands for dopamine type 2 receptor. The oracle was constructed by Olivecrona et al., using a support vector machine classifier with a Gaussian kernel and ECFP6 fingerprints on the ExCAPE-DB dataset.

from tdc import Oracle
oracle = Oracle(name = 'DRD2')
oracle(['CC(C)(C)[C@H]1CCc2c(sc(NC(=O)COc3ccc(Cl)cc3)c2C(N)=O)C1', \
        'CCNC(=O)c1ccc(NC(=O)N2CC[C@H](C)[C@H](O)C2)c(C)c1', \
        'C[C@@H]1CCN(C(=O)CCCc2ccccc2)C[C@@H]1O'])
# [0.0015465365340340924, 0.0023541754878916416, 0.004715407010872501]

References:

[1] Jin, Wengong, et al. “Learning multimodal graph-to-graph translation for molecular optimization.” ICLR (2019).

[2] Olivecrona, Marcus, et al. “Molecular de-novo design through deep reinforcement learning.” Journal of cheminformatics 9.1 (2017): 48.

[3] Sun, Jiangming, et al. “ExCAPE-DB: an integrated large scale dataset facilitating Big Data analysis in chemogenomics.” Journal of cheminformatics 9.1 (2017): 17.


Synthetic Accessibility (SA)

Description: The Synthetic Accessibility score measures how hard or how easy it is to synthesize a given molecule, based on a combination of the molecule’s fragment contributions. The oracle is calculated via RDKit, using a set of chemical rules defined by Ertl et al.

from tdc import Oracle
oracle = Oracle(name = 'SA')
oracle(['CC(C)(C)[C@H]1CCc2c(sc(NC(=O)COc3ccc(Cl)cc3)c2C(N)=O)C1', \
        'CCNC(=O)c1ccc(NC(=O)N2CC[C@H](C)[C@H](O)C2)c(C)c1', \
        'C[C@@H]1CCN(C(=O)CCCc2ccccc2)C[C@@H]1O'])
# [2.706977149048555, 2.8548373344538067, 2.659973244931228]

References:

[1] Polykovskiy et al. “Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models.”, Frontiers in Pharmacology. (2020).

[2] Ertl, Peter, and Ansgar Schuffenhauer. “Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions.” Journal of cheminformatics 1.1 (2009): 8.


Quantitative Estimate of Drug-likeness (QED)

Description: QED stands for Quantitative Estimate of Drug-likeness. The oracle is calculated via RDKit, using a set of chemical rules about drug-likeness defined by Bickerton et al.

from tdc import Oracle
oracle = Oracle(name = 'QED')
oracle(['CC(C)(C)[C@H]1CCc2c(sc(NC(=O)COc3ccc(Cl)cc3)c2C(N)=O)C1', \
        'CCNC(=O)c1ccc(NC(=O)N2CC[C@H](C)[C@H](O)C2)c(C)c1', \
        'C[C@@H]1CCN(C(=O)CCCc2ccccc2)C[C@@H]1O'])
# [0.7369335974098526, 0.7965866720151891, 0.9026967965647689]

References:

[1] Jin, Wengong, et al. “Learning multimodal graph-to-graph translation for molecular optimization.” ICLR (2019).

[2] Bickerton, G. Richard, et al. “Quantifying the chemical beauty of drugs.” Nature chemistry 4.2 (2012): 90-98.

[3] Sun, Jiangming, et al. “ExCAPE-DB: an integrated large scale dataset facilitating Big Data analysis in chemogenomics.” Journal of cheminformatics 9.1 (2017): 17.


Octanol-water Partition Coefficient (LogP)

Description: The penalized logP score measures the solubility and synthetic accessibility of a compound. The oracle is calculated via RDKit.

from tdc import Oracle
oracle = Oracle(name = 'LogP')
oracle(['CC(C)(C)[C@H]1CCc2c(sc(NC(=O)COc3ccc(Cl)cc3)c2C(N)=O)C1', \
        'CCNC(=O)c1ccc(NC(=O)N2CC[C@H](C)[C@H](O)C2)c(C)c1', \
        'C[C@@H]1CCN(C(=O)CCCc2ccccc2)C[C@@H]1O'])
# [2.126496327138913, 0.073949389117486, 0.48850176431612924]

References:

[1] Jin, Wengong, et al. “Learning multimodal graph-to-graph translation for molecular optimization.” ICLR (2019).

[2] Kusner, Matt J., Brooks Paige, and José Miguel Hernández-Lobato. “Grammar variational autoencoder.” ICML 2017.


Rediscovery

Description: This oracle aims to rediscover the target molecules Celecoxib, Troglitazone, and Thiothixene. Specifically, it aims for the generated molecule to have high Tanimoto similarity with the target molecule (e.g., Celecoxib). From the GuacaMol benchmark.

from tdc import Oracle
oracle = Oracle(name = 'Rediscovery')
oracle(['CC(C)(C)[C@H]1CCc2c(sc(NC(=O)COc3ccc(Cl)cc3)c2C(N)=O)C1', \
        'CCNC(=O)c1ccc(NC(=O)N2CC[C@H](C)[C@H](O)C2)c(C)c1', \
        'C[C@@H]1CCN(C(=O)CCCc2ccccc2)C[C@@H]1O'])
# {'Celecoxib': [0.14728682170542637, 0.11666666666666667, 0.09649122807017543], 'Troglitazone': [0.24427480916030533,  0.14615384615384616, 0.12903225806451613], 'Thiothixene': [0.17391304347826086, 0.15625, 0.17796610169491525]}

Note: You can also access an individual oracle in the set. For example,

from tdc import Oracle
oracle = Oracle(name = 'Celecoxib_Rediscovery')
oracle(['CC(C)(C)[C@H]1CCc2c(sc(NC(=O)COc3ccc(Cl)cc3)c2C(N)=O)C1', \
        'CCNC(=O)c1ccc(NC(=O)N2CC[C@H](C)[C@H](O)C2)c(C)c1', \
        'C[C@@H]1CCN(C(=O)CCCc2ccccc2)C[C@@H]1O'])
# [0.14728682170542637, 0.11666666666666667, 0.09649122807017543]

TDC also provides an oracle that takes any SMILES string that users want to rediscover. For example,

from tdc import Oracle
oracle = Oracle(name = 'Rediscovery_Meta', target_smiles = 'CC(=O)OC1=CC=CC=C1C(=O)O')
oracle(['CC(C)(C)[C@H]1CCc2c(sc(NC(=O)COc3ccc(Cl)cc3)c2C(N)=O)C1', \
        'CCNC(=O)c1ccc(NC(=O)N2CC[C@H](C)[C@H](O)C2)c(C)c1', \
        'C[C@@H]1CCN(C(=O)CCCc2ccccc2)C[C@@H]1O'])
# [0.16666666666666666, 0.18072289156626506, 0.2191780821917808]
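For readers unfamiliar with the metric, Tanimoto similarity between two fingerprints is the number of shared features divided by the number of features present in either one. The sketch below applies that definition to toy feature sets. In the actual oracle the fingerprints are computed from the molecules, so the integer sets here are made-up stand-ins.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint features."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

# Toy feature sets standing in for molecular fingerprints.
fp_target = {1, 5, 9, 12}
fp_candidate = {1, 5, 7}
print(tanimoto(fp_target, fp_candidate))  # 2 shared / 5 in total = 0.4
```

A score of 1.0 means the two feature sets are identical, which is what a rediscovery task rewards.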

References:

Brown, Nathan, et al. “GuacaMol: benchmarking models for de novo molecular design.” Journal of chemical information and modeling 59.3 (2019): 1096-1108.


Similarity/Dissimilarity

Description: This oracle aims to generate molecules similar/dissimilar to Aripiprazole/Albuterol/Mestranol. Note that these molecules should be removed from the training set. From the GuacaMol benchmark.

from tdc import Oracle
oracle = Oracle(name = 'Similarity')
oracle(['CC(C)(C)[C@H]1CCc2c(sc(NC(=O)COc3ccc(Cl)cc3)c2C(N)=O)C1', \
        'CCNC(=O)c1ccc(NC(=O)N2CC[C@H](C)[C@H](O)C2)c(C)c1', \
        'C[C@@H]1CCN(C(=O)CCCc2ccccc2)C[C@@H]1O'])
# {'Aripiprazole': [0.5356125356125356, 0.3908045977011494, 0.39143730886850153], 'Albuterol': [0.2772277227722772, 0.38095238095238093, 0.3589743589743589], 'Mestranol': [0.19460880999342536, 0.2567901234567901, 0.2612872238232469]}

Note: You can also access an individual oracle in the set. For example,

from tdc import Oracle
oracle = Oracle(name = 'Aripiprazole_Similarity')
oracle(['CC(C)(C)[C@H]1CCc2c(sc(NC(=O)COc3ccc(Cl)cc3)c2C(N)=O)C1', \
        'CCNC(=O)c1ccc(NC(=O)N2CC[C@H](C)[C@H](O)C2)c(C)c1', \
        'C[C@@H]1CCN(C(=O)CCCc2ccccc2)C[C@@H]1O'])
# [0.5356125356125356, 0.3908045977011494, 0.39143730886850153]

TDC also provides an oracle that takes any SMILES string that users want the generated molecules to be similar/dissimilar to. For example,

from tdc import Oracle
oracle = Oracle(name = 'Similarity_Meta', target_smiles = 'CC(=O)OC1=CC=CC=C1C(=O)O')
oracle(['CC(C)(C)[C@H]1CCc2c(sc(NC(=O)COc3ccc(Cl)cc3)c2C(N)=O)C1', \
        'CCNC(=O)c1ccc(NC(=O)N2CC[C@H](C)[C@H](O)C2)c(C)c1', \
        'C[C@@H]1CCN(C(=O)CCCc2ccccc2)C[C@@H]1O'])
# [0.23076923076923078, 0.1951219512195122, 0.2361111111111111]

References:

Brown, Nathan, et al. “GuacaMol: benchmarking models for de novo molecular design.” Journal of chemical information and modeling 59.3 (2019): 1096-1108.


Median Molecules

Description: This oracle aims to generate molecules that simultaneously maximize similarities with several molecules. From the GuacaMol benchmark.

from tdc import Oracle
oracle = Oracle(name = 'Median')
oracle(['CC(C)(C)[C@H]1CCc2c(sc(NC(=O)COc3ccc(Cl)cc3)c2C(N)=O)C1', \
        'CCNC(=O)c1ccc(NC(=O)N2CC[C@H](C)[C@H](O)C2)c(C)c1', \
        'C[C@@H]1CCN(C(=O)CCCc2ccccc2)C[C@@H]1O'])
# {'Median 1': [0.09722243533981723, 0.14166129393101462, 0.12765694770084507], 'Median 2': [0.12259690287307903, 0.11470387424947118, 0.11491261514365983]}

Note: You can also access an individual oracle in the set. For example,

from tdc import Oracle
oracle = Oracle(name = 'Median 1')
oracle(['CC(C)(C)[C@H]1CCc2c(sc(NC(=O)COc3ccc(Cl)cc3)c2C(N)=O)C1', \
        'CCNC(=O)c1ccc(NC(=O)N2CC[C@H](C)[C@H](O)C2)c(C)c1', \
        'C[C@@H]1CCN(C(=O)CCCc2ccccc2)C[C@@H]1O'])
# [0.09722243533981723, 0.14166129393101462, 0.12765694770084507]

TDC also provides an oracle that takes any two SMILES strings that users want to simultaneously maximize similarities with. For example,

from tdc import Oracle
tadalafil_smiles = 'O=C1N(CC(N2C1CC3=C(C2C4=CC5=C(OCO5)C=C4)NC6=C3C=CC=C6)=O)C'
sildenafil_smiles = 'CCCC1=NN(C2=C1N=C(NC2=O)C3=C(C=CC(=C3)S(=O)(=O)N4CCN(CC4)C)OCC)C'
oracle = Oracle(name = 'Median_Meta', target_smiles = (tadalafil_smiles, sildenafil_smiles))
oracle(['CC(C)(C)[C@H]1CCc2c(sc(NC(=O)COc3ccc(Cl)cc3)c2C(N)=O)C1', \
        'CCNC(=O)c1ccc(NC(=O)N2CC[C@H](C)[C@H](O)C2)c(C)c1', \
        'C[C@@H]1CCN(C(=O)CCCc2ccccc2)C[C@@H]1O'])
# [0.12259690287307903, 0.11470387424947118, 0.11491261514365983]
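To our understanding, GuacaMol combines the two similarities in a median-molecule task with a geometric mean, which rewards molecules that resemble both targets over molecules that resemble only one. The sketch below illustrates that effect with made-up similarity values. Treat the aggregation choice as an assumption and check the GuacaMol source for the exact definition.

```python
import math

def geometric_mean(values):
    """Geometric mean of positive numbers; 0.0 if the list is empty or any value is not positive."""
    if not values or any(v <= 0 for v in values):
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Made-up similarities to two targets; both pairs have the same arithmetic mean.
balanced = geometric_mean([0.5, 0.5])   # about 0.5
lopsided = geometric_mean([0.9, 0.1])   # about 0.3
print(balanced, lopsided)
```

The lopsided candidate scores lower even though its average similarity is the same, which is the behavior a "median" objective is after.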

References:

Brown, Nathan, et al. “GuacaMol: benchmarking models for de novo molecular design.” Journal of chemical information and modeling 59.3 (2019): 1096-1108.


Isomers

Description: This oracle aims to generate molecules that correspond to a target molecular formula (e.g., C7H8N2O2). It assesses the flexibility of the model to generate molecules following a simple pattern (which is a priori unknown). From the GuacaMol benchmark.

from tdc import Oracle
oracle = Oracle(name = 'Isomers')
oracle(['CC(C)(C)[C@H]1CCc2c(sc(NC(=O)COc3ccc(Cl)cc3)c2C(N)=O)C1', \
        'CCNC(=O)c1ccc(NC(=O)N2CC[C@H](C)[C@H](O)C2)c(C)c1', \
        'C[C@@H]1CCN(C(=O)CCCc2ccccc2)C[C@@H]1O'])
# {'c7h8n2o2': [7.077155389805107e-22, 9.454886273886542e-18, 3.7105915150029394e-14], 'c9h10n2o2pf2cl': [3.775134544279098e-11, 4.944450501938644e-09, 1.1793585051615319e-07]}

Note: You can also access an individual oracle in the set. For example,

from tdc import Oracle
oracle = Oracle(name = 'Isomers_C7H8N2O2')
oracle(['CC(C)(C)[C@H]1CCc2c(sc(NC(=O)COc3ccc(Cl)cc3)c2C(N)=O)C1', \
        'CCNC(=O)c1ccc(NC(=O)N2CC[C@H](C)[C@H](O)C2)c(C)c1', \
        'C[C@@H]1CCN(C(=O)CCCc2ccccc2)C[C@@H]1O'])
# [7.077155389805107e-22, 9.454886273886542e-18, 3.7105915150029394e-14]

TDC also provides an oracle that takes any SMILES string, converts it to its chemical formula, and uses that formula as the target for comparison. For example,

from tdc import Oracle
oracle = Oracle(name = 'Isomers_Meta', target_smiles = 'CC(=O)OC1=CC=CC=C1C(=O)O')
oracle(['CC(C)(C)[C@H]1CCc2c(sc(NC(=O)COc3ccc(Cl)cc3)c2C(N)=O)C1', \
        'CCNC(=O)c1ccc(NC(=O)N2CC[C@H](C)[C@H](O)C2)c(C)c1', \
        'C[C@@H]1CCN(C(=O)CCCc2ccccc2)C[C@@H]1O'])
# [4.632351332478028e-57, 5.853717984129625e-37, 3.120771099829009e-32]
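The very small scores above follow from how an isomer score penalizes every deviation in atom counts. The sketch below shows one way such a score can be built: a Gaussian term per target element plus one for the total atom count, combined with a geometric mean. The function names and the widths (1.0 per element, 2.0 for the total) are assumptions made for illustration, and the sketch compares formula strings directly instead of deriving them from SMILES. It is not the GuacaMol or TDC implementation.

```python
import math
import re

def parse_formula(formula):
    """Parse a simple formula such as 'C7H8N2O2' into {'C': 7, 'H': 8, 'N': 2, 'O': 2}."""
    counts = {}
    for element, number in re.findall(r'([A-Z][a-z]?)(\d*)', formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts

def log_gaussian(x, mu, sigma):
    """Log of an unnormalized Gaussian centered at mu."""
    return -0.5 * ((x - mu) / sigma) ** 2

def isomer_score(candidate_formula, target_formula):
    """Geometric mean of per-element and total-atom-count Gaussian terms (assumed widths)."""
    target = parse_formula(target_formula)
    candidate = parse_formula(candidate_formula)
    # One term per element of the target formula.
    log_terms = [log_gaussian(candidate.get(el, 0), n, 1.0) for el, n in target.items()]
    # One extra term for the total number of atoms.
    log_terms.append(log_gaussian(sum(candidate.values()), sum(target.values()), 2.0))
    # Averaging in log space is the geometric mean and avoids log(0).
    return math.exp(sum(log_terms) / len(log_terms))

print(isomer_score('C7H8N2O2', 'C7H8N2O2'))  # 1.0 for an exact match
print(isomer_score('C9H8N2O2', 'C7H8N2O2'))  # lower: two extra carbons
```

Because the penalty grows with the square of each count difference, molecules far from the target formula quickly receive scores close to zero.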

References:

Brown, Nathan, et al. “GuacaMol: benchmarking models for de novo molecular design.” Journal of chemical information and modeling 59.3 (2019): 1096-1108.


Multi-Property Objective (MPO)

Description: This oracle measures multiple physicochemical properties of known drugs, so each drug corresponds to a multi-property objective. It contains seven drugs (Osimertinib, Fexofenadine, Ranolazine, Perindopril, Amlodipine, Sitagliptin, Zaleplon), each with its own set of objectives. From the GuacaMol benchmark.

from tdc import Oracle
oracle = Oracle(name = 'MPO')
oracle(['CC(C)(C)[C@H]1CCc2c(sc(NC(=O)COc3ccc(Cl)cc3)c2C(N)=O)C1', \
        'CCNC(=O)c1ccc(NC(=O)N2CC[C@H](C)[C@H](O)C2)c(C)c1', \
        'C[C@@H]1CCN(C(=O)CCCc2ccccc2)C[C@@H]1O'])
# {'Osimertinib': [0.09011742702110873, 0.4083890176872189, 0.0069208742335098465], 'Fexofenadine': [0.4336446174984538, 0.5101327504385935, 0.01074314980818085], 'Ranolazine': [0.29285467466584664, 0.027222138370807142,  0.015384988076712304], 'Perindopril': [0.36023741111440966, 0.1540877417148235, 0.13584848674330968], 'Amlodipine': [0.461083967620704, 0.15454027643871737, 0.15152116723579184], 'Sitagliptin': [0.00562486906491877,  0.008394273324064522, 0.0036371294214424814], 'Zaleplon': [7.752152611462035e-05, 8.370947134491376e-05, 1.3261169904325478e-05]}

Note: You can also access an individual oracle in the set. For example,

from tdc import Oracle
oracle = Oracle(name = 'Osimertinib_MPO')
oracle(['CC(C)(C)[C@H]1CCc2c(sc(NC(=O)COc3ccc(Cl)cc3)c2C(N)=O)C1', \
        'CCNC(=O)c1ccc(NC(=O)N2CC[C@H](C)[C@H](O)C2)c(C)c1', \
        'C[C@@H]1CCN(C(=O)CCCc2ccccc2)C[C@@H]1O'])
# [0.09011742702110873, 0.4083890176872189, 0.0069208742335098465]

References:

Brown, Nathan, et al. “GuacaMol: benchmarking models for de novo molecular design.” Journal of chemical information and modeling 59.3 (2019): 1096-1108.


Valsartan SMARTS

Description: The valsartan SMARTS benchmark targets molecules containing a SMARTS pattern related to valsartan while being characterized by physicochemical properties corresponding to the sitagliptin molecule. From the GuacaMol benchmark.

from tdc import Oracle
oracle = Oracle(name = 'Valsartan_SMARTS')
oracle(['CC(C)(C)[C@H]1CCc2c(sc(NC(=O)COc3ccc(Cl)cc3)c2C(N)=O)C1', \
        'CCNC(=O)c1ccc(NC(=O)N2CC[C@H](C)[C@H](O)C2)c(C)c1', \
        'C[C@@H]1CCN(C(=O)CCCc2ccccc2)C[C@@H]1O'])
# [0.0, 0.0, 0.0]

References:

Brown, Nathan, et al. “GuacaMol: benchmarking models for de novo molecular design.” Journal of chemical information and modeling 59.3 (2019): 1096-1108.


Hop

Description: The Scaffold Hop and Decorator Hop benchmarks aim to maximize the similarity to a SMILES string while keeping or excluding specific SMARTS patterns. They mimic two tasks: changing the scaffold of a compound while keeping specific substituents, and keeping a scaffold fixed while changing the substitution pattern. From the GuacaMol benchmark.

from tdc import Oracle
oracle = Oracle(name = 'Hop')
oracle(['CC(C)(C)[C@H]1CCc2c(sc(NC(=O)COc3ccc(Cl)cc3)c2C(N)=O)C1', \
        'CCNC(=O)c1ccc(NC(=O)N2CC[C@H](C)[C@H](O)C2)c(C)c1', \
        'C[C@@H]1CCN(C(=O)CCCc2ccccc2)C[C@@H]1O'])
# {'Deco Hop': [0.5338365434669443, 0.5200860832137733, 0.5038648836670017], 'Scaffold Hop': [0.38446411012782694, 0.36368563685636857, 0.3391736019856913]}

Note: You can also access an individual oracle in the set. For example,

from tdc import Oracle
oracle = Oracle(name = 'Scaffold Hop')
oracle(['CC(C)(C)[C@H]1CCc2c(sc(NC(=O)COc3ccc(Cl)cc3)c2C(N)=O)C1', \
        'CCNC(=O)c1ccc(NC(=O)N2CC[C@H](C)[C@H](O)C2)c(C)c1', \
        'C[C@@H]1CCN(C(=O)CCCc2ccccc2)C[C@@H]1O'])
# [0.38446411012782694, 0.36368563685636857, 0.3391736019856913]

References:

Brown, Nathan, et al. “GuacaMol: benchmarking models for de novo molecular design.” Journal of chemical information and modeling 59.3 (2019): 1096-1108.