Paired Molecule Generation Task Overview

Definition: Paired molecule generation operates on a set of molecule pairs (X, Y), where Y is a "paraphrase" of X: a structurally similar molecule with a more desirable chemical property. In other words, the machine learning model aims to translate an input molecule X into a similar molecule Y with a better property value.

Impact: Lead optimization is a crucial step in drug discovery that consumes considerable time and many rounds of trial and error. After a hit is identified via high-throughput screening, similar but enhanced candidates are created and tested to find a lead compound with better properties than the original hit. Paired molecule generation can automate and accelerate this process.

Generalization: The generated molecules must achieve superior properties across a range of structurally diverse input drugs. In addition, they must satisfy other basic requirements, such as synthesizability and low off-target effects.

Product: Small-molecule.

Pipeline: Efficacy and safety - lead development and optimization.

DRD2

Dataset Description: DRD2 stands for dopamine type 2 receptor. The model needs to translate inactive compounds (p < 0.05) into active compounds (p ≥ 0.5), where p is the bioactivity score assigned by a property prediction model that serves as the oracle. The dataset is curated from ZINC.
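
The activity thresholds above can be written as a simple check. The following is a minimal sketch and not TDC code: `is_drd2_pair` is a hypothetical helper, and the scores passed to it are assumed to come from whatever oracle model is in use.

```python
# Hypothetical helper (not part of TDC): applies the DRD2 thresholds stated
# above to oracle scores that are assumed to be computed elsewhere.
INACTIVE_BELOW = 0.05  # source molecule X must score p < 0.05
ACTIVE_FROM = 0.5      # target molecule Y must score p >= 0.5


def is_drd2_pair(p_x: float, p_y: float) -> bool:
    """Return True if X counts as inactive and Y counts as active."""
    return p_x < INACTIVE_BELOW and p_y >= ACTIVE_FROM


# Made-up (p_x, p_y) scores, for illustration only.
examples = [(0.01, 0.73), (0.01, 0.30), (0.20, 0.90)]
results = [is_drd2_pair(p_x, p_y) for p_x, p_y in examples]
print(results)
```

Only the first example satisfies both conditions: the second output is not active enough, and the third input is not inactive to begin with.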

Task Description: Given a molecule X, translate it into another molecule Y with higher activity against DRD2.

Dataset Statistics: 34,404 molecule pairs.

Dataset Split: Random Split

from tdc.generation import PairMolGen
# Load the DRD2 molecule pairs, then split them.
data = PairMolGen(name='DRD2')
split = data.get_split()
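
To illustrate what a random split of molecule pairs involves, here is a minimal standard-library sketch. It is not TDC's implementation: `random_split`, the 70/10/20 fractions, and the seed are assumptions chosen for illustration, and the toy pairs are placeholders rather than real molecules.

```python
import random


def random_split(pairs, frac=(0.7, 0.1, 0.2), seed=42):
    """Shuffle (X, Y) pairs and cut them into train/valid/test lists.

    The fractions and seed are illustrative assumptions, not values
    taken from the dataset description.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(frac[0] * len(shuffled))
    n_valid = round(frac[1] * len(shuffled))
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        "test": shuffled[n_train + n_valid:],
    }


# Placeholder pairs; real entries would be molecule strings.
toy_pairs = [(f"X{i}", f"Y{i}") for i in range(10)]
toy_split = random_split(toy_pairs)
print({name: len(part) for name, part in toy_split.items()})
```

Each pair is kept intact and assigned to exactly one subset, so an input molecule never appears without its paired output.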

References:

[1] Jin, Wengong, et al. “Learning multimodal graph-to-graph translation for molecular optimization.” ICLR (2019).

[2] Olivecrona, Marcus, et al. “Molecular de-novo design through deep reinforcement learning.” Journal of cheminformatics 9.1 (2017): 48.

[3] Sterling, Teague, and John J. Irwin. “ZINC 15–ligand discovery for everyone.” Journal of chemical information and modeling 55.11 (2015): 2324-2337.

Dataset License: Not Specified. CC BY 4.0.

QED

Dataset Description: QED stands for Quantitative Estimate of Drug-likeness. The model needs to translate molecules with QED scores within the range [0.7, 0.8] into the higher range [0.9, 1.0]. The dataset is curated from ZINC.
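
The two QED ranges above suggest a simple way to score a batch of generated molecules. This is a minimal sketch under stated assumptions: `in_range` and `target_hit_rate` are hypothetical helpers, the QED values are made up, and computing QED itself is assumed to happen elsewhere (for example, in a cheminformatics toolkit).

```python
# Hypothetical helpers (not part of TDC). QED scores are assumed to be
# computed elsewhere; only the range checks from the text are shown.
SOURCE_RANGE = (0.7, 0.8)  # QED range of the input molecules X
TARGET_RANGE = (0.9, 1.0)  # QED range the output molecules Y should reach


def in_range(score, bounds):
    """Return True if score lies within the inclusive bounds."""
    low, high = bounds
    return low <= score <= high


def target_hit_rate(qed_outputs):
    """Fraction of generated molecules whose QED lands in the target range."""
    if not qed_outputs:
        return 0.0
    hits = sum(in_range(q, TARGET_RANGE) for q in qed_outputs)
    return hits / len(qed_outputs)


generated_qed = [0.91, 0.85, 0.95, 0.78]  # made-up scores for illustration
rate = target_hit_rate(generated_qed)
print(rate)
```

This hit rate ignores how similar each output is to its input, which a fuller evaluation would likely also need to consider.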

Task Description: Given a molecule X, translate it into another molecule Y with higher QED.

Dataset Statistics: 88,306 molecule pairs.

Dataset Split: Random Split

from tdc.generation import PairMolGen
# Load the QED molecule pairs, then split them.
data = PairMolGen(name='QED')
split = data.get_split()

References:

[1] Jin, Wengong, et al. “Learning multimodal graph-to-graph translation for molecular optimization.” ICLR (2019).

[2] Bickerton, G. Richard, et al. “Quantifying the chemical beauty of drugs.” Nature chemistry 4.2 (2012): 90-98.

[3] Sterling, Teague, and John J. Irwin. “ZINC 15–ligand discovery for everyone.” Journal of chemical information and modeling 55.11 (2015): 2324-2337.

Dataset License: Not Specified. CC BY 4.0.

LogP

Dataset Description: The penalized logP score measures the solubility and synthetic accessibility of a compound. In this task, the model needs to translate input X into output Y such that logP(Y) > logP(X). The dataset is curated from ZINC.
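
The condition logP(Y) > logP(X) can be checked per pair, and the size of the gain can be summarized across pairs. The following is a minimal sketch: `logp_improvements` is a hypothetical helper, the scores are made up, and computing penalized logP is assumed to happen elsewhere.

```python
from statistics import mean


def logp_improvements(scored_pairs):
    """Return logP(Y) - logP(X) for each (logP(X), logP(Y)) tuple."""
    return [logp_y - logp_x for logp_x, logp_y in scored_pairs]


# Made-up (logP(X), logP(Y)) values, for illustration only.
scored_pairs = [(-1.0, 0.5), (0.0, 2.0), (1.5, 1.0)]

gains = logp_improvements(scored_pairs)
improved = [gain > 0 for gain in gains]  # True where logP(Y) > logP(X)
average_gain = mean(gains)
print(gains, improved, average_gain)
```

The third pair fails the condition because its output scores lower than its input, even though the average gain across the three pairs is positive.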

Task Description: Given a molecule X, translate it into another molecule Y with a higher penalized logP score.

Dataset Statistics: 99,909 molecule pairs.

Dataset Split: Random Split

from tdc.generation import PairMolGen
# Load the LogP molecule pairs, then split them.
data = PairMolGen(name='LogP')
split = data.get_split()

References:

[1] Jin, Wengong, et al. “Learning multimodal graph-to-graph translation for molecular optimization.” ICLR (2019).

[2] Kusner, Matt J., Brooks Paige, and José Miguel Hernández-Lobato. “Grammar variational autoencoder.” ICML (2017).

[3] Sterling, Teague, and John J. Irwin. “ZINC 15–ligand discovery for everyone.” Journal of chemical information and modeling 55.11 (2015): 2324-2337.

Dataset License: Not Specified. CC BY 4.0.