Paired Molecule Generation Task Overview
Definition: Paired molecule generation uses a set of molecule pairs (X, Y), where Y is a structurally similar variant of X with a more desirable chemical property. In other words, the machine learning model learns to translate an input molecule X into a similar molecule Y with a better property.
Impact: Lead optimization is a crucial step in drug discovery that consumes considerable time and many rounds of trial and error. After a drug candidate hit is identified via high-throughput screening, improved analogs are created and tested to find a lead compound with better properties than the original hit. Paired molecule generation can automate and accelerate this process.
Generalization: The generated molecules must achieve superior properties across a range of structurally diverse input drugs. In addition, they must satisfy other basic requirements, such as synthesizability and low off-target effects.
Product: Small-molecule.
Pipeline: Efficacy and safety - lead development and optimization.
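To make the (X, Y) pairing in the definition concrete, here is a minimal sketch of how such pairs might be represented and checked. The `MoleculePair` class, the `improved_pairs` helper, and the toy scores are all hypothetical and are not part of the TDC API; a real workflow would score molecules with a property oracle instead of a lookup table.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class MoleculePair:
    """One example: source molecule X and improved molecule Y, as SMILES strings."""
    x: str
    y: str


def improved_pairs(
    pairs: List[MoleculePair],
    score: Callable[[str], float],
) -> List[MoleculePair]:
    """Keep only pairs where Y scores strictly higher than X under `score`."""
    return [p for p in pairs if score(p.y) > score(p.x)]


# Hypothetical property values, for illustration only (not real measurements).
toy_scores = {"CCO": 0.2, "CCN": 0.6, "CCC": 0.1}

pairs = [MoleculePair("CCO", "CCN"), MoleculePair("CCN", "CCC")]
kept = improved_pairs(pairs, toy_scores.__getitem__)
print(kept)
```

Only the first pair survives the filter, because the second pair's Y has a lower toy score than its X.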
DRD2
Dataset Description: DRD2 stands for dopamine type 2 receptor. The model needs to translate inactive compounds (p < 0.05) into active compounds (p ≥ 0.5), where the bioactivity p is assessed by a property prediction model that serves as the oracle. The dataset is curated from ZINC.
Task Description: Given a molecule X, translate it into another molecule Y with higher activity against DRD2.
Dataset Statistics: 34,404 molecule pairs.
Dataset Split: Random Split
from tdc.generation import PairMolGen
# Load the DRD2 paired dataset
data = PairMolGen(name='DRD2')
# Retrieve the train/valid/test split
split = data.get_split()
References:
Dataset License: Not Specified. CC BY 4.0.
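The activity thresholds in the DRD2 description can be written as a small check. This is a sketch only: `is_valid_drd2_pair` is a hypothetical helper, and the scores passed to it are assumed to come from whatever activity predictor plays the role of the oracle.

```python
INACTIVE_MAX = 0.05  # source molecules must satisfy p < 0.05
ACTIVE_MIN = 0.5     # target molecules must satisfy p >= 0.5


def is_valid_drd2_pair(p_x: float, p_y: float) -> bool:
    """Return True if X counts as inactive and Y counts as active.

    p_x and p_y are predicted DRD2 activity scores for X and Y
    (assumed to be produced by an external predictor, not computed here).
    """
    return p_x < INACTIVE_MAX and p_y >= ACTIVE_MIN


# Hypothetical scores, for illustration only.
print(is_valid_drd2_pair(0.01, 0.72))
print(is_valid_drd2_pair(0.05, 0.90))  # X sits exactly on the boundary, so it is not inactive
```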
QED
Dataset Description: QED stands for Quantitative Estimate of Drug-likeness. The model needs to translate molecules with QED scores within the range [0.7, 0.8] into the higher range [0.9, 1.0]. The dataset is curated from ZINC.
Task Description: Given a molecule X, translate it into another molecule Y with higher QED.
Dataset Statistics: 88,306 molecule pairs.
Dataset Split: Random Split
from tdc.generation import PairMolGen
# Load the QED paired dataset
data = PairMolGen(name='QED')
# Retrieve the train/valid/test split
split = data.get_split()
References:
Dataset License: Not Specified. CC BY 4.0.
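As an illustration of the QED ranges above, the sketch below computes the fraction of generated molecules whose QED lands in the target range. The helper names and the example scores are hypothetical, and this hit rate is an illustrative quantity rather than an official TDC metric; in practice the QED values would come from a cheminformatics toolkit.

```python
from typing import Sequence, Tuple

SOURCE_RANGE = (0.7, 0.8)  # QED range of input molecules X
TARGET_RANGE = (0.9, 1.0)  # QED range that outputs Y should reach


def in_range(value: float, bounds: Tuple[float, float]) -> bool:
    """Inclusive range check, matching the closed intervals in the description."""
    lo, hi = bounds
    return lo <= value <= hi


def target_hit_rate(qed_scores: Sequence[float]) -> float:
    """Fraction of generated molecules whose QED falls inside TARGET_RANGE."""
    if not qed_scores:
        return 0.0
    hits = sum(in_range(s, TARGET_RANGE) for s in qed_scores)
    return hits / len(qed_scores)


# Hypothetical QED scores of four generated molecules.
generated_qed = [0.95, 0.85, 0.9, 0.72]
rate = target_hit_rate(generated_qed)
print(rate)
```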
LogP
Dataset Description: The penalized logP score measures the solubility and synthetic accessibility of a compound. In this task, the model needs to translate input X into output Y such that logP(Y) > logP(X). The dataset is curated from ZINC.
Task Description: Given a molecule X, translate it into another molecule Y with a higher penalized logP score.
Dataset Statistics: 99,909 molecule pairs.
Dataset Split: Random Split
from tdc.generation import PairMolGen
# Load the LogP paired dataset
data = PairMolGen(name='LogP')
# Retrieve the train/valid/test split
split = data.get_split()
References:
Dataset License: Not Specified. CC BY 4.0.
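For intuition about the score in this task, the sketch below uses a form of penalized logP that is common in the molecule optimization literature: logP minus the synthetic accessibility (SA) score minus a penalty for rings larger than six atoms. This is an assumption about the definition, and the oracle behind this dataset may differ in details such as normalization. The function names and input values are hypothetical; real logP and SA values would be computed by a cheminformatics toolkit.

```python
def ring_penalty(largest_ring_size: int) -> int:
    """Penalty for large rings: atoms beyond six in the largest ring, else zero."""
    return max(largest_ring_size - 6, 0)


def penalized_logp(logp: float, sa_score: float, largest_ring_size: int) -> float:
    """Assumed form: logP - SA score - large-ring penalty."""
    return logp - sa_score - ring_penalty(largest_ring_size)


def is_logp_improvement(score_x: float, score_y: float) -> bool:
    """The task requires the output Y to score strictly higher than the input X."""
    return score_y > score_x


# Hypothetical component values for an input X and an output Y.
score_x = penalized_logp(logp=2.5, sa_score=3.0, largest_ring_size=8)
score_y = penalized_logp(logp=3.0, sa_score=2.0, largest_ring_size=6)
print(score_x, score_y, is_logp_improvement(score_x, score_y))
```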