Molecule Generation Task Overview
Definition: Molecule generation aims to produce diverse, novel molecules that have desirable chemical properties. These properties are measured by oracle functions. A machine learning model first learns molecular characteristics from a large set of molecules, each of which is evaluated by the oracles. Novel candidates can then be sampled from the learned distribution.
Impact: Because the entire chemical space is far too large to screen for each target, high-throughput screening is restricted to existing molecule libraries, so many novel drug candidates are never considered. A machine learning model that generates novel molecules satisfying pre-defined property criteria can circumvent this problem and yield new classes of candidates.
Generalization: The generated molecules have to achieve superior properties given a range of structurally diverse drugs. In addition, they have to satisfy other basic requirements, such as synthesizability and low off-target effects.
Product: Small-molecule.
Pipeline: Efficacy and safety - lead development and optimization, activity - hit identification.
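The definition above asks for molecules that are both diverse and novel. As a rough illustration, the sketch below computes two simple distribution-learning statistics, uniqueness and novelty, over SMILES strings. It assumes the strings are already canonicalized, so plain string comparison stands in for molecule identity; the function names and toy data are hypothetical and are not TDC's metric implementations.

```python
from typing import Iterable


def uniqueness(generated: Iterable[str]) -> float:
    """Fraction of generated strings that are distinct."""
    generated = list(generated)
    if not generated:
        return 0.0
    return len(set(generated)) / len(generated)


def novelty(generated: Iterable[str], training: Iterable[str]) -> float:
    """Fraction of distinct generated strings that do not occur in the training set."""
    unique_generated = set(generated)
    if not unique_generated:
        return 0.0
    return len(unique_generated - set(training)) / len(unique_generated)


# Toy SMILES strings, assumed to be canonical already (an assumption of this sketch).
training_smiles = ["CCO", "c1ccccc1", "CC(=O)O"]
generated_smiles = ["CCO", "CCN", "CCN", "CCCl"]

print(f"uniqueness = {uniqueness(generated_smiles):.2f}")
print(f"novelty    = {novelty(generated_smiles, training_smiles):.2f}")
```

In practice the strings would first be parsed and canonicalized with a cheminformatics toolkit, since the same molecule can be written as several different SMILES strings.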
MOSES
Dataset Description: Molecular Sets (MOSES) is a benchmark platform for distribution-learning-based molecule generation. Within this benchmark, MOSES provides a cleaned dataset of molecules that are well suited to optimization. It is processed from the ZINC Clean Leads dataset.
Task Description: Supports both distribution-learning-based and goal-oriented molecule generation, that is, generating new molecules that have desirable properties as measured by oracles.
Note: Please visit this page for the calculation of various goal-oriented learning oracles and this page for distribution learning metrics.
Dataset Statistics: 1,936,962 molecules.
Dataset Split: Random Split
from tdc.generation import MolGen
data = MolGen(name = 'MOSES')
split = data.get_split()
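To make the idea of a random split concrete, here is a minimal, library-free sketch that shuffles a list of molecules with a fixed seed and cuts it into train, validation, and test parts. The helper name `random_split`, the 70/10/20 fractions, and the seed are illustrative assumptions, not confirmed TDC defaults or internals.

```python
import random
from typing import Dict, List, Sequence, Tuple


def random_split(
    items: Sequence[str],
    frac: Tuple[float, float, float] = (0.7, 0.1, 0.2),  # assumed fractions
    seed: int = 42,  # assumed seed, fixed for reproducibility
) -> Dict[str, List[str]]:
    """Shuffle items reproducibly and partition them into three subsets."""
    if abs(sum(frac) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(frac[0] * len(shuffled))
    n_valid = round(frac[1] * len(shuffled))
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        "test": shuffled[n_train + n_valid:],  # remainder goes to the test set
    }


# Placeholder identifiers standing in for SMILES strings.
molecules = [f"mol_{i}" for i in range(10)]
parts = random_split(molecules)
print({name: len(subset) for name, subset in parts.items()})
```

Fixing the seed means the same molecules land in the same subset on every run, which keeps comparisons between models fair.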
Dataset License: CC BY-NC-SA 4.0.
ZINC
Dataset Description: ZINC is a free database of commercially available compounds for virtual screening. It contains over 230 million purchasable compounds in ready-to-dock 3D formats. TDC uses the sample of roughly 250,000 molecules from the original Mol-VAE paper.
Task Description: Supports both distribution-learning-based and goal-oriented molecule generation, that is, generating new molecules that have desirable properties as measured by oracles.
Note: Please visit this page for the calculation of various goal-oriented learning oracles and this page for distribution learning metrics.
Dataset Statistics: 249,455 molecules.
Dataset Split: Random Split
from tdc.generation import MolGen
data = MolGen(name = 'ZINC')
split = data.get_split()
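For the goal-oriented setting, the core loop is to score candidate molecules with an oracle and keep the best ones. The sketch below shows that ranking step only. `toy_oracle` is a hypothetical stand-in with no chemical meaning (it simply prefers strings near eight characters); a real oracle would compute a property such as drug-likeness, and `top_k` is likewise an illustrative helper rather than part of any library.

```python
from typing import Callable, Iterable, List, Tuple


def toy_oracle(smiles: str) -> float:
    """Hypothetical oracle: scores in (0, 1], highest for 8-character strings."""
    return 1.0 / (1.0 + abs(len(smiles) - 8))


def top_k(
    candidates: Iterable[str],
    oracle: Callable[[str], float],
    k: int,
) -> List[Tuple[float, str]]:
    """Score each distinct candidate once and return the k highest-scoring ones."""
    scored = [(oracle(s), s) for s in set(candidates)]
    # Sort by descending score, breaking ties alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]


candidates = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "CCO"]
best = top_k(candidates, toy_oracle, k=2)
for score, smiles in best:
    print(f"{smiles}: {score:.2f}")
```

Deduplicating before scoring matters when oracle calls are expensive, since each distinct molecule is then evaluated only once.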
Dataset License: ZINC is free to use for everyone. Redistribution of significant subsets requires written permission from the authors.
ChEMBL
Dataset Description: ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It brings together chemical, bioactivity and genomic data to aid the translation of genomic information into effective new drugs.
Task Description: Supports both distribution-learning-based and goal-oriented molecule generation, that is, generating new molecules that have desirable properties as measured by oracles.
Note: Please visit this page for the calculation of various goal-oriented learning oracles and this page for distribution learning metrics.
Dataset Statistics: 1,961,462 molecules.
Dataset Split: Random Split
from tdc.generation import MolGen
data = MolGen(name = 'ChEMBL')
split = data.get_split()
Note: Starting with version 0.3.5, the ChEMBL-29 version is also available. It has 2,084,723 compounds. You can retrieve it by calling:
from tdc.generation import MolGen
data = MolGen(name = 'ChEMBL_V29')
split = data.get_split()
Dataset License: CC BY-SA 3.0.