Molecule Generation Task Overview

Definition: Molecule generation aims to produce diverse, novel molecules that have desirable chemical properties. These properties are measured by oracle functions. A machine learning model first learns molecular characteristics from a large set of molecules, each of which is evaluated by the oracles. Novel candidates can then be sampled from the learned distribution.
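
The generate-then-score idea in the definition can be illustrated with a minimal sketch. Everything here is an assumption for illustration: `toy_oracle` is a hypothetical stand-in with no chemical meaning (it simply rewards SMILES strings close to a target length), and `select_top_k` is an illustrative helper, not part of TDC. A real oracle would compute a property such as drug-likeness or a docking score.

```python
from typing import Callable, List, Tuple


def toy_oracle(smiles: str, target_len: int = 8) -> float:
    """Hypothetical oracle: returns a score in (0, 1], highest when
    the SMILES string has exactly `target_len` characters."""
    return 1.0 / (1.0 + abs(len(smiles) - target_len))


def select_top_k(
    candidates: List[str], oracle: Callable[[str], float], k: int
) -> List[Tuple[str, float]]:
    """Score each distinct candidate with the oracle and keep the k best."""
    unique = list(dict.fromkeys(candidates))  # drop duplicates, keep order
    scored = [(smiles, oracle(smiles)) for smiles in unique]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Pretend these strings were sampled from a trained generative model.
generated = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O", "CCO", "CC(C)CC(N)=O"]
top = select_top_k(generated, toy_oracle, k=2)
for smiles, score in top:
    print(f"{smiles}\t{score:.3f}")
```

Swapping `toy_oracle` for a real property function leaves the selection logic unchanged, which is why oracles are treated as black-box callables.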

Impact: Because the entire chemical space is far too large to screen for each target, high-throughput screening is restricted to existing molecule libraries. Many novel drug candidates are therefore never considered. A machine learning model that generates novel molecules satisfying pre-defined property criteria can circumvent this problem and yield new classes of candidates.

Generalization: The generated molecules must achieve superior properties across a range of structurally diverse drugs. In addition, they must satisfy other basic requirements, such as synthesizability and low off-target effects.

Product: Small molecule.

Pipeline: Efficacy and safety (lead development and optimization); activity (hit identification).

MOSES

Dataset Description: Molecular Sets (MOSES) is a benchmark platform for distribution-learning-based molecule generation. Within this benchmark, MOSES provides a cleaned dataset of molecules that are ideal for optimization. It is processed from the ZINC Clean Leads dataset.

Task Description: Supports both distribution-learning-based and goal-oriented molecule generation, that is, generating new molecules that have desirable properties as measured by oracles.

Note: Please visit this page for the calculation of various goal-oriented learning oracles and this page for distribution learning metrics.
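
To make the idea of distribution learning metrics concrete, here is a minimal sketch of two commonly reported ones, uniqueness and novelty. It assumes molecules can be compared as exact strings; real benchmarks typically canonicalize SMILES with a cheminformatics toolkit and filter invalid molecules first. The function names are illustrative and are not the TDC or MOSES API.

```python
from typing import Iterable


def uniqueness(generated: Iterable[str]) -> float:
    """Fraction of generated molecules that are distinct."""
    gen = list(generated)
    return len(set(gen)) / len(gen) if gen else 0.0


def novelty(generated: Iterable[str], training: Iterable[str]) -> float:
    """Fraction of distinct generated molecules absent from the training set."""
    unique_gen = set(generated)
    if not unique_gen:
        return 0.0
    return len(unique_gen - set(training)) / len(unique_gen)


# Toy data; assumed to be canonical SMILES already.
training_smiles = ["CCO", "CCN", "c1ccccc1"]
generated_smiles = ["CCO", "CCC", "CCC", "CCCl"]

u = uniqueness(generated_smiles)
n = novelty(generated_smiles, training_smiles)
print(f"uniqueness={u:.2f}, novelty={n:.2f}")
```

A model that memorizes its training set would score high on uniqueness but near zero on novelty, so the two are usually reported together.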

Dataset Statistics: 1,936,962 molecules.

Dataset Split: Random Split

from tdc.generation import MolGen
data = MolGen(name = 'MOSES')
split = data.get_split()
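
For readers unfamiliar with random splitting, the sketch below shows the general technique in plain Python: shuffle with a fixed seed, then cut the list into train, validation, and test partitions. The 70/10/20 fractions, the seed, and the `random_split` helper are assumptions chosen for illustration; consult the TDC documentation for the library's actual defaults and return types.

```python
import random
from typing import Dict, List, Sequence


def random_split(
    items: Sequence[str],
    frac: Sequence[float] = (0.7, 0.1, 0.2),  # assumed fractions
    seed: int = 42,                            # assumed seed
) -> Dict[str, List[str]]:
    """Shuffle reproducibly, then partition into train/valid/test."""
    if abs(sum(frac) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG, no global state
    n_train = round(frac[0] * len(shuffled))
    n_valid = round(frac[1] * len(shuffled))
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        "test": shuffled[n_train + n_valid:],
    }


# Placeholder identifiers standing in for SMILES strings.
molecules = [f"mol_{i}" for i in range(10)]
parts = random_split(molecules)
print({name: len(subset) for name, subset in parts.items()})
```

Fixing the seed makes the partition reproducible, which matters when comparing generative models trained on the same data.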

References:

[1] Polykovskiy et al. “Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models.” Frontiers in Pharmacology (2020).

[2] Sterling, Teague, and John J. Irwin. “ZINC 15–ligand discovery for everyone.” Journal of chemical information and modeling 55.11 (2015): 2324-2337.

Dataset License: CC BY-NC-SA 4.0.

ZINC

Dataset Description: ZINC is a free database of commercially available compounds for virtual screening. It contains over 230 million purchasable compounds in ready-to-dock 3D formats. TDC uses a sampled subset of about 250,000 molecules from the original Mol-VAE paper.

Task Description: Supports both distribution-learning-based and goal-oriented molecule generation, that is, generating new molecules that have desirable properties as measured by oracles.

Note: Please visit this page for the calculation of various goal-oriented learning oracles and this page for distribution learning metrics.

Dataset Statistics: 249,455 molecules.

Dataset Split: Random Split

from tdc.generation import MolGen
data = MolGen(name = 'ZINC')
split = data.get_split()

References:

[1] Sterling, Teague, and John J. Irwin. “ZINC 15–ligand discovery for everyone.” Journal of chemical information and modeling 55.11 (2015): 2324-2337.

[2] Gómez-Bombarelli, Rafael, et al. “Automatic chemical design using a data-driven continuous representation of molecules.” ACS central science 4.2 (2018): 268-276.

Dataset License: ZINC is free to use for everyone. Redistribution of significant subsets requires written permission from the authors.

ChEMBL

Dataset Description: ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It brings together chemical, bioactivity and genomic data to aid the translation of genomic information into effective new drugs.

Task Description: Supports both distribution-learning-based and goal-oriented molecule generation, that is, generating new molecules that have desirable properties as measured by oracles.

Note: Please visit this page for the calculation of various goal-oriented learning oracles and this page for distribution learning metrics.

Dataset Statistics: 1,961,462 molecules.

Dataset Split: Random Split

from tdc.generation import MolGen
data = MolGen(name = 'ChEMBL')
split = data.get_split()

Note: Starting with version 0.3.5, the ChEMBL-29 release is also available. It has 2,084,723 compounds. You can retrieve it by calling:

from tdc.generation import MolGen
data = MolGen(name = 'ChEMBL_V29')
split = data.get_split()

References:

[1] Mendez, David, et al. “ChEMBL: towards direct deposition of bioassay data.” Nucleic acids research 47.D1 (2019): D930-D940.

[2] Davies, Mark, et al. “ChEMBL web services: streamlining access to drug discovery data and utilities.” Nucleic acids research 43.W1 (2015): W612-W620.

Dataset License: CC BY-SA 3.0.