Structure-based Drug Design Task Overview

Definition: Structure-based drug design (SBDD) aims to generate diverse, novel molecules that bind with high affinity to a protein pocket, given as a 3D structure, and that have desirable chemical properties. These properties are measured by oracle functions. A machine learning model first learns, from a large set of protein-ligand pairs, the distribution of molecules conditioned on a protein pocket. Novel candidates are then sampled from the learned conditional distribution.

Impact: Designing a new drug candidate while taking into account its structure and its potential interactions with biological targets, often referred to as structure-based drug design, is of great importance to drug discovery. Recent advances in machine learning, especially geometric deep learning, have brought a new set of tools for modeling highly structured data, including 3D biomolecular structures. The structure-based drug design task is therefore of potential interest both for ML methodological advances and for applications in drug design.

Generalization: The generated molecules must achieve high binding affinity while remaining structurally diverse. They must also satisfy other basic requirements, such as synthesizability and drug-likeness.
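
One common way to quantify structural diversity is one minus the mean pairwise Tanimoto similarity between molecular fingerprints. The sketch below is a minimal illustration under that assumption; the function names and the toy on-bit sets are hypothetical stand-ins for real fingerprints, and the source does not state which diversity measure is used.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint on-bits."""
    if not a and not b:
        return 1.0  # two empty fingerprints are treated as identical
    return len(a & b) / len(a | b)


def internal_diversity(fingerprints):
    """One minus the mean pairwise Tanimoto similarity of a set of molecules."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0  # a single molecule has no diversity to measure
    mean_similarity = sum(tanimoto(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_similarity


# Hypothetical on-bit sets standing in for real molecular fingerprints.
fps = [{1, 2, 3}, {2, 3, 4}, {7, 8, 9}]
diversity = internal_diversity(fps)
```

A value near 1 indicates that the generated molecules share few substructures; a value near 0 indicates near-duplicates.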

Product: Small-molecule.

Pipeline: Efficacy and safety - lead development and optimization, activity - hit identification.

PDBBind

Dataset Description: PDBBind is a comprehensive database extracted from PDB with experimentally measured binding affinity data for protein-ligand complexes.

Task Description: The SBDD task is to generate a ligand that binds to a protein pocket and has desirable properties as measured by oracles.

Note: PDBBind does not allow redistribution of the dataset in any format, so we cannot host it on the TDC server. However, because the dataset requires significant processing to become ML-ready, we provide an alternative route. The user only needs to register at http://www.pdbbind.org.cn/, download the raw dataset, and provide the local path; TDC then detects the path and transforms the data into an ML-ready format in the TDC data loader.
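
Because the loader depends on a user-supplied local path, it can help to validate that path before handing it to TDC. The sketch below is one way to do so; `resolve_dataset_dir` is a hypothetical helper, not part of TDC, and it only checks that the directory exists and is non-empty, without assuming anything about the layout of the raw PDBBind files.

```python
import tempfile
from pathlib import Path


def resolve_dataset_dir(path):
    """Return the absolute dataset directory, or raise if it is missing or empty."""
    root = Path(path).expanduser().resolve()
    if not root.is_dir():
        raise FileNotFoundError(f"PDBBind directory not found: {root}")
    if not any(root.iterdir()):
        raise ValueError(f"PDBBind directory is empty: {root}")
    return root


# Demonstration with a temporary directory standing in for the real download.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "placeholder.txt").write_text("stand-in for raw files")
    checked = resolve_dataset_dir(tmp)
    found = checked.is_dir()
```

The returned path can then be converted with `str(...)` and passed as the `path` argument of the loader.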

Dataset Statistics: 19,445 protein-ligand pairs.

Dataset Split: Random Split

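from tdc.generation import SBDD
# path points to the locally downloaded raw PDBBind files (see the note above)
data = SBDD(name='PDBBind', path='./pdbbind')
# split the protein-ligand pairs at random
split = data.get_split()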
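
As a rough illustration of what a random split does, the following sketch shuffles a list of placeholder complex IDs and partitions it into three subsets. The function `random_split`, the 70/10/20 fractions, the seed, and the `train`/`valid`/`test` keys are assumptions made for illustration; they are not taken from TDC's implementation, and `data.get_split()` may behave differently.

```python
import random


def random_split(items, frac=(0.7, 0.1, 0.2), seed=42):
    """Shuffle items and cut them into train/valid/test partitions."""
    if abs(sum(frac) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(frac[0] * len(shuffled))
    n_valid = round(frac[1] * len(shuffled))
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        "test": shuffled[n_train + n_valid:],  # remainder goes to test
    }


# Placeholder IDs standing in for protein-ligand pairs.
pairs = [f"complex_{i}" for i in range(100)]
partitions = random_split(pairs)
```

Note that a random split can place closely related proteins or ligands in both the training and test sets, which is worth keeping in mind when interpreting results.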

References:

[1] Wang, R., Fang, X., Lu, Y. and Wang, S., 2004. The PDBbind database: collection of binding affinities for protein-ligand complexes with known three-dimensional structures. Journal of medicinal chemistry, 47(12), pp.2977-2980.

Dataset License: See note above.

DUD-E

Dataset Description: DUD-E provides a directory of useful decoys for protein-ligand docking.

Task Description: The SBDD task is to generate a ligand that binds to a protein pocket and has desirable properties as measured by oracles.

Note: DUD-E does not support pocket extraction because the protein and ligand are not aligned.

Dataset Statistics: 22,886 active compounds and affinities against 102 targets

Dataset Split: Random Split

from tdc.generation import SBDD
# load the DUD-E dataset by name
data = SBDD(name='dude')
# split the data at random
split = data.get_split()

References:

[1] Mysinger, M.M., Carchia, M., Irwin, J.J. and Shoichet, B.K., 2012. Directory of useful decoys, enhanced (DUD-E): better ligands and decoys for better benchmarking. Journal of medicinal chemistry, 55(14), pp.6582-6594.

Dataset License: Not specified.

scPDB

Dataset Description: scPDB is processed from PDB for structure-based drug design; it identifies suitable binding sites for protein-ligand docking.

Task Description: The SBDD task is to generate a ligand that binds to a protein pocket and has desirable properties as measured by oracles.

Dataset Statistics: 16,034 protein-ligand pairs over 4,782 proteins and 6,326 ligands

Dataset Split: Random Split

from tdc.generation import SBDD
# load the scPDB dataset by name
data = SBDD(name='scPDB')
# split the protein-ligand pairs at random
split = data.get_split()

References:

[1] Meslamani, J., Rognan, D. and Kellenberger, E., 2011. sc-PDB: a database for identifying variations and multiplicity of ‘druggable’ binding sites in proteins. Bioinformatics, 27(9), pp.1324-1326.

Dataset License: Not specified.