Quantum Mechanics Modeling Task Overview

Definition: The motion of molecules and protein targets can be described accurately with quantum theory, i.e., quantum mechanics (QM). However, ab initio quantum calculations on many-body systems carry a computational cost that is impractical for most applications. Various approximations have been applied to compute energies from electronic structure, but all of them trade accuracy against computational speed. Machine learning models offer hope of breaking this bottleneck by leveraging existing chemical data. This task aims to predict QM results given a drug's structural information.

Impact: A well-trained model can describe the potential energy surface accurately and quickly, making more accurate and longer simulations of molecular systems possible. The simulation results can reveal biological processes at the molecular level and help study the function of protein targets and drug molecules.

Generalization: A machine learning model trained on a set of QM calculations must extrapolate to unseen or structurally diverse sets of compounds.

Product: Small-molecule.

Pipeline: Activity - lead development.

QM7b

Dataset Description: QM7b is an extension of QM7, a subset of GDB-13 (a database of nearly 1 billion stable and synthetically accessible organic molecules) composed of all molecules of up to 23 atoms (including up to 7 heavy atoms: C, N, O, and S). For each molecule, 14 properties (e.g. polarizability, HOMO and LUMO eigenvalues, excitation energies) are to be predicted at different levels of theory (ZINDO, SCS, PBE0, GW).

Task Description: Regression. Given a drug's 3D xyz coordinates, predict the drug's property.
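
To make the input side of this task concrete, the following is a minimal sketch of the Coulomb matrix, a fixed-size descriptor commonly built from nuclear charges and 3D coordinates in this literature. It is an illustration only, not necessarily the featurization TDC applies; the function name, the toy geometry, and the choice of length unit are assumptions (the descriptor is usually defined in atomic units).

```python
import math

def coulomb_matrix(charges, coords):
    """Return the Coulomb matrix as a list of lists.

    charges: nuclear charges Z_i (e.g. 1 for H, 6 for C, 8 for O)
    coords:  (x, y, z) positions, one per atom, in one consistent length unit
    """
    n = len(charges)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal term: 0.5 * Z_i ** 2.4
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                # Off-diagonal term: Z_i * Z_j / |R_i - R_j|
                matrix[i][j] = charges[i] * charges[j] / math.dist(coords[i], coords[j])
    return matrix

# Toy water-like geometry with illustrative values (assumption, not dataset data)
charges = [8, 1, 1]
coords = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
cm = coulomb_matrix(charges, coords)
```

The matrix is symmetric and depends only on interatomic distances, so it is unchanged by translating or rotating the molecule. It does depend on the order of the atoms.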

Dataset Statistics: 7,211 drugs.

Dataset Split: Random Split
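
As a rough illustration of what a random split does, the sketch below shuffles molecule indices with a fixed seed and cuts them into train, validation, and test parts. It is not TDC's implementation; the helper name, the 0.7/0.1/0.2 fractions, and the seed are assumptions chosen for the example.

```python
import random

def random_split(items, frac=(0.7, 0.1, 0.2), seed=42):
    """Shuffle items reproducibly and cut them into train/valid/test lists."""
    if abs(sum(frac) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(items)
    # A dedicated Random instance keeps the global random state untouched
    random.Random(seed).shuffle(shuffled)
    n_train = int(frac[0] * len(shuffled))
    n_valid = int(frac[1] * len(shuffled))
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        "test": shuffled[n_train + n_valid:],  # remainder goes to the test part
    }

# Split indices for a dataset of 7,211 molecules
parts = random_split(range(7211))
sizes = {name: len(rows) for name, rows in parts.items()}
```

Because the assignment is random, structurally similar molecules can land in both the training and test parts, so scores on a random split may overstate how well a model handles structurally diverse compounds.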

Note: QM7b contains multiple properties. To retrieve the labels for a specific property, pass the property name as the label_name argument of the data loader. You can find all available label names by calling:

from tdc.utils import retrieve_label_name_list
label_list = retrieve_label_name_list('QM7b')

Then, go to the standard TDC data loader procedure with the label name specified.

from tdc.single_pred import QM
data = QM(name = 'QM7b', label_name = label_list[0])
split = data.get_split()
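
Once a split is available, a simple reference point for any regression model is a baseline that always predicts the mean of the training labels, scored by mean absolute error (MAE). The sketch below uses small hypothetical label lists in place of real split data; the helper name and the values are assumptions, and MAE is used here as one common choice of regression metric.

```python
from statistics import mean

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))

# Hypothetical label values standing in for one property's train/test targets
train_y = [1.0, 2.0, 3.0, 4.0]
test_y = [2.0, 5.0]

# Baseline: predict the training mean for every test molecule
baseline = mean(train_y)
mae = mean_absolute_error(test_y, [baseline] * len(test_y))
```

A trained model is only informative for a given property if its test MAE is clearly below this baseline.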

References:

[1] Blum, Lorenz C., and Jean-Louis Reymond. “970 million druglike small molecules for virtual screening in the chemical universe database GDB-13.” Journal of the American Chemical Society 131.25 (2009): 8732-8733.

[2] Montavon, Grégoire, et al. “Machine learning of molecular electronic properties in chemical compound space.” New Journal of Physics 15.9 (2013): 095003.

Dataset License: Not Specified. CC BY 4.0.

QM8

Dataset Description: Electronic spectra and excited-state energies of small molecules calculated by multiple quantum mechanical methods. The dataset consists of low-lying singlet-singlet vertical electronic spectra of over 20,000 synthetically feasible small organic molecules with up to eight heavy atoms (C, O, N, F). From MoleculeNet and loaded from DeepChem.

Task Description: Regression. Given a drug's 3D xyz coordinates, predict the drug's property.
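
For readers unfamiliar with xyz coordinates, the sketch below parses text in the plain XYZ file layout (atom count, a comment line, then one "element x y z" line per atom) into element symbols and coordinate tuples. How TDC stores the coordinates internally may differ; the function name and the example geometry are assumptions for illustration.

```python
def parse_xyz(text):
    """Parse XYZ-format text into (symbols, coords)."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])             # first line: number of atoms
    atom_lines = lines[2:2 + n_atoms]   # second line is a free-form comment
    if len(atom_lines) != n_atoms:
        raise ValueError("fewer atom lines than the declared atom count")
    symbols, coords = [], []
    for line in atom_lines:
        symbol, x, y, z = line.split()[:4]
        symbols.append(symbol)
        coords.append((float(x), float(y), float(z)))
    return symbols, coords

# Toy geometry with illustrative values (assumption, not dataset data)
example = """3
toy water geometry
O 0.000 0.000 0.000
H 0.960 0.000 0.000
H -0.240 0.930 0.000
"""
symbols, coords = parse_xyz(example)
```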

Dataset Statistics: 21,786 drugs.

Dataset Split: Random Split

Note: QM8 contains multiple properties. To retrieve the labels for a specific property, pass the property name as the label_name argument of the data loader. You can find all available label names by calling:

from tdc.utils import retrieve_label_name_list
label_list = retrieve_label_name_list('QM8')

Then, go to the standard TDC data loader procedure with the label name specified.

from tdc.single_pred import QM
data = QM(name = 'QM8', label_name = label_list[0])
split = data.get_split()

References:

[1] Ruddigkeit, Lars, et al. “Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17.” Journal of chemical information and modeling 52.11 (2012): 2864-2875.

[2] Ramakrishnan, Raghunathan, et al. “Electronic spectra from TDDFT and machine learning in chemical space.” The Journal of chemical physics 143.8 (2015): 084111.

[3] Ramsundar, Bharath, et al. Deep learning for the life sciences: applying deep learning to genomics, microscopy, drug discovery, and more. O’Reilly Media, Inc., 2019.

[4] Wu, Zhenqin, et al. “MoleculeNet: a benchmark for molecular machine learning.” Chemical science 9.2 (2018): 513-530.

Dataset License: CC BY 4.0.

QM9

Dataset Description: Computed geometric, energetic, electronic, and thermodynamic properties for 134k stable small organic molecules made up of C, H, O, N, and F. These molecules correspond to the subset of all 133,885 species with up to nine heavy atoms (C, O, N, F) out of the GDB-17 chemical universe of 166 billion organic molecules. The dataset reports energy-minimized geometries, the corresponding harmonic frequencies, dipole moments, and polarizabilities, along with energies, enthalpies, and free energies of atomization. All properties were calculated at the B3LYP/6-31G(2df,p) level of quantum chemistry. From MoleculeNet and loaded from DeepChem.

Task Description: Regression. Given a drug's 3D xyz coordinates, predict the drug's property.

Dataset Statistics: 133,885 drugs.

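Dataset Split: Random Split

Note: QM9 contains multiple properties. To retrieve the labels for a specific property, pass the property name as the label_name argument of the data loader. You can find all available label names by calling: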

from tdc.utils import retrieve_label_name_list
label_list = retrieve_label_name_list('QM9')

Then, go to the standard TDC data loader procedure with the label name specified.

from tdc.single_pred import QM
data = QM(name = 'QM9', label_name = label_list[0])
split = data.get_split()

References:

[1] Ruddigkeit, Lars, et al. “Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17.” Journal of chemical information and modeling 52.11 (2012): 2864-2875.

[2] Ramakrishnan, Raghunathan, et al. “Electronic spectra from TDDFT and machine learning in chemical space.” The Journal of chemical physics 143.8 (2015): 084111.

[3] Ramsundar, Bharath, et al. Deep learning for the life sciences: applying deep learning to genomics, microscopy, drug discovery, and more. O’Reilly Media, Inc., 2019.

[4] Wu, Zhenqin, et al. “MoleculeNet: a benchmark for molecular machine learning.” Chemical science 9.2 (2018): 513-530.

Dataset License: CC BY 4.0.