Quantum Mechanics Modeling Task Overview
Definition: The motion of molecules and protein targets can be described accurately with quantum theory, i.e., quantum mechanics (QM). However, ab initio quantum calculations of many-body systems carry a large computational overhead that is impractical for most applications. Various approximations have been applied to solve for the energy from the electronic structure, but all of them trade off accuracy against computational speed. Machine learning models offer hope of breaking this bottleneck by leveraging existing chemical data. This task aims to predict QM results given a drug's structural information.
Impact: A well-trained model can describe the potential energy surface accurately and quickly, making more accurate and longer simulations of molecular systems possible. Simulation results can reveal biological processes at the molecular level and help study the function of protein targets and drug molecules.
Generalization: A machine learning model trained on a set of QM calculations must extrapolate to unseen or structurally diverse sets of compounds.
Product: Small-molecule.
Pipeline: Activity - lead development.
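The generalization requirement above can be probed with a split that deliberately holds out structurally different molecules. The following is a minimal sketch, assuming each record carries a heavy-atom count; the function name split_by_heavy_atoms, the n_heavy key, and the toy records are hypothetical and are not part of TDC.

```python
def split_by_heavy_atoms(records, max_train_heavy_atoms):
    """Split records into train/test sets by heavy-atom count.

    records: list of dicts with a hypothetical 'n_heavy' key.
    Molecules larger than the threshold are held out for testing,
    so the test set probes extrapolation to bigger compounds.
    """
    train = [r for r in records if r["n_heavy"] <= max_train_heavy_atoms]
    test = [r for r in records if r["n_heavy"] > max_train_heavy_atoms]
    return train, test


# Toy records: IDs, sizes, and labels are made up for illustration only.
records = [
    {"id": "mol_a", "n_heavy": 3, "y": -0.21},
    {"id": "mol_b", "n_heavy": 5, "y": -0.35},
    {"id": "mol_c", "n_heavy": 7, "y": -0.48},
    {"id": "mol_d", "n_heavy": 4, "y": -0.30},
]

train, test = split_by_heavy_atoms(records, max_train_heavy_atoms=5)
print([r["id"] for r in train])
print([r["id"] for r in test])
```

A size-based hold-out is only one possible way to test extrapolation; it is shown here because it needs nothing beyond the atom counts.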
QM7b
Dataset Description: QM7b extends QM7, a subset of GDB-13 (a database of nearly 1 billion stable and synthetically accessible organic molecules) composed of all molecules of up to 23 atoms (including up to 7 heavy atoms: C, N, O, and S). For QM7b, 14 properties (e.g. polarizability, HOMO and LUMO eigenvalues, excitation energies) are to be predicted at different levels of theory (ZINDO, SCS, PBE0, GW).
Task Description: Regression. Given a drug's 3D xyz coordinates, predict the drug's property.
Dataset Statistics: 7,211 drugs.
Dataset Split: Random Split
Note: QM7b contains multiple properties. To retrieve the labels for a specific property, pass the property name as the label_name argument of the data loader. You can find all available label names by calling:
from tdc.utils import retrieve_label_name_list
label_list = retrieve_label_name_list('QM7b')
Then follow the standard TDC data loader procedure with the label name specified:
from tdc.single_pred import QM
data = QM(name = 'QM7b', label_name = label_list[0])
split = data.get_split()
Dataset License: Not Specified. CC BY 4.0.
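To make the QM7b task concrete (3D xyz coordinates in, a property out), the sketch below turns coordinates into a fixed-size numeric input using the Coulomb matrix, one widely used descriptor for datasets of this kind. It is an illustration only and not necessarily what TDC or any particular model uses; the function name coulomb_matrix and the toy two-atom geometry are hypothetical, and distances are assumed to be in atomic units.

```python
import math


def coulomb_matrix(charges, coords):
    """Build the Coulomb matrix for one molecule.

    charges: nuclear charges Z_i, one per atom.
    coords:  (x, y, z) positions, one per atom (assumed atomic units).
    Diagonal entries are 0.5 * Z_i ** 2.4; off-diagonal entries are
    Z_i * Z_j / |R_i - R_j|.
    """
    n = len(charges)
    cm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                cm[i][j] = 0.5 * charges[i] ** 2.4
            else:
                dist = math.dist(coords[i], coords[j])
                cm[i][j] = charges[i] * charges[j] / dist
    return cm


# Toy geometry: two hydrogen atoms 1.4 units apart (illustrative values).
charges = [1, 1]
coords = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.4)]
cm = coulomb_matrix(charges, coords)
for row in cm:
    print(row)
```

The matrix is symmetric and unchanged by translating or rotating the molecule, which is why it is a convenient starting point for regression on coordinate data.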
QM8
Dataset Description: Electronic spectra and excited-state energies of small molecules calculated by multiple quantum mechanical methods. The dataset consists of low-lying singlet-singlet vertical electronic spectra of over 20,000 synthetically feasible small organic molecules with up to eight heavy atoms (C, O, N, F). From MoleculeNet and loaded from DeepChem.
Task Description: Regression. Given a drug's 3D xyz coordinates, predict the drug's property.
Dataset Statistics: 21,786 drugs.
Dataset Split: Random Split
Note: QM8 contains multiple properties. To retrieve the labels for a specific property, pass the property name as the label_name argument of the data loader. You can find all available label names by calling:
from tdc.utils import retrieve_label_name_list
label_list = retrieve_label_name_list('QM8')
Then follow the standard TDC data loader procedure with the label name specified:
from tdc.single_pred import QM
data = QM(name = 'QM8', label_name = label_list[0])
split = data.get_split()
Dataset License: CC BY 4.0.
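The "Random Split" listed for QM8 can be pictured as a seeded shuffle of row indices followed by a cut into three parts. This is a minimal sketch of that idea, not TDC's implementation; the function name random_split, the 0.7/0.1/0.2 fractions, the seed, and the train/valid/test keys are assumptions chosen for illustration.

```python
import random


def random_split(n_items, frac=(0.7, 0.1, 0.2), seed=42):
    """Shuffle indices 0..n_items-1 and cut them into three disjoint sets.

    frac: assumed (train, valid, test) fractions; the test set receives
          whatever remains after the train and valid cuts.
    seed: fixes the shuffle so the split is reproducible.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = int(frac[0] * n_items)
    n_valid = int(frac[1] * n_items)
    return {
        "train": indices[:n_train],
        "valid": indices[n_train:n_train + n_valid],
        "test": indices[n_train + n_valid:],
    }


# 21,786 is the QM8 size stated in Dataset Statistics.
split = random_split(21786)
print({name: len(idx) for name, idx in split.items()})
```

Fixing the seed matters when comparing models: two runs that shuffle differently are evaluated on different test molecules.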
QM9
Dataset Description: Computed geometric, energetic, electronic, and thermodynamic properties for 134k stable small organic molecules made up of CHONF. These molecules correspond to the subset of all 133,885 species with up to nine heavy atoms (CONF) out of the GDB-17 chemical universe of 166 billion organic molecules. The dataset reports energy-minimized geometries, the corresponding harmonic frequencies, dipole moments, and polarizabilities, along with energies, enthalpies, and free energies of atomization. All properties were calculated at the B3LYP/6-31G(2df,p) level of quantum chemistry. From MoleculeNet and loaded from DeepChem.
Task Description: Regression. Given a drug's 3D xyz coordinates, predict the drug's property.
Dataset Statistics: 133,885 drugs.
Dataset Split: Random Split
Note: QM9 contains multiple properties. To retrieve the labels for a specific property, pass the property name as the label_name argument of the data loader. You can find all available label names by calling:
from tdc.utils import retrieve_label_name_list
label_list = retrieve_label_name_list('QM9')
Then follow the standard TDC data loader procedure with the label name specified:
from tdc.single_pred import QM
data = QM(name = 'QM9', label_name = label_list[0])
split = data.get_split()
Dataset License: CC BY 4.0.
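Because each of these tasks is a regression on a single selected property, predictions on the test split are typically scored with an error metric. The sketch below uses mean absolute error as one common choice; the source does not prescribe a metric, and the function name and toy values are hypothetical.

```python
def mean_absolute_error(y_true, y_pred):
    """Return the average absolute difference between labels and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one value is required")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels and predictions, made up for illustration only.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 2.0]
mae = mean_absolute_error(y_true, y_pred)
print(mae)
```

Since the properties in a multi-label dataset can have different units and scales, an error computed for one label_name is not directly comparable to the error for another.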