Molecular Property Cliff Prediction Task Overview
Definition: Activity cliffs are pairs of molecules with small differences in structure but large differences in potency. Activity cliffs play an important role in drug discovery, but the bioactivity of activity cliff compounds is notoriously difficult to predict.
Impact: Predicting molecular activity and modeling quantitative structure-activity relationships are crucial for drug discovery. Graph neural networks use molecular structures as frameworks to evaluate the biological activity of chemical compounds. They guide the selection and optimization of candidates for further development. However, current models often overlook activity cliffs (ACs), where structurally similar molecules exhibit different bioactivities. This oversight is due to latent spaces primarily optimized for structural features (Wan Xiang et al.).
Generalization: Activity cliffs (ACs) occur when structurally similar molecules have very different biological activities, which makes accurate modeling difficult. This is especially problematic for graph neural networks (GNNs), where similar molecules are grouped closely in the latent space, leading to inaccurate predictions when their activities differ significantly. Overcoming this challenge is essential for improving the accuracy and reliability of molecular activity predictions. A more practical approach can be to handle ACs directly at the level of compound pairs, by predicting whether a matched molecular pair (MMP) forms an AC based on a predefined activity threshold. For example, a pair is classified as an MMP-cliff if the activity difference is greater than 100-fold, or as an MMP-nonCliff if the activity difference is less than 10-fold. QSAR regression models have been observed to have low sensitivity to ACs when the activities of both compounds in the MMP are unknown, that is, when both are absent from the training set. Developing techniques to improve AC sensitivity could enhance the performance of QSAR models and is a promising direction for future research (Wan Xiang et al.).
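The threshold rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the ACGCN or ACANet implementation: the function names, the choice of nM potencies as input, and the decision to leave pairs between the two thresholds unlabeled are all assumptions made for the example.

```python
from typing import Optional


def fold_difference(potency_a: float, potency_b: float) -> float:
    """Return the fold difference (>= 1) between two potencies in the same unit."""
    if potency_a <= 0 or potency_b <= 0:
        raise ValueError("potencies must be positive")
    return max(potency_a, potency_b) / min(potency_a, potency_b)


def label_mmp(
    potency_a: float,
    potency_b: float,
    cliff_fold: float = 100.0,      # threshold quoted in the text above
    non_cliff_fold: float = 10.0,   # threshold quoted in the text above
) -> Optional[str]:
    """Label a matched molecular pair from the potencies of its two compounds."""
    fold = fold_difference(potency_a, potency_b)
    if fold > cliff_fold:
        return "MMP-cliff"
    if fold < non_cliff_fold:
        return "MMP-nonCliff"
    # Assumption: pairs between the two thresholds are left unlabeled.
    return None


# Hypothetical potencies in nM for three pairs of analogues.
print(label_mmp(5.0, 1200.0))
print(label_mmp(50.0, 200.0))
print(label_mmp(10.0, 500.0))
```

Leaving a gap between the two thresholds keeps borderline pairs out of both classes, so small measurement noise cannot move a pair from one label to the other.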
Product: Small Molecule
Pipeline: Efficacy and safety - lead development and optimization. Hit identification and hit-to-lead optimization
Wan Xiang et al.
Dataset Description: Benchmark datasets for molecular property cliffs (MPC) from the ACANet paper. They include: 1) 9 low-sample-size, narrow-scaffold (LSSNS) datasets for molecular activity prediction; 2) 30 high-sample-size, mixed-scaffold (HSSMS) datasets for molecular activity prediction, taken from the MoleculeACE benchmark; 3) 3 matched molecular pair (MMP) datasets for activity cliff classification, from ACGCN; 4) 10 datasets of ADMET properties for delta prediction, from DeepDelta.
Information on each individual dataset can be found at https://github.com/bidd-group/MPCD .
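To make "delta prediction" concrete, the sketch below turns per-molecule property values into pairwise differences. It is a hedged illustration only: the pairing scheme (all ordered pairs), the sign convention (second minus first), the function name, and the example values are assumptions, not the documented DeepDelta procedure.

```python
from itertools import permutations
from typing import List, Tuple


def make_delta_pairs(
    molecules: List[Tuple[str, float]],
) -> List[Tuple[str, str, float]]:
    """Build (smiles_a, smiles_b, value_b - value_a) for every ordered pair.

    Assumption: all ordered pairs are used and the delta is second minus first.
    """
    return [
        (smiles_a, smiles_b, value_b - value_a)
        for (smiles_a, value_a), (smiles_b, value_b) in permutations(molecules, 2)
    ]


# Hypothetical property values for three small molecules.
example = [("CCO", -5.0), ("CCN", -4.5), ("CCC", -6.0)]
for smiles_a, smiles_b, delta in make_delta_pairs(example):
    print(smiles_a, smiles_b, delta)
```

With n molecules this produces n * (n - 1) ordered pairs, so the number of training examples grows quadratically with the size of the dataset.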
Task Description: Regression. Given a SMILES sequence, predict the activity cliff of the small molecule compound.
Dataset Statistics: More information on each individual dataset can be found at https://github.com/bidd-group/MPCD .
Dataset Split: Random Split, Scaffold Split
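The difference between the two split types can be sketched as follows. This is a simplified illustration rather than the library's own splitting code: the function names are hypothetical, only a train/test split is shown, and the scaffold keys are assumed to be precomputed strings (in practice they would typically be derived from each SMILES with a cheminformatics toolkit).

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def random_split(items: Sequence[str], frac_train: float = 0.8, seed: int = 0):
    """Shuffle the items and cut once; similar molecules can land on both sides."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * frac_train)
    return shuffled[:cut], shuffled[cut:]


def scaffold_split(records: Sequence[Tuple[str, str]], frac_train: float = 0.8):
    """Split (smiles, scaffold_key) records so that no scaffold is shared.

    Assumption: scaffold_key is a precomputed identifier for each molecule.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for smiles, scaffold_key in records:
        groups[scaffold_key].append(smiles)

    # Place the largest scaffold groups first so the training set fills up
    # with common scaffolds and rarer ones fall into the test set.
    ordered = sorted(groups.values(), key=len, reverse=True)
    train, test = [], []
    train_target = frac_train * len(records)
    for group in ordered:
        if len(train) + len(group) <= train_target:
            train.extend(group)
        else:
            test.extend(group)
    return train, test


# Hypothetical records: three molecules share scaffold "A".
records = [("m1", "A"), ("m2", "A"), ("m3", "A"), ("m4", "B"), ("m5", "C")]
train, test = scaffold_split(records)
print(train, test)
```

Because whole scaffold groups are assigned to one side, a scaffold split tests generalization to unseen chemotypes, whereas a random split can place close analogues in both the training and the test set.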
from tdc.single_pred import MPC
data = MPC(name = "INSERT_URL_HERE") # url from the source github repo https://github.com/bidd-group/MPCD/tree/main/dataset
# example url: https://raw.githubusercontent.com/bidd-group/MPCD/main/dataset/ADMET/DeepDelta_benchmark/Caco2.csv
split = data.get_data()
We additionally support direct retrieval from the MoleculeACE API [2] for those datasets. You can call:
data = MPC(name = "INSERT_MOLECULEACE_HERE", get_from_gh = False) # name from MoleculeACE API https://github.com/molML/MoleculeACE?tab=readme-ov-file
References:
Dataset License: CC BY 4.0.