Toxicity Prediction Task Overview

Definition: Most drugs are toxic to humans to some extent. This learning task aims to accurately predict various types of toxicity of a drug molecule towards humans.

Impact: Toxicity is one of the primary causes of compound attrition. Research shows that approximately 70% of all toxicity-related attrition occurs preclinically (i.e., in cells and animals), and that these preclinical findings are strongly predictive of toxicities in humans. This suggests that early and accurate toxicity prediction can significantly reduce compound attrition and increase the likelihood that a compound reaches the market.

Generalization: As with ADME prediction, the drug structures of interest evolve over time, so toxicity prediction requires a model to generalize to novel drugs that have little structural similarity to the existing drug set.
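
The generalization requirement above is what a scaffold split is meant to test: structurally related molecules are kept together so that the test set contains chemotypes unseen in training. The following is a minimal sketch of that grouping idea in plain Python, not TDC's implementation. It assumes each molecule already has a scaffold key; the function name group_split and the toy scaffold labels are hypothetical.

```python
from collections import defaultdict


def group_split(scaffold_of, frac=(0.7, 0.1, 0.2)):
    """Split molecule IDs so each scaffold group lands in exactly one subset.

    scaffold_of: dict mapping molecule ID -> scaffold key (assumed precomputed).
    frac: target fractions for train, valid, and test.
    """
    groups = defaultdict(list)
    for mol_id, scaffold in scaffold_of.items():
        groups[scaffold].append(mol_id)

    # Largest groups first, so common scaffolds fill the training set and
    # rarer scaffolds fall through to validation and test.
    ordered = sorted(groups.values(), key=len, reverse=True)

    n_total = len(scaffold_of)
    train_cap = frac[0] * n_total
    valid_cap = frac[1] * n_total
    train, valid, test = [], [], []
    for members in ordered:
        if len(train) + len(members) <= train_cap:
            train.extend(members)
        elif len(valid) + len(members) <= valid_cap:
            valid.extend(members)
        else:
            test.extend(members)
    return train, valid, test


# Toy input with made-up scaffold keys (illustrative only, not real chemistry).
toy = {
    "m1": "benzene", "m2": "benzene", "m3": "benzene", "m4": "benzene",
    "m5": "pyridine", "m6": "pyridine", "m7": "pyridine",
    "m8": "indole", "m9": "furan", "m10": "thiophene",
}
train, valid, test = group_split(toy)
```

Because whole groups are assigned at once, no scaffold key appears in more than one subset, at the cost of subset sizes that only approximate the requested fractions.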

Product: Small-molecule.

Pipeline: Efficacy and safety - lead development and optimization.

Acute Toxicity LD50

Dataset Description: Acute toxicity LD50 measures the most conservative dose that can lead to lethal adverse effects. The higher the dose, the more lethal the drug is. This dataset is kindly provided by the authors of [1].

Task Description: Regression. Given a drug SMILES string, predict its acute toxicity.

Dataset Statistics: 7,385 drugs.

Dataset Split: Random Split, Scaffold Split

from tdc.single_pred import Tox
data = Tox(name = 'LD50_Zhu')
split = data.get_split()
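
For intuition about what a random split does, here is a minimal offline sketch in plain Python. It is not TDC's implementation: the function random_split, its seed and frac parameters, and the 70/10/20 fractions are assumptions chosen for illustration, and the SMILES strings are arbitrary examples.

```python
import random


def random_split(items, frac=(0.7, 0.1, 0.2), seed=42):
    """Shuffle items reproducibly and slice them into train/valid/test."""
    shuffled = list(items)
    # A dedicated Random instance keeps the shuffle reproducible without
    # touching the global random state.
    random.Random(seed).shuffle(shuffled)
    n_train = round(frac[0] * len(shuffled))
    n_valid = round(frac[1] * len(shuffled))
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        "test": shuffled[n_train + n_valid:],
    }


# Arbitrary example SMILES strings, used only as placeholders.
smiles = ["CCO", "CCN", "CCC", "c1ccccc1", "CC(=O)O",
          "CCCl", "CCBr", "COC", "CC=O", "C#N"]
toy_split = random_split(smiles)
```

Fixing the seed makes the partition repeatable, which matters when comparing models on the same train, validation, and test sets.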

References:

[1] Zhu, Hao, et al. “Quantitative structure-activity relationship modeling of rat acute toxicity by oral exposure.” Chemical research in toxicology 22.12 (2009): 1913-1921.

Dataset License: Not Specified. CC BY 4.0.

hERG blockers

Dataset Description: The human ether-à-go-go related gene (hERG) is crucial for the coordination of the heart's beating. If a drug blocks hERG, it could lead to severe adverse effects. Reliable prediction of hERG liability in the early stages of drug design is therefore important for reducing the risk of cardiotoxicity-related attrition in later development stages.

Task Description: Binary classification. Given a drug SMILES string, predict whether it blocks hERG (1) or not (0).

Dataset Statistics: 648 drugs.

Dataset Split: Random Split, Scaffold Split

from tdc.single_pred import Tox
data = Tox(name = 'hERG')
split = data.get_split()

References:

[1] Wang, Shuangquan, et al. “ADMET evaluation in drug discovery. 16. Predicting hERG blockers by combining multiple pharmacophores and machine learning approaches.” Molecular Pharmaceutics 13.8 (2016): 2855-2866.

Dataset License: Not Specified. CC BY 4.0.

hERG Central

Task Description:

hERG_at_1uM: Regression. Given a drug SMILES string, predict the percent inhibition at a 1µM concentration.
hERG_at_10uM: Regression. Given a drug SMILES string, predict the percent inhibition at a 10µM concentration.
hERG_inhib: Binary classification. Given a drug SMILES string, predict whether it blocks hERG (1) or not (0). This is equivalent to whether hERG_at_10uM < -50, i.e. whether the compound has an IC50 of less than 10µM.
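
The relationship between hERG_inhib and hERG_at_10uM stated above can be written as a one-line rule. This is a minimal sketch: the function name is hypothetical and the readings are made-up values, not taken from the dataset.

```python
def herg_inhib_from_10um(percent_inhibition_at_10um: float) -> int:
    """Return 1 (blocker) when the 10µM reading is below -50, else 0."""
    return 1 if percent_inhibition_at_10um < -50 else 0


# Made-up readings for illustration only.
readings = [-72.3, -50.0, -12.5, 4.1]
labels = [herg_inhib_from_10um(x) for x in readings]
# labels == [1, 0, 0, 0]; a reading of exactly -50 is not below the threshold
```

The comparison is strict, so a reading of exactly -50 maps to the non-blocker class.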

Dataset Statistics: 306,893 drugs.

Dataset Split: Random Split, Scaffold Split

Note: hERG Central contains multiple labels (hERG_at_1uM, hERG_at_10uM, hERG_inhib). To retrieve a specific one, pass its label name to the data loader through the label_name argument. You can find all available label names by calling:

from tdc.utils import retrieve_label_name_list
label_list = retrieve_label_name_list('herg_central')

Then, go to the standard TDC data loader procedure with the label name specified.

from tdc.single_pred import Tox
data = Tox(name = 'herg_central', label_name = label_list[0])
split = data.get_split()

References:

[1] Du F, Yu H, Zou B, Babcock J, Long S, Li M. hERGCentral: a large database to store, retrieve, and analyze compound-human Ether-à-go-go related gene channel interactions to facilitate cardiotoxicity assessment in drug development. Assay Drug Dev Technol. 2011 Dec;9(6):580-8. doi: 10.1089/adt.2011.0425.

Dataset License: Not Specified. CC BY 4.0.

Contributed by: Ben Birnbaum.

hERG Karim et al.

Dataset Description: An integrated ether-à-go-go related gene (hERG) dataset of molecular structures, given as SMILES strings and labelled as hERG blockers (<10uM) or non-hERG blockers (>=10uM). It was assembled from DeepHIT, the BindingDB database, the ChEMBL bioactivity database, and other literature.

Task Description: Binary classification. Given a drug SMILES string, predict whether it blocks hERG (1, <10uM) or not (0, >=10uM).
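
The 10uM cut-off above can be sketched as a labelling rule on a measured potency value. This is an illustration only: the function name is hypothetical, and the assumption that the input is an IC50 expressed in nM (hence the unit conversion) is not stated in the dataset description.

```python
def herg_label_from_ic50_nm(ic50_nm: float) -> int:
    """Return 1 (blocker) if the IC50 is below 10 uM, else 0.

    Assumes the input is an IC50 in nanomolar; 1 uM = 1000 nM.
    """
    ic50_um = ic50_nm / 1000.0
    return 1 if ic50_um < 10 else 0


# Made-up potencies for illustration only.
examples_nm = [35.0, 9999.0, 10000.0, 250000.0]
example_labels = [herg_label_from_ic50_nm(v) for v in examples_nm]
# example_labels == [1, 1, 0, 0]; exactly 10 uM counts as a non-blocker
```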

Dataset Statistics: 13,445 drugs.

Dataset Split: Random Split, Scaffold Split

from tdc.single_pred import Tox
data = Tox(name = 'hERG_Karim')
split = data.get_split()

References:

[1] Karim, A., et al. CardioTox net: a robust predictor for hERG channel blockade based on deep learning meta-feature ensembles. J Cheminform 13, 60 (2021).

Dataset License: Not Specified. CC BY 4.0.

Ames Mutagenicity

Dataset Description: Mutagenicity is the ability of a drug to induce genetic alterations. Drugs that damage DNA can cause cell death or other severe adverse effects. The most widely used assay for testing the mutagenicity of compounds is the Ames test, named after Professor Ames, who invented it. It is a short-term bacterial reverse mutation assay that detects a large number of compounds that can induce genetic damage and frameshift mutations. The dataset is aggregated from four papers.

Task Description: Binary classification. Given a drug SMILES string, predict whether it is mutagenic (1) or not mutagenic (0).

Dataset Statistics: 7,255 drugs.

Dataset Split: Random Split, Scaffold Split

from tdc.single_pred import Tox
data = Tox(name = 'AMES')
split = data.get_split()

References:

[1] Xu, Congying, et al. “In silico prediction of chemical Ames mutagenicity.” Journal of chemical information and modeling 52.11 (2012): 2840-2847.

Dataset License: Not Specified. CC BY 4.0.

DILI (Drug Induced Liver Injury)

Dataset Description: Drug-induced liver injury (DILI) is a fatal liver disease caused by drugs, and it has been the single most frequent cause of safety-related drug marketing withdrawals for the past 50 years (e.g. iproniazid, ticrynafen, benoxaprofen). This dataset is aggregated from the U.S. FDA’s National Center for Toxicological Research.

Task Description: Binary classification. Given a drug SMILES string, predict whether it can cause liver injury (1) or not (0).

Dataset Statistics: 475 drugs.

Dataset Split: Random Split, Scaffold Split

from tdc.single_pred import Tox
data = Tox(name = 'DILI')
split = data.get_split()

References:

[1] Xu, Youjun, et al. “Deep learning for drug-induced liver injury.” Journal of chemical information and modeling 55.10 (2015): 2085-2093.

Dataset License: Not Specified. CC BY 4.0.

Skin Reaction

Dataset Description: Repetitive exposure to a chemical agent can induce an immune reaction in inherently susceptible individuals that leads to skin sensitization. The dataset used in this study was retrieved from the ICCVAM (Interagency Coordinating Committee on the Validation of Alternative Methods) report on the rLLNA.

Task Description: Binary classification. Given a drug SMILES string, predict whether it can cause a skin reaction (1) or not (0).

Dataset Statistics: 404 drugs.

Dataset Split: Random Split, Scaffold Split

from tdc.single_pred import Tox
data = Tox(name = 'Skin Reaction')
split = data.get_split()

References:

[1] Alves, Vinicius M., et al. “Predicting chemically-induced skin reactions. Part I: QSAR models of skin sensitization and their application to identify potentially hazardous compounds.” Toxicology and applied pharmacology 284.2 (2015): 262-272.

[2] The reduced murine local lymph node assay: an alternative test method using fewer animals to assess the allergic contact dermatitis potential of chemicals and products.

Dataset License: Not Specified. CC BY 4.0.

Carcinogens

Dataset Description: A carcinogen is any substance, radionuclide, or radiation that promotes carcinogenesis, the formation of cancer. This may be due to the ability to damage the genome or to the disruption of cellular metabolic processes.

Task Description: Binary classification. Given a drug SMILES string, predict whether it is carcinogenic.

Dataset Statistics: 278 drugs.

Dataset Split: Random Split, Scaffold Split

from tdc.single_pred import Tox
data = Tox(name = 'Carcinogens_Lagunin')
split = data.get_split()

References:

[1] Lagunin, Alexey, et al. “Computer‐aided prediction of rodent carcinogenicity by PASS and CISOC‐PSCT.” QSAR & Combinatorial Science 28.8 (2009): 806-810.

[2] Cheng, Feixiong, et al. “admetSAR: a comprehensive source and free tool for assessment of chemical ADMET properties.” Journal of chemical information and modeling 52.11 (2012): 3099-3105.

Dataset License: Not Specified. CC BY 4.0.

Tox21

Dataset Description: Tox21 is a data challenge that contains qualitative toxicity measurements for 7,831 compounds on 12 different targets, such as nuclear receptors and stress response pathways.

Task Description: Binary classification. Given a drug SMILES string, predict the toxicity in a specific assay.

Dataset Statistics: Varies by assay; approximately 6,000 drugs.
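
The per-assay variation in dataset size presumably arises because not every compound has a measurement in every assay. A minimal sketch of counting labelled compounds per assay follows. The rows and label values are made up for illustration, the function labeled_counts is hypothetical, and the two column names are used only as example assay labels.

```python
def labeled_counts(rows, assays):
    """Count, for each assay, the rows that carry a non-missing label."""
    return {
        assay: sum(1 for row in rows if row.get(assay) is not None)
        for assay in assays
    }


# Made-up table: None marks a compound with no measurement in that assay.
rows = [
    {"Drug": "CCO", "NR-AR": 0, "SR-ARE": None},
    {"Drug": "c1ccccc1", "NR-AR": None, "SR-ARE": 1},
    {"Drug": "CC(=O)O", "NR-AR": 0, "SR-ARE": 0},
    {"Drug": "CCN", "NR-AR": 1, "SR-ARE": None},
]
counts = labeled_counts(rows, ["NR-AR", "SR-ARE"])
# counts == {"NR-AR": 3, "SR-ARE": 2}
```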

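Dataset Split: Random Split, Scaffold Split

Note: Tox21 contains data from multiple assays. To retrieve the labels for a specific assay, pass its label name to the data loader through the label_name argument. You can find all available label names by calling: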

from tdc.utils import retrieve_label_name_list
label_list = retrieve_label_name_list('Tox21')

Then, go to the standard TDC data loader procedure with the label name specified.

from tdc.single_pred import Tox
data = Tox(name = 'Tox21', label_name = label_list[0])
split = data.get_split()

References:

[1] Tox21 Challenge.

Dataset License: Not Specified. CC BY 4.0.

ToxCast

Dataset Description: ToxCast includes qualitative results of over 600 experiments on 8k compounds.

Task Description: Binary classification. Given a drug SMILES string, predict the toxicity in a specific assay.

Dataset Statistics: Varies by assay, from a couple hundred to thousands of drugs.

Dataset Split: Random Split, Scaffold Split

Note: ToxCast contains data from multiple assays. To retrieve the labels for a specific assay, pass its label name to the data loader through the label_name argument. You can find all available label names by calling:

from tdc.utils import retrieve_label_name_list
label_list = retrieve_label_name_list('ToxCast')

Then, go to the standard TDC data loader procedure with the label name specified.

from tdc.single_pred import Tox
data = Tox(name = 'ToxCast', label_name = label_list[0])
split = data.get_split()
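
Since each assay is loaded separately, working with several assays means repeating the loader call once per label name. The helper below is a hedged sketch of that loop. The name splits_by_label is hypothetical, and FakeTox is an offline stand-in so the example runs without downloading data; it is not part of TDC.

```python
def splits_by_label(dataset_name, label_names, loader_cls):
    """Build one loader per label name and collect its split.

    loader_cls is any class accepting name= and label_name= keyword
    arguments and exposing get_split(), as the Tox loader does above.
    """
    return {
        label: loader_cls(name=dataset_name, label_name=label).get_split()
        for label in label_names
    }


class FakeTox:
    """Offline stand-in for the real loader, used only for this demo."""

    def __init__(self, name, label_name):
        self.name = name
        self.label_name = label_name

    def get_split(self):
        return {"train": [], "valid": [], "test": [], "label": self.label_name}


# Placeholder label names; real ones come from retrieve_label_name_list.
demo = splits_by_label("ToxCast", ["assay_a", "assay_b"], FakeTox)

# With the library installed, the real loader could be passed instead:
#   from tdc.single_pred import Tox
#   splits = splits_by_label('ToxCast', label_list[:3], Tox)
```

Passing the loader class as an argument keeps the helper testable offline and lets the same loop serve Tox21 and hERG Central.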

References:

[1] Richard, Ann M., et al. “ToxCast chemical landscape: paving the road to 21st century toxicology.” Chemical research in toxicology 29.8 (2016): 1225-1251.

Dataset License: CC BY 4.0.

ClinTox

Dataset Description: The ClinTox dataset includes drugs that have failed clinical trials for toxicity reasons and also drugs that are associated with successful trials.

Task Description: Binary classification. Given a drug SMILES string, predict the clinical toxicity.

Dataset Statistics: 1,484 drugs.

Dataset Split: Random Split, Scaffold Split

from tdc.single_pred import Tox
data = Tox(name = 'ClinTox')
split = data.get_split()

References:

[1] Gayvert, Kaitlyn M., Neel S. Madhukar, and Olivier Elemento. “A data-driven approach to predicting successes and failures of clinical trials.” Cell chemical biology 23.10 (2016): 1294-1301.

Dataset License: Not Specified. CC BY 4.0.