Toxicity Prediction Task Overview
Definition: Most drugs exhibit some degree of toxicity to humans. This learning task aims to accurately predict various types of toxicity of a drug molecule toward humans.
Impact: Toxicity is one of the primary causes of compound attrition. Studies show that approximately 70% of all toxicity-related attrition occurs preclinically (i.e., in cells and animals), and that these preclinical findings are strongly predictive of toxicities in humans. This suggests that early and accurate toxicity prediction can significantly reduce compound attrition and increase the likelihood that a compound reaches the market.
Generalization: As with ADME prediction, the drug structures of interest evolve over time, so a toxicity prediction model must generalize to novel drugs that have low structural similarity to the existing drug set.
Product: Small-molecule.
Pipeline: Efficacy and safety - lead development and optimization.
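The generalization requirement above is the motivation for offering scaffold splits alongside random splits. As a rough illustration only (this is not TDC's implementation), the sketch below groups molecules by a precomputed scaffold key and assigns whole groups to train, valid, or test, so that no scaffold appears in more than one subset. The function name `scaffold_split`, the `frac` parameter, and the assumption that scaffold strings are already available (for example, from a cheminformatics toolkit) are all hypothetical.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac=(0.7, 0.1, 0.2)):
    """Split item indices so that no scaffold group spans two subsets.

    `scaffolds` is a list of precomputed scaffold keys, one per molecule
    (an assumption of this sketch; computing them is out of scope here).
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first, so rare scaffolds tend to land in valid/test.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_cap = round(frac[0] * n)              # max size of train
    valid_cap = train_cap + round(frac[1] * n)  # max size of train + valid

    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cap:
            valid.extend(group)
        else:
            test.extend(group)
    return {"train": train, "valid": valid, "test": test}


# Hypothetical scaffold keys for ten molecules.
demo_scaffolds = ["A"] * 6 + ["B"] * 2 + ["C", "D"]
demo_split = scaffold_split(demo_scaffolds)
print(demo_split)
```

Because whole groups are assigned at once, the realized subset sizes only approximate the requested fractions; that trade-off is inherent to splitting by structure rather than by row.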
hERG blockers
Dataset Description: The human ether-à-go-go related gene (hERG) is crucial for the coordination of the heart's beating. If a drug blocks hERG, it can cause severe adverse effects. Reliable prediction of hERG liability in the early stages of drug design is therefore important for reducing the risk of cardiotoxicity-related attrition in later development stages.
Task Description: Binary classification. Given a drug SMILES string, predict whether it blocks hERG (1) or does not (0).
Dataset Statistics: 648 drugs.
Dataset Split: Random Split Scaffold Split
from tdc.single_pred import Tox
data = Tox(name = 'hERG')
split = data.get_split()
References:
Dataset License: CC BY 4.0.
hERG Karim et al.
Dataset Description: An integrated ether-à-go-go related gene (hERG) dataset consisting of molecular structures, given as SMILES strings, labelled as hERG blockers (<10uM) or non-blockers (>=10uM). It was compiled from DeepHIT, the BindingDB database, the ChEMBL bioactivity database, and other literature.
Task Description: Binary classification. Given a drug SMILES string, predict whether it blocks hERG (1, <10uM) or does not (0, >=10uM).
Dataset Statistics: 13,445 drugs.
Dataset Split: Random Split Scaffold Split
from tdc.single_pred import Tox
data = Tox(name = 'hERG_Karim')
split = data.get_split()
References:
Dataset License: CC BY 4.0.
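To make the labelling rule for this dataset concrete, here is a minimal sketch of the 10uM cutoff described above. The helper name `label_herg_blocker` and the example potency values are hypothetical and are not part of TDC or of the original dataset.

```python
def label_herg_blocker(potency_um, threshold_um=10.0):
    """Return 1 (blocker) if the measured value is below the threshold in uM, else 0."""
    return 1 if potency_um < threshold_um else 0


# Hypothetical measurements in uM, for illustration only.
example_values = [0.5, 9.99, 10.0, 35.0]
example_labels = [label_herg_blocker(v) for v in example_values]
print(example_labels)
```

Note that the boundary value of exactly 10uM falls on the non-blocker side, matching the ">=10uM" wording in the task description.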
AMES Mutagenicity
Dataset Description: Mutagenicity is the ability of a drug to induce genetic alterations. Drugs that damage DNA can cause cell death or other severe adverse effects. The most widely used assay for testing the mutagenicity of compounds is the Ames test, named after its inventor. It is a short-term bacterial reverse mutation assay that detects a large number of compounds capable of inducing genetic damage and frameshift mutations. The dataset is aggregated from four papers.
Task Description: Binary classification. Given a drug SMILES string, predict whether it is mutagenic (1) or not mutagenic (0).
Dataset Statistics: 7,255 drugs.
Dataset Split: Random Split Scaffold Split
from tdc.single_pred import Tox
data = Tox(name = 'AMES')
split = data.get_split()
References:
Dataset License: CC BY 4.0.
DILI (Drug Induced Liver Injury)
Dataset Description: Drug-induced liver injury (DILI) is a fatal liver disease caused by drugs, and it has been the single most frequent cause of safety-related drug marketing withdrawals for the past 50 years (e.g., iproniazid, ticrynafen, benoxaprofen). This dataset is aggregated from the U.S. FDA’s National Center for Toxicological Research.
Task Description: Binary classification. Given a drug SMILES string, predict whether it can cause liver injury (1) or not (0).
Dataset Statistics: 475 drugs.
Dataset Split: Random Split Scaffold Split
from tdc.single_pred import Tox
data = Tox(name = 'DILI')
split = data.get_split()
References:
Dataset License: CC BY 4.0.
Skin Reaction
Dataset Description: Repetitive exposure to a chemical agent can induce an immune reaction in inherently susceptible individuals that leads to skin sensitization. The dataset used in this study was retrieved from the ICCVAM (Interagency Coordinating Committee on the Validation of Alternative Methods) report on the rLLNA.
Task Description: Binary classification. Given a drug SMILES string, predict whether it can cause a skin reaction (1) or not (0).
Dataset Statistics: 404 drugs.
Dataset Split: Random Split Scaffold Split
from tdc.single_pred import Tox
data = Tox(name = 'Skin Reaction')
split = data.get_split()
References:
Dataset License: CC BY 4.0.
Acute Toxicity LD50
Dataset Description: Acute toxicity LD50 measures the most conservative dose that can lead to lethal adverse effects. The higher the dose, the more lethal the drug. This dataset is kindly provided by the authors of [1].
Task Description: Regression. Given a drug SMILES string, predict its acute toxicity.
Dataset Statistics: 7,385 drugs.
Dataset Split: Random Split Scaffold Split
from tdc.single_pred import Tox
data = Tox(name = 'LD50_Zhu')
split = data.get_split()
References:
Dataset License: CC BY 4.0.
Carcinogens
Dataset Description: A carcinogen is any substance, radionuclide, or radiation that promotes carcinogenesis, the formation of cancer. This may be due to the ability to damage the genome or to the disruption of cellular metabolic processes.
Task Description: Binary classification. Given a drug SMILES string, predict whether it is carcinogenic.
Dataset Statistics: 278 drugs.
Dataset Split: Random Split Scaffold Split
from tdc.single_pred import Tox
data = Tox(name = 'Carcinogens_Lagunin')
split = data.get_split()
References:
Dataset License: CC BY 4.0.
ClinTox
Dataset Description: The ClinTox dataset includes drugs that have failed clinical trials for toxicity reasons and also drugs that are associated with successful trials.
Task Description: Binary classification. Given a drug SMILES string, predict the clinical toxicity.
Dataset Statistics: 1,484 drugs.
Dataset Split: Random Split Scaffold Split
from tdc.single_pred import Tox
data = Tox(name = 'ClinTox')
split = data.get_split()
References:
Dataset License: CC BY 4.0.
hERG Central
Dataset Description: The human ether-à-go-go related gene (hERG) is crucial for the coordination of the heart's beating. If a drug blocks hERG, it can cause severe adverse effects. Reliable prediction of hERG liability in the early stages of drug design is therefore important for reducing the risk of cardiotoxicity-related attrition in later development stages. There are three targets: hERG_at_1uM, hERG_at_10uM, and hERG_inhib.
Task Description:
- hERG_at_1uM: Regression. Given a drug SMILES string, predict the percent inhibition at a 1µM concentration.
- hERG_at_10uM: Regression. Given a drug SMILES string, predict the percent inhibition at a 10µM concentration.
- hERG_inhib: Binary classification. Given a drug SMILES string, predict whether it blocks hERG (1) or does not (0). This is equivalent to whether hERG_at_10uM < -50, i.e., whether the compound has an IC50 of less than 10µM.
Dataset Statistics: 306,893 drugs.
Dataset Split: Random Split Scaffold Split
Note: hERG Central provides multiple labels. To retrieve a specific label, pass its name to the data loader through the label_name argument. You can find all available label names by calling:
from tdc.utils import retrieve_label_name_list
label_list = retrieve_label_name_list('herg_central')
Then, go to the standard TDC data loader procedure with the label name specified.
from tdc.single_pred import Tox
data = Tox(name = 'herg_central', label_name = label_list[0])
split = data.get_split()
References:
Dataset License: Not Specified. CC BY 4.0.
Contributed by: Ben Birnbaum.
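The relationship between the hERG_at_10uM and hERG_inhib targets stated in the task description can be sketched as a simple threshold. This is only an illustration of the rule as written above; the helper name `herg_inhib_from_percent` and the example readings are hypothetical.

```python
def herg_inhib_from_percent(pct_at_10um, cutoff=-50.0):
    """Return 1 (blocker) if the hERG_at_10uM value is below the cutoff, else 0."""
    return 1 if pct_at_10um < cutoff else 0


# Hypothetical hERG_at_10uM readings, for illustration only.
example_readings = [-80.0, -50.0, -12.5, 3.0]
derived_labels = [herg_inhib_from_percent(p) for p in example_readings]
print(derived_labels)
```

The comparison is strict, so a reading of exactly -50 is treated as a non-blocker, following the "< -50" wording.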
Tox21
Dataset Description: Tox21 is a data challenge that contains qualitative toxicity measurements for 7,831 compounds on 12 different targets, such as nuclear receptors and stress response pathways.
Task Description: Binary classification. Given a drug SMILES string, predict the toxicity in a specific assay.
Dataset Statistics: Varies by assay; roughly 6,000 drugs per assay.
Dataset Split: Random Split Scaffold Split
Note: Tox21 contains data from multiple assays. To retrieve the labels for a specific assay, pass its label name to the data loader through the label_name argument. You can find all available label names by calling:
from tdc.utils import retrieve_label_name_list
label_list = retrieve_label_name_list('Tox21')
Then, go to the standard TDC data loader procedure with the label name specified.
from tdc.single_pred import Tox
data = Tox(name = 'Tox21', label_name = label_list[0])
split = data.get_split()
References:
Dataset License: Not Specified. CC BY 4.0.
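Since each assay is loaded separately, it can be convenient to collect the splits for several labels in one pass. The sketch below is one possible way to do that, not an official TDC utility: `load_assay_splits` and `_StubLoader` are hypothetical names, the loader class is passed in as a parameter so the example runs without TDC or network access, and the placeholder label names are invented for illustration.

```python
def load_assay_splits(dataset_name, label_names, loader_cls):
    """Build one loader per label and return {label_name: split}."""
    splits = {}
    for label in label_names:
        data = loader_cls(name=dataset_name, label_name=label)
        splits[label] = data.get_split()
    return splits


class _StubLoader:
    """Offline stand-in that mimics the loader interface used above."""

    def __init__(self, name, label_name):
        self.name = name
        self.label_name = label_name

    def get_split(self):
        # Assumed shape for this sketch: one entry per subset.
        return {"train": [], "valid": [], "test": []}


# Placeholder label names; real ones come from retrieve_label_name_list.
demo_splits = load_assay_splits("Tox21", ["assay_a", "assay_b"], _StubLoader)
print(sorted(demo_splits))
```

With TDC installed, the same function could presumably be called with `Tox` as `loader_cls` and the output of `retrieve_label_name_list('Tox21')` as `label_names`.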
ToxCast
Dataset Description: ToxCast includes qualitative results of over 600 experiments on 8k compounds.
Task Description: Binary classification. Given a drug SMILES string, predict the toxicity in a specific assay.
Dataset Statistics: Varies by assay, from a couple hundred to thousands of drugs.
Dataset Split: Random Split Scaffold Split
Note: ToxCast contains data from multiple assays. To retrieve the labels for a specific assay, pass its label name to the data loader through the label_name argument. You can find all available label names by calling:
from tdc.utils import retrieve_label_name_list
label_list = retrieve_label_name_list('ToxCast')
Then, go to the standard TDC data loader procedure with the label name specified.
from tdc.single_pred import Tox
data = Tox(name = 'ToxCast', label_name = label_list[0])
split = data.get_split()
References:
Dataset License: CC BY 4.0.