High-throughput Screening Prediction Task Overview

Definition: High-throughput screening (HTS) is the rapid, automated testing of thousands to millions of samples for biological activity at the model-organism, cellular, pathway, or molecular level. The assay readout can range from target binding affinity to fluorescence microscopy of drug-treated cells. HTS can be applied to many kinds of therapeutics; however, most available data come from screens of small-molecule libraries. In this task, a machine learning model predicts experimental assay values from a small-molecule compound structure.

Impact: High-throughput screening is a critical component of small-molecule drug discovery in both industrial and academic research settings. Increasingly complex assays are now being automated to gain biological insight into compound activity at scale. However, the time and cost of screening a large library still limit experimental throughput. Machine learning models that predict experimental outcomes can ease these limits, saving time and cost by searching a larger chemical space computationally and narrowing it to a small set of likely candidates for further, smaller-scale HTS.

Generalization: The model should generalize across structurally diverse drugs, and methods should also generalize across cell lines. Drug dosage and measurement time points are further important factors in determining the efficacy of a drug.

Product: Small-molecule.

Pipeline: Activity - hit identification.

SARS-CoV-2 In Vitro, Touret et al.

Dataset Description: An in vitro screen of the Prestwick Chemical Library, composed of 1,480 approved drugs, in an infected-cell-based assay. From MIT AI Cures.

Task Description: Binary classification. Given a drug SMILES string, predict its activity against SARS-CoV-2.

Dataset Statistics: 1,480 drugs.

Dataset Split: Random Split, Scaffold Split.

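from tdc.single_pred import HTS
# Load the dataset by name (downloaded on first use).
data = HTS(name = 'SARSCoV2_Vitro_Touret')
# Returns train/valid/test partitions; the default is a random split.
split = data.get_split()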
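
As a rough illustration of how the two split types listed above differ: a random split shuffles compounds individually, whereas a scaffold split keeps all compounds that share a core scaffold in the same partition, so the test set contains chemotypes not seen in training. The sketch below is not TDC's implementation. `scaffold_split`, `RECORDS`, and the hand-assigned scaffold strings are hypothetical names and toy data; in practice the scaffolds would be computed with a cheminformatics toolkit rather than typed in by hand.

```python
from collections import defaultdict

# Toy (SMILES, scaffold) pairs. The scaffold strings are hand-assigned for
# illustration only (an assumption, not toolkit output).
RECORDS = [
    ("Cc1ccccc1", "c1ccccc1"),
    ("Oc1ccccc1", "c1ccccc1"),
    ("Nc1ccccc1", "c1ccccc1"),
    ("Clc1ccccc1", "c1ccccc1"),
    ("Cc1ccncc1", "c1ccncc1"),
    ("Nc1ccncc1", "c1ccncc1"),
    ("Oc1ccncc1", "c1ccncc1"),
    ("CC1CCCCC1", "C1CCCCC1"),
    ("OC1CCCCC1", "C1CCCCC1"),
    ("Cc1ccco1", "c1ccoc1"),
]


def scaffold_split(records, frac=(0.7, 0.1, 0.2)):
    """Greedy group-wise split: a scaffold group is never divided."""
    groups = defaultdict(list)
    for smiles, scaffold in records:
        groups[scaffold].append(smiles)

    # Place the largest scaffold groups first (a common heuristic).
    ordered = sorted(groups.values(), key=len, reverse=True)
    n_total = sum(len(group) for group in ordered)
    train_cap = frac[0] * n_total
    train_valid_cap = (frac[0] + frac[1]) * n_total

    split = {"train": [], "valid": [], "test": []}
    for group in ordered:
        n_train = len(split["train"])
        n_valid = len(split["valid"])
        if n_train + len(group) <= train_cap:
            split["train"].extend(group)
        elif n_train + n_valid + len(group) <= train_valid_cap:
            split["valid"].extend(group)
        else:
            split["test"].extend(group)
    return split


split = scaffold_split(RECORDS)
for part, members in split.items():
    print(part, members)
```

Because whole groups are assigned at once, the realized partition sizes only approximate the requested fractions, which is the usual trade-off of a scaffold split on small datasets.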

References:

[1] Touret, F., Gilles, M., Barral, K. et al. In vitro screening of a FDA approved chemical library reveals potential inhibitors of SARS-CoV-2 replication. Sci Rep 10, 13093 (2020).

[2] MIT AI Cures.

Dataset License: CC BY 4.0.

SARS-CoV-2 3CL Protease, Diamond

Dataset Description: A large XChem crystallographic fragment screen against SARS-CoV-2 main protease at high resolution. From MIT AI Cures.

Task Description: Binary classification. Given a drug SMILES string, predict its activity against the SARS-CoV-2 3CL protease.

Dataset Statistics: 879 drugs.

Dataset Split: Random Split, Scaffold Split.

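from tdc.single_pred import HTS
# Load the dataset by name (downloaded on first use).
data = HTS(name = 'SARSCoV2_3CLPro_Diamond')
# Returns train/valid/test partitions; the default is a random split.
split = data.get_split()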

References:

[1] Diamond Light Source

[2] MIT AI Cures.

Dataset License: Not Specified. CC BY 4.0.

HIV

Dataset Description: The HIV dataset was introduced by the Developmental Therapeutics Program (DTP) AIDS Antiviral Screen, which tested over 40,000 compounds for their ability to inhibit HIV replication. From MoleculeNet.

Task Description: Binary classification. Given a drug SMILES string, predict its activity against HIV.

Dataset Statistics: 41,127 drugs.

Dataset Split: Random Split, Scaffold Split.

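from tdc.single_pred import HTS
# Load the dataset by name (downloaded on first use).
data = HTS(name = 'HIV')
# Returns train/valid/test partitions; the default is a random split.
split = data.get_split()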

References:

[1] AIDS Antiviral Screen Data.

[2] Wu, Zhenqin, et al. “MoleculeNet: a benchmark for molecular machine learning.” Chemical science 9.2 (2018): 513-530.

Dataset License: CC BY 4.0.

Butkiewicz et al.

Dataset Description: These are nine high-quality high-throughput screening (HTS) datasets from [1], curated from HTS data in the PubChem database [2]. Typically, HTS categorizes small molecules as hit, inactive, or unspecified against a given therapeutic target. However, a compound may be falsely classified as a hit because of experimental artifacts such as optical interference. Moreover, because the screening is performed without duplicates and the cutoff is often set loosely to minimize the false negative rate, results from primary screens often contain many false positives [3]. Hence the result of the primary screen serves only as a first iteration that reduces the compound library to a smaller set for further confirmatory tests. Here, each dataset is carefully collated through confirmation screens to validate active compounds. The curation process is documented in [1]. Each dataset is identified by its PubChem Assay ID (AID). Features of the datasets: (1) at least 150 confirmed active compounds; (2) diverse target classes; (3) realistic scale (a large number of compounds with highly imbalanced labels).

Task Description: Binary classification. Given a compound SMILES string, predict its activity against a diverse set of targets.

Dataset Statistics:

Protein Target Class	PubChem AID	Protein Target	Total # of Molecules	# of Active Molecules
GPCR	435008	Orexin1 Receptor	218,158	233
GPCR	1798	M1 Muscarinic Receptor Agonists	61,833	187
GPCR	435034	M1 Muscarinic Receptor Antagonists	61,756	362
Ion Channel	1843	Potassium Ion Channel Kir2.1	301,493	172
Ion Channel	2258	KCNQ2 Potassium Channel	302,405	213
Ion Channel	463087	Cav3 T-type Calcium Channels	100,875	703
Transporter	488997	Choline Transporter	302,306	252
Kinase	2689	Serine/Threonine Kinase 33	319,792	172
Enzyme	485290	Tyrosyl-DNA Phosphodiesterase	341,365	281
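
To make the label imbalance in the table above concrete, the following sketch computes the hit rate of each assay and the accuracy that a trivial "everything is inactive" classifier would reach. The counts are copied from the table; `ASSAYS`, `hit_rate`, and `all_inactive_accuracy` are hypothetical names introduced only for this illustration.

```python
# (PubChem AID, protein target, total molecules, active molecules),
# copied from the dataset statistics table.
ASSAYS = [
    (435008, "Orexin1 Receptor", 218_158, 233),
    (1798, "M1 Muscarinic Receptor Agonists", 61_833, 187),
    (435034, "M1 Muscarinic Receptor Antagonists", 61_756, 362),
    (1843, "Potassium Ion Channel Kir2.1", 301_493, 172),
    (2258, "KCNQ2 Potassium Channel", 302_405, 213),
    (463087, "Cav3 T-type Calcium Channels", 100_875, 703),
    (488997, "Choline Transporter", 302_306, 252),
    (2689, "Serine/Threonine Kinase 33", 319_792, 172),
    (485290, "Tyrosyl-DNA Phosphodiesterase", 341_365, 281),
]


def hit_rate(total, active):
    """Fraction of screened molecules confirmed active."""
    return active / total


def all_inactive_accuracy(total, active):
    """Accuracy of a classifier that labels every molecule inactive."""
    return (total - active) / total


rates = {aid: hit_rate(total, active) for aid, _, total, active in ASSAYS}

for aid, target, total, active in ASSAYS:
    print(
        f"AID {aid:>6}  {target:<36} "
        f"hit rate {hit_rate(total, active):.3%}  "
        f"all-inactive accuracy {all_inactive_accuracy(total, active):.3%}"
    )
```

Every assay has a hit rate below 1%, so the all-inactive baseline exceeds 99% accuracy on each of them. Accuracy alone is therefore a poor yardstick for these datasets.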

Dataset Split: Random Split, Scaffold Split.

from tdc.single_pred import HTS
# Orexin1 Receptor
data = HTS(name = 'orexin1_receptor_butkiewicz')
df = data.get_data()
splits = data.get_split()

# M1 Muscarinic Receptor Agonists
data = HTS(name = 'm1_muscarinic_receptor_agonists_butkiewicz')
df = data.get_data()
splits = data.get_split()

# M1 Muscarinic Receptor Antagonists
data = HTS(name = 'm1_muscarinic_receptor_antagonists_butkiewicz')
df = data.get_data()
splits = data.get_split()

# Potassium Ion Channel Kir2.1 
data = HTS(name = 'potassium_ion_channel_kir2.1_butkiewicz')
df = data.get_data()
splits = data.get_split()

# KCNQ2 Potassium Channel
data = HTS(name = 'kcnq2_potassium_channel_butkiewicz')
df = data.get_data()
splits = data.get_split()

# Cav3 T-type Calcium Channels
data = HTS(name = 'cav3_t-type_calcium_channels_butkiewicz')
df = data.get_data()
splits = data.get_split()

# Choline Transporter
data = HTS(name = 'choline_transporter_butkiewicz')
df = data.get_data()
splits = data.get_split()

# Serine/Threonine Kinase 33
data = HTS(name = 'serine_threonine_kinase_33_butkiewicz')
df = data.get_data()
splits = data.get_split()

# Tyrosyl-DNA Phosphodiesterase
data = HTS(name = 'tyrosyl-dna_phosphodiesterase_butkiewicz')
df = data.get_data()
splits = data.get_split()
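
The nine loading snippets above share one pattern, so they could also be driven from a lookup table keyed by PubChem AID. The following is a minimal sketch under stated assumptions: `BUTKIEWICZ_DATASETS` and `load_butkiewicz` are hypothetical names, the dataset names are copied from the snippets above, and the `method` and `seed` keyword arguments of `get_split` are assumed to be supported by the installed PyTDC version. The `tdc` import is deferred so that the table itself can be used without the package installed.

```python
# PubChem AID -> TDC dataset name, taken from the table and snippets above.
BUTKIEWICZ_DATASETS = {
    435008: "orexin1_receptor_butkiewicz",
    1798: "m1_muscarinic_receptor_agonists_butkiewicz",
    435034: "m1_muscarinic_receptor_antagonists_butkiewicz",
    1843: "potassium_ion_channel_kir2.1_butkiewicz",
    2258: "kcnq2_potassium_channel_butkiewicz",
    463087: "cav3_t-type_calcium_channels_butkiewicz",
    488997: "choline_transporter_butkiewicz",
    2689: "serine_threonine_kinase_33_butkiewicz",
    485290: "tyrosyl-dna_phosphodiesterase_butkiewicz",
}


def load_butkiewicz(aid, method="scaffold", seed=42):
    """Hypothetical helper: load one dataset by AID and return its splits."""
    # Imported lazily so this module can be read without PyTDC installed.
    from tdc.single_pred import HTS

    data = HTS(name=BUTKIEWICZ_DATASETS[aid])
    # Assumption: get_split accepts `method` and `seed` keyword arguments.
    return data.get_split(method=method, seed=seed)


# Example (requires PyTDC and network access, so it is left commented out):
# splits = load_butkiewicz(435008)
```

Keying the table by AID keeps the code aligned with the dataset statistics table, where the AID is the identifier of each assay.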

References:

[1] Butkiewicz, Mariusz, et al. “Benchmarking ligand-based virtual High-Throughput Screening with the PubChem database.” Molecules 18.1 (2013): 735-756.

[2] Kim, Sunghwan, et al. “PubChem 2019 update: improved access to chemical data.” Nucleic acids research 47.D1 (2019): D1102-D1109.

[3] Butkiewicz, Mariusz, et al. “High-throughput screening assay datasets from the PubChem database.” Chemical informatics (Wilmington, Del.) 3.1 (2017).

Dataset License: Not Specified. CC BY 4.0.