High-throughput Screening Prediction Task Overview

Definition: High-throughput screening (HTS) is the rapid automated testing of thousands to millions of samples for biological activity at the model organism, cellular, pathway, or molecular level. The assay readout can vary from target binding affinity to fluorescence microscopy of cells treated with a drug. HTS can be applied to different kinds of therapeutics; however, most available data come from screens of small-molecule libraries. In this task, a machine learning model is asked to predict experimental assay values from a small-molecule compound structure.

Impact: High-throughput screening is a critical component of small-molecule drug discovery in both industrial and academic research settings. Increasingly complex assays are now being automated to gain biological insight into compound activity at large scale. However, the time and cost of screening a large library still limit experimental throughput. Machine learning models that predict experimental outcomes can ease these limits and save substantial time and cost by exploring a larger chemical space and narrowing it down to a small set of highly likely candidates for further, smaller-scale HTS.

Generalization: The model should be able to generalize over structurally diverse drugs. It is also important for methods to generalize across cell lines. Drug dosage and measurement time points are also very important factors in determining the efficacy of the drug.

Product: Small-molecule.

Pipeline: Activity - hit identification.

SARS-CoV-2 In Vitro, Touret et al.

Dataset Description: An in vitro screen of the Prestwick chemical library, composed of 1,480 approved drugs, in an infected cell-based assay. From MIT AI Cures.

Task Description: Binary classification. Given a drug SMILES string, predict its activity against SARS-CoV-2.

Dataset Statistics: 1,480 drugs.

Dataset Split: Random Split, Scaffold Split

from tdc.single_pred import HTS
data = HTS(name = 'SARSCoV2_Vitro_Touret')
split = data.get_split()
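
The scaffold split listed above is intended to test generalization to structurally different compounds: molecules that share a scaffold are kept in the same subset. In TDC it can be requested with data.get_split(method = 'scaffold'). The block below is only a minimal sketch of the grouping idea, not TDC's implementation. It assumes each compound already has a precomputed scaffold key (for example a Bemis-Murcko scaffold from a cheminformatics toolkit); the function name scaffold_split, the toy labels, and the 70/10/20 fractions are illustrative assumptions.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac=(0.7, 0.1, 0.2)):
    """Split compound indices so that no scaffold appears in two subsets.

    `scaffolds` holds one precomputed scaffold key per compound
    (an assumption of this sketch; it does not parse SMILES itself).
    """
    # Group compound indices by their scaffold key.
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest scaffold families first; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_cap = frac[0] * n
    valid_cap = frac[1] * n
    train, valid, test = [], [], []
    for group in ordered:
        # A whole scaffold group is always assigned to a single subset.
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        elif len(valid) + len(group) <= valid_cap:
            valid.extend(group)
        else:
            test.extend(group)
    return {"train": train, "valid": valid, "test": test}


# Toy example: ten compounds with hypothetical scaffold labels.
toy_scaffolds = ["A", "A", "A", "A", "B", "B", "B", "C", "D", "E"]
split = scaffold_split(toy_scaffolds)
print({name: len(indices) for name, indices in split.items()})
```

Because whole scaffold groups are assigned at once, the realized subset sizes only approximate the requested fractions. A random split, by contrast, can place near-identical analogues in both training and test sets.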

References:

[1] Touret, F., Gilles, M., Barral, K. et al. In vitro screening of a FDA approved chemical library reveals potential inhibitors of SARS-CoV-2 replication. Sci Rep 10, 13093 (2020).

[2] MIT AI Cures.

Dataset License: CC BY 4.0.


SARS-CoV-2 3CL Protease, Diamond

Dataset Description: A large, high-resolution XChem crystallographic fragment screen against the SARS-CoV-2 main protease. From MIT AI Cures.

Task Description: Binary classification. Given a drug SMILES string, predict its activity against the SARS-CoV-2 3CL protease.

Dataset Statistics: 879 drugs.

Dataset Split: Random Split, Scaffold Split

from tdc.single_pred import HTS
data = HTS(name = 'SARSCoV2_3CLPro_Diamond')
split = data.get_split()

References:

[1] Diamond Light Source

[2] MIT AI Cures.

Dataset License: Not Specified. CC BY 4.0.


HIV

Dataset Description: The HIV dataset was introduced by the Developmental Therapeutics Program (DTP) AIDS Antiviral Screen, which tested over 40,000 compounds for their ability to inhibit HIV replication. From MoleculeNet.

Task Description: Binary classification. Given a drug SMILES string, predict its activity against HIV.

Dataset Statistics: 41,127 drugs.

Dataset Split: Random Split, Scaffold Split

from tdc.single_pred import HTS
data = HTS(name = 'HIV')
split = data.get_split()

References:

[1] AIDS Antiviral Screen Data.

[2] Wu, Zhenqin, et al. “MoleculeNet: a benchmark for molecular machine learning.” Chemical science 9.2 (2018): 513-530.

Dataset License: CC BY 4.0.


Butkiewicz et al.

Dataset Description: These are nine high-quality high-throughput screening (HTS) datasets from [1]. The datasets were curated from HTS data in the PubChem database [2]. Typically, HTS categorizes small molecules as hit, inactive, or unspecified against a given therapeutic target. However, a compound may be falsely classified as a hit because of experimental artifacts such as optical interference. Moreover, because the screening is performed without duplicates and the cutoff is often set loosely to minimize the false-negative rate, results from primary screens often contain a high rate of false positives [3]. Hence the result of the primary screen is used only as a first pass to reduce the compound library to a smaller set for further confirmatory tests. Here each dataset is carefully collated through confirmation screens that validate the active compounds. The curation process is documented in [1]. Each dataset is identified by its PubChem Assay ID (AID). Features of the datasets: (1) at least 150 confirmed active compounds; (2) diverse target classes; (3) realistic (large numbers of compounds and highly imbalanced labels).

Task Description: Binary classification. Given a compound SMILES string, predict its activity against a diverse set of targets.

Dataset Statistics:

| Protein Target Class | PubChem AID | Protein Target | Total # of Molecules | # of Active Molecules |
|---|---|---|---|---|
| GPCR | 435008 | Orexin1 Receptor | 218,158 | 233 |
| GPCR | 1798 | M1 Muscarinic Receptor Agonists | 61,833 | 187 |
| GPCR | 435034 | M1 Muscarinic Receptor Antagonists | 61,756 | 362 |
| Ion Channel | 1843 | Potassium Ion Channel Kir2.1 | 301,493 | 172 |
| Ion Channel | 2258 | KCNQ2 Potassium Channel | 302,405 | 213 |
| Ion Channel | 463087 | Cav3 T-type Calcium Channels | 100,875 | 703 |
| Transporter | 488997 | Choline Transporter | 302,306 | 252 |
| Kinase | 2689 | Serine/Threonine Kinase 33 | 319,792 | 172 |
| Enzyme | 485290 | Tyrosyl-DNA Phosphodiesterase | 341,365 | 281 |
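
To make the "highly imbalanced" property concrete, the following sketch computes the fraction of active compounds for each assay from the counts in the table above. It also reports the accuracy that a trivial classifier labeling every compound inactive would reach, which suggests why plain accuracy is a poor metric for these tasks. The names BUTKIEWICZ_STATS and active_fraction are illustrative and not part of TDC.

```python
# (PubChem AID, protein target, total molecules, active molecules),
# copied from the dataset statistics table.
BUTKIEWICZ_STATS = [
    (435008, "Orexin1 Receptor", 218_158, 233),
    (1798, "M1 Muscarinic Receptor Agonists", 61_833, 187),
    (435034, "M1 Muscarinic Receptor Antagonists", 61_756, 362),
    (1843, "Potassium Ion Channel Kir2.1", 301_493, 172),
    (2258, "KCNQ2 Potassium Channel", 302_405, 213),
    (463087, "Cav3 T-type Calcium Channels", 100_875, 703),
    (488997, "Choline Transporter", 302_306, 252),
    (2689, "Serine/Threonine Kinase 33", 319_792, 172),
    (485290, "Tyrosyl-DNA Phosphodiesterase", 341_365, 281),
]


def active_fraction(total, active):
    """Fraction of screened molecules that are confirmed actives."""
    return active / total


for aid, target, total, active in BUTKIEWICZ_STATS:
    frac = active_fraction(total, active)
    # Predicting "inactive" for everything is correct on every inactive compound.
    baseline_accuracy = 1.0 - frac
    print(
        f"AID {aid:>6}  {target:<36} "
        f"actives: {frac:.3%}  all-inactive accuracy: {baseline_accuracy:.3%}"
    )
```

In every assay fewer than 1% of the molecules are active, so metrics that focus on the positive class (for example precision-recall based scores) are usually more informative than accuracy.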

Dataset Split: Random Split, Scaffold Split

from tdc.single_pred import HTS
# Orexin1 Receptor
data = HTS(name = 'orexin1_receptor_butkiewicz')
df = data.get_data()
splits = data.get_split()

# M1 Muscarinic Receptor Agonists
data = HTS(name = 'm1_muscarinic_receptor_agonists_butkiewicz')
df = data.get_data()
splits = data.get_split()

# M1 Muscarinic Receptor Antagonists
data = HTS(name = 'm1_muscarinic_receptor_antagonists_butkiewicz')
df = data.get_data()
splits = data.get_split()

# Potassium Ion Channel Kir2.1 
data = HTS(name = 'potassium_ion_channel_kir2.1_butkiewicz')
df = data.get_data()
splits = data.get_split()

# KCNQ2 Potassium Channel
data = HTS(name = 'kcnq2_potassium_channel_butkiewicz')
df = data.get_data()
splits = data.get_split()

# Cav3 T-type Calcium Channels
data = HTS(name = 'cav3_t-type_calcium_channels_butkiewicz')
df = data.get_data()
splits = data.get_split()

# Choline Transporter
data = HTS(name = 'choline_transporter_butkiewicz')
df = data.get_data()
splits = data.get_split()

# Serine/Threonine Kinase 33
data = HTS(name = 'serine_threonine_kinase_33_butkiewicz')
df = data.get_data()
splits = data.get_split()

# Tyrosyl-DNA Phosphodiesterase
data = HTS(name = 'tyrosyl-dna_phosphodiesterase_butkiewicz')
df = data.get_data()
splits = data.get_split()
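
Since the nine loaders above differ only in the dataset name, they can also be driven from a list. The sketch below is one possible way to do that; load_all and fake_loader are hypothetical helpers, and the stand-in loader exists only so the example runs without PyTDC or a network connection. The dataset names are the ones used in the snippets above.

```python
# Dataset names exactly as used in the individual loading snippets.
BUTKIEWICZ_DATASETS = [
    "orexin1_receptor_butkiewicz",
    "m1_muscarinic_receptor_agonists_butkiewicz",
    "m1_muscarinic_receptor_antagonists_butkiewicz",
    "potassium_ion_channel_kir2.1_butkiewicz",
    "kcnq2_potassium_channel_butkiewicz",
    "cav3_t-type_calcium_channels_butkiewicz",
    "choline_transporter_butkiewicz",
    "serine_threonine_kinase_33_butkiewicz",
    "tyrosyl-dna_phosphodiesterase_butkiewicz",
]


def load_all(loader, names=BUTKIEWICZ_DATASETS):
    """Call loader(name=...) once per dataset and return {name: result}."""
    return {name: loader(name=name) for name in names}


# With PyTDC installed, the real loader would be passed in like this
# (each call downloads the dataset on first use):
#   from tdc.single_pred import HTS
#   datasets = load_all(HTS)
#   splits = {name: d.get_split() for name, d in datasets.items()}


def fake_loader(name):
    """Hypothetical stand-in for HTS so the sketch runs offline."""
    return {"name": name}


datasets = load_all(fake_loader)
print(f"{len(datasets)} datasets prepared")
```

Passing the loader as an argument keeps the loop independent of TDC, which makes it easy to swap in a cached or mocked loader during testing.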

References:

[1] Butkiewicz, Mariusz, et al. “Benchmarking ligand-based virtual High-Throughput Screening with the PubChem database.” Molecules 18.1 (2013): 735-756.

[2] Kim, Sunghwan, et al. “PubChem 2019 update: improved access to chemical data.” Nucleic acids research 47.D1 (2019): D1102-D1109.

[3] Butkiewicz, Mariusz, et al. “High-throughput screening assay datasets from the pubchem database.” Chemical informatics (Wilmington, Del.) 3.1 (2017).

Dataset License: Not Specified. CC BY 4.0.