High-throughput Screening Prediction Task Overview

Definition: High-throughput screening (HTS) is the rapid automated testing of thousands to millions of samples for biological activity at the model organism, cellular, pathway, or molecular level. The assay readout can range from target binding affinity to fluorescence microscopy of drug-treated cells. HTS can be applied to different kinds of therapeutics; however, most available data come from testing small-molecule libraries. In this task, a machine learning model is asked to predict experimental assay values given a small-molecule compound structure.

Impact: High-throughput screening is a critical component of small-molecule drug discovery in both industrial and academic research settings. Increasingly complex assays are now being automated to gain biological insight into compound activity at large scale. However, the time and cost of screening a large library still limit experimental throughput. Machine learning models that predict experimental outcomes can ease these limits, saving time and cost by exploring a larger chemical space computationally and narrowing it down to a small set of likely candidates for further, smaller-scale HTS.

Generalization: The model should be able to generalize over structurally diverse drugs. It is also important for methods to generalize across cell lines. Drug dosage and measurement time points are also very important factors in determining the efficacy of the drug.
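One common way to test generalization over structurally diverse drugs is to keep compounds that share a core scaffold in the same partition, so the test set contains scaffolds the model never saw in training. The following is a minimal sketch of that idea, not TDC's implementation. It assumes a scaffold key has already been computed for each compound (for example with a cheminformatics toolkit); the function name `scaffold_group_split`, the greedy largest-first assignment, and the toy keys are all illustrative assumptions.

```python
from collections import defaultdict

def scaffold_group_split(scaffolds, frac=(0.7, 0.1, 0.2)):
    """Split compound indices so that no scaffold spans two partitions.

    scaffolds: one scaffold key per compound (assumed precomputed).
    frac: target train/valid/test fractions; actual sizes may deviate
          because whole scaffold groups are assigned together.
    """
    # Collect the indices of all compounds sharing each scaffold key.
    groups = defaultdict(list)
    for idx, key in enumerate(scaffolds):
        groups[key].append(idx)

    # Assign the largest scaffold families first.
    ordered = sorted(groups.values(), key=len, reverse=True)

    n = len(scaffolds)
    train_cap = frac[0] * n
    valid_cap = (frac[0] + frac[1]) * n

    split = {"train": [], "valid": [], "test": []}
    for members in ordered:
        n_train = len(split["train"])
        n_valid = len(split["valid"])
        if n_train + len(members) <= train_cap:
            split["train"].extend(members)
        elif n_train + n_valid + len(members) <= valid_cap:
            split["valid"].extend(members)
        else:
            split["test"].extend(members)
    return split

# Toy scaffold keys (hypothetical labels, not real chemistry).
toy = ["A"] * 5 + ["B"] * 2 + ["C"] * 2 + ["D"]
parts = scaffold_group_split(toy)
```

Because whole groups move together, partition sizes only approximate the requested fractions, and on very small inputs a partition can end up empty.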

Product: Small-molecule.

Pipeline: Activity - hit identification.

SARS-CoV-2 In Vitro, Touret et al.

Dataset Description: An in vitro screen of the Prestwick Chemical Library, composed of 1,480 approved drugs, in an infected cell-based assay. From MIT AI Cures.

Task Description: Binary classification. Given a drug SMILES string, predict its activity against SARS-CoV-2.

Dataset Statistics: 1,480 drugs.

Dataset Split: Random Split, Scaffold Split

from tdc.single_pred import HTS
data = HTS(name = 'SARSCoV2_Vitro_Touret')
split = data.get_split()
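The snippet above uses the default split settings. To request the scaffold split instead, `get_split` takes a `method` argument along with `seed` and `frac`. The sketch below wraps that call in a hypothetical helper, `load_hts_split`; the helper name, the `loader` parameter (added here only so the dataset class can be swapped out), and the particular `seed` and `frac` values are assumptions for illustration. It requires the PyTDC package and downloads the dataset on first use.

```python
def load_hts_split(name, method="scaffold", seed=42,
                   frac=(0.7, 0.1, 0.2), loader=None):
    """Load an HTS dataset by name and return its train/valid/test split.

    loader: dataset class to instantiate; defaults to tdc.single_pred.HTS.
            This parameter is a hypothetical convenience, not part of TDC.
    """
    if loader is None:
        # Imported lazily so the helper can be defined without PyTDC installed.
        from tdc.single_pred import HTS as loader
    data = loader(name=name)
    # frac gives the train, valid, and test fractions, in that order.
    return data.get_split(method=method, seed=seed, frac=list(frac))

# Example call (downloads data, so it is left commented out):
# split = load_hts_split('SARSCoV2_Vitro_Touret')
```

The returned object is a dictionary keyed by `train`, `valid`, and `test`. The same helper applies to the other dataset names on this page by changing `name`.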

References:

[1] Touret, F., Gilles, M., Barral, K. et al. In vitro screening of a FDA approved chemical library reveals potential inhibitors of SARS-CoV-2 replication. Sci Rep 10, 13093 (2020).

[2] MIT AI Cures.

Dataset License: CC BY 4.0.


SARS-CoV-2 3CL Protease, Diamond

Dataset Description: A large XChem crystallographic fragment screen against the SARS-CoV-2 main protease at high resolution. From MIT AI Cures.

Task Description: Binary classification. Given a drug SMILES string, predict its activity against the SARS-CoV-2 3CL protease.

Dataset Statistics: 879 drugs.

Dataset Split: Random Split, Scaffold Split

from tdc.single_pred import HTS
data = HTS(name = 'SARSCoV2_3CLPro_Diamond')
split = data.get_split()

References:

[1] Diamond Light Source

[2] MIT AI Cures.

Dataset License: Not Specified. CC BY 4.0.


HIV

Dataset Description: The HIV dataset was introduced by the Developmental Therapeutics Program (DTP) AIDS Antiviral Screen, which tested over 40,000 compounds for their ability to inhibit HIV replication. From MoleculeNet.

Task Description: Binary classification. Given a drug SMILES string, predict its activity against HIV.

Dataset Statistics: 41,127 drugs.

Dataset Split: Random Split, Scaffold Split

from tdc.single_pred import HTS
data = HTS(name = 'HIV')
split = data.get_split()

References:

[1] AIDS Antiviral Screen Data.

[2] Wu, Zhenqin, et al. “MoleculeNet: a benchmark for molecular machine learning.” Chemical Science 9.2 (2018): 513-530.

Dataset License: CC BY 4.0.