Single-cell Drug-Target Nomination (Identification) Task Overview

Definition: TDC-2 introduces the TDC.scDTI task. The goal is to train a model that predicts the probability that a protein is a candidate therapeutic target in a specific cell type. The model learns an estimator for a function that takes a protein target and a cell-type-specific biological context as input and outputs the probability that the candidate protein is a therapeutic target in that cell type.

Impact: Single-cell data have enabled the study of gene expression and function at the level of individual cells across healthy and disease states. To facilitate biological discoveries using single-cell data, machine-learning models have been developed to capture the complex, cell-type-specific behavior of genes. In addition to providing the single-cell measurements and foundation models, TDC-2 supports the development of contextual AI models to nominate therapeutic targets in a cell type-specific manner.

Generalization: Models are expected to have strong performance on cell-context-specific evaluation metrics across different sets of disease-specific proteins and cells.

Product: Small-molecule.

Pipeline: Nomination / Identification.

(Li, Michelle, et al.)

Dataset Description: To curate target information for a therapeutic area, we examine the drugs indicated for the therapeutic area of interest and its descendants. The two therapeutic areas examined are rheumatoid arthritis (RA) and inflammatory bowel disease (IBD). Positive examples (i.e., where the label y = 1) are proteins targeted by drugs that have completed at least phase 2 of clinical trials for treating a specific therapeutic area. As such, a protein is a promising candidate if a compound that targets the protein is safe for humans and effective for treating the disease. Negative examples (i.e., where the label y = 0) are druggable proteins that do not have any known association with the therapeutic area of interest according to Open Targets. A protein is deemed druggable if it is targeted by at least one existing drug. We extract drugs and their nominal targets from DrugBank. For both positive and negative examples, we retain only the training examples activated in at least one cell type-specific protein interaction network.
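The labeling rule above can be sketched as a small function. This is a minimal illustration under stated assumptions, not the authors' curation code: the function name and its four inputs (`max_completed_phase`, `area_associated`, `druggable`, `active_in_network`) are hypothetical stand-ins for lookups that would be built from clinical-trial records, Open Targets, DrugBank, and the cell type-specific networks. The protein names and values are toy data.

```python
from typing import Dict, Optional, Set


def label_protein(
    protein: str,
    max_completed_phase: Dict[str, int],  # hypothetical: highest completed trial phase per target
    area_associated: Set[str],            # hypothetical: proteins with any known link to the area
    druggable: Set[str],                  # hypothetical: proteins targeted by at least one drug
    active_in_network: Set[str],          # hypothetical: proteins active in >= 1 cell type network
) -> Optional[int]:
    """Return 1 (positive), 0 (negative), or None (excluded from the dataset)."""
    # Both classes are kept only if the protein is active in some cell type network.
    if protein not in active_in_network:
        return None
    # Positive: targeted by a drug that completed at least phase 2 for this area.
    if max_completed_phase.get(protein, 0) >= 2:
        return 1
    # Negative: druggable, with no known association to the therapeutic area.
    if protein in druggable and protein not in area_associated:
        return 0
    # Everything else (e.g. associated but below phase 2) is left unlabeled.
    return None


# Toy inputs, invented purely for illustration.
max_completed_phase = {"P1": 3, "P2": 1}
area_associated = {"P1", "P2"}
druggable = {"P1", "P2", "P3", "P4"}
active_in_network = {"P1", "P2", "P3"}

labels = {
    p: label_protein(p, max_completed_phase, area_associated, druggable, active_in_network)
    for p in ["P1", "P2", "P3", "P4"]
}
```

Proteins that are associated with the area but lack a phase 2 drug fall into neither class in this sketch, which matches the description's requirement that negatives have no known association.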

Task Description: Classification. Given the protein and cell-context, predict whether the protein is a therapeutic target.
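One way to picture the task interface is a scorer keyed on the (protein, cell type) pair. The sketch below assumes a hypothetical lookup table of context-specific protein embeddings and an untrained logistic layer; it only shows the shape of the input and output and is not the model used in the referenced work. All names and numbers are illustrative.

```python
import math
from typing import Dict, List, Tuple


def predict_target_probability(
    protein: str,
    cell_type: str,
    embeddings: Dict[Tuple[str, str], List[float]],  # hypothetical contextual embeddings
    weights: List[float],                            # hypothetical classifier weights
    bias: float = 0.0,
) -> float:
    """Map a (protein, cell type) pair to a probability in (0, 1)."""
    x = embeddings[(protein, cell_type)]
    # Linear score followed by a sigmoid, i.e. plain logistic regression.
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Toy embeddings: the same protein gets a different vector in each cell type.
embeddings = {
    ("JAK1", "T cell"): [1.0, 0.0],
    ("JAK1", "fibroblast"): [-1.0, 0.0],
}
weights = [2.0, 0.0]

p_t_cell = predict_target_probability("JAK1", "T cell", embeddings, weights)
p_fibroblast = predict_target_probability("JAK1", "fibroblast", embeddings, weights)
```

Because the embedding depends on the cell type, the same protein can receive different probabilities in different contexts, which is the point of the cell-type-specific formulation.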

Dataset Statistics: The final numbers of positive (negative) samples are 152 (1,465) for RA and 114 (1,377) for IBD. In PINNACLE, this dataset was augmented to include 156 cell types.
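The counts above imply a strong class imbalance. A short calculation, using only the numbers stated in this section, makes the positive fraction explicit; the variable names are illustrative.

```python
# (positives, negatives) per therapeutic area, as stated above.
counts = {"RA": (152, 1465), "IBD": (114, 1377)}

positive_fraction = {}
for area, (pos, neg) in counts.items():
    total = pos + neg
    positive_fraction[area] = pos / total
    print(f"{area}: {pos} of {total} proteins are positive ({pos / total:.1%})")
```

With fewer than one in ten proteins labeled positive in either area, plain accuracy is a weak summary of model quality here; this is worth keeping in mind when reading evaluation results.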

Dataset Split: Cold protein split. We split the dataset so that about 80% of the proteins are in the training set, about 10% are in the validation set, and about 10% are in the test set. The splits are consistent across cell type contexts, so a protein assigned to one set stays in that set for every cell type, which avoids data leakage.
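A cold protein split of this kind can be sketched by assigning each unique protein to one set first and then routing every (protein, cell type) record by its protein. The function below is an illustrative sketch, not the split code shipped with TDC; the record layout, function name, and seed are assumptions.

```python
import random
from typing import Dict, List, Tuple

Record = Tuple[str, str, int]  # (protein, cell_type, label); hypothetical layout


def cold_protein_split(
    records: List[Record],
    frac: Tuple[float, float, float] = (0.8, 0.1, 0.1),
    seed: int = 0,
) -> Dict[str, List[Record]]:
    """Split records so that no protein appears in more than one set."""
    # Shuffle the unique proteins deterministically.
    proteins = sorted({protein for protein, _, _ in records})
    random.Random(seed).shuffle(proteins)

    n_train = int(frac[0] * len(proteins))
    n_valid = int(frac[1] * len(proteins))

    # Assign each protein to exactly one set, independent of cell type.
    assignment = {}
    for i, protein in enumerate(proteins):
        if i < n_train:
            assignment[protein] = "train"
        elif i < n_train + n_valid:
            assignment[protein] = "valid"
        else:
            assignment[protein] = "test"

    # Route every record by its protein, so all cell types follow the same split.
    splits: Dict[str, List[Record]] = {"train": [], "valid": [], "test": []}
    for record in records:
        splits[assignment[record[0]]].append(record)
    return splits


# Toy data: 20 proteins, each observed in 3 cell types.
records = [(f"P{i}", ct, 0) for i in range(20) for ct in ("A", "B", "C")]
splits = cold_protein_split(records)
```

Splitting on proteins rather than on individual records is what keeps the evaluation "cold": a test protein is never seen during training in any cell type.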

from tdc.resource.dataloader import DataLoader

# Select the dataset by name, then fetch its contents.
data = DataLoader(name="opentargets_dti")
df = data.get_data()

References:

[1] Li, Michelle, et al. “Contextualizing Protein Representations Using Deep Learning on Protein Networks and Single-Cell Data” bioRxiv (2023)

Dataset License: CC BY 4.0 US.