Counterfactual Prediction Task Overview

Definition: The counterfactual prediction task is to predict the gene expression response of single cells to chemical and genetic perturbations, with the aim of measuring model generalization across cell lines and perturbation types. Understanding cellular responses to genetic perturbation is central to numerous biomedical applications, from identifying genetic interactions involved in cancer to developing methods for regenerative medicine. Counterfactual prediction of drug-based perturbations at single-cell resolution also supports the design of cell-type-specific drugs and treatments, which facilitates precision medicine. The task is predictive rather than generative. It is formalized as a function that takes a cell (described by attributes such as cell line, disease, and tissue) and a perturbation (such as a drug or a CRISPR-based perturbation) and outputs the gene expression counts of that cell after the perturbation.
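As a rough illustration of this formalization, the task can be sketched as a function from a cell and a perturbation to a vector of gene counts. The names `Cell`, `Perturbation`, `fit_mean_baseline`, and `predict_expression` are hypothetical and are not part of TDC or scPerturb. The per-perturbation mean is only a placeholder baseline, not a recommended model.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Cell:
    # Hypothetical container for the cell attributes named in the definition.
    cell_line: str
    disease: str
    tissue: str


@dataclass(frozen=True)
class Perturbation:
    kind: str    # e.g. "drug" or "crispr"
    target: str  # e.g. a compound name or a targeted gene


# One training example: (cell, perturbation, post-perturbation gene counts).
Example = Tuple[Cell, Perturbation, List[float]]


def fit_mean_baseline(examples: List[Example]) -> Dict[Perturbation, List[float]]:
    """Average the observed expression vectors for each perturbation."""
    sums: Dict[Perturbation, List[float]] = {}
    counts: Dict[Perturbation, int] = {}
    for _cell, pert, expr in examples:
        if pert not in sums:
            sums[pert] = [0.0] * len(expr)
            counts[pert] = 0
        sums[pert] = [s + x for s, x in zip(sums[pert], expr)]
        counts[pert] += 1
    return {p: [s / counts[p] for s in sums[p]] for p in sums}


def predict_expression(
    model: Dict[Perturbation, List[float]], cell: Cell, pert: Perturbation
) -> List[float]:
    """f(cell, perturbation) -> predicted per-gene counts."""
    # This placeholder ignores the cell attributes; a real model would use them.
    return model[pert]
```

A model for this task would replace the lookup in `predict_expression` with something that conditions on the cell attributes and can produce an output for perturbations absent from the training examples.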

Impact: Machine learning has significantly advanced the ability to predict how single cells respond to chemical and genetic perturbations, a capability that is crucial for understanding cellular behavior and developing new therapeutic strategies. Machine learning models have improved predictive accuracy, can handle dose dependencies and complex perturbations, and help optimize experimental designs. These advances enable more efficient and accurate exploration of cellular responses, facilitating drug discovery and the development of personalized medicine.

Generalization: We measure model generalization across seen and unseen perturbations and across seen and unseen cell lines.
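One way to picture the seen/unseen evaluation is the sketch below, which holds out a fraction of cell lines and a fraction of perturbations and then buckets every record by what was held out. This is an assumption-laden illustration, not the splitting logic used by TDC; the function name `seen_unseen_split`, the bucket names, and the `held_out_frac` parameter are all hypothetical.

```python
import random
from typing import Dict, List, Tuple

Record = Tuple[str, str]  # (cell_line, perturbation)


def seen_unseen_split(
    records: List[Record], held_out_frac: float = 0.25, seed: int = 0
) -> Dict[str, List[Record]]:
    """Bucket records by whether their cell line or perturbation is held out."""
    rng = random.Random(seed)
    cell_lines = sorted({c for c, _ in records})
    perts = sorted({p for _, p in records})

    # Hold out at least one cell line and one perturbation.
    n_lines = max(1, int(len(cell_lines) * held_out_frac))
    n_perts = max(1, int(len(perts) * held_out_frac))
    held_lines = set(rng.sample(cell_lines, n_lines))
    held_perts = set(rng.sample(perts, n_perts))

    split: Dict[str, List[Record]] = {
        "train": [],                # seen cell line, seen perturbation
        "unseen_perturbation": [],  # seen cell line, held-out perturbation
        "unseen_cell_line": [],     # held-out cell line, seen perturbation
        "unseen_both": [],          # both held out
    }
    for cell_line, pert in records:
        new_line = cell_line in held_lines
        new_pert = pert in held_perts
        if new_line and new_pert:
            key = "unseen_both"
        elif new_line:
            key = "unseen_cell_line"
        elif new_pert:
            key = "unseen_perturbation"
        else:
            key = "train"
        split[key].append((cell_line, pert))
    return split
```

A "seen" test set would then be a random holdout drawn from the `train` bucket, so that each evaluation bucket isolates one kind of generalization.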

Product: Drug Repurposing, Predicting Adverse Drug Reactions, Biopharmaceuticals

Pipeline: Target Discovery, Phenotypic Screening

scPerturb

Dataset Description: scPerturb is a harmonized collection of single-cell perturbation-response data, designed to support the development, benchmarking, and validation of computational methods in systems biology. It includes several types of molecular readouts, such as transcriptomics, proteomics, and epigenomics, and covers responses to genetic and chemical perturbations, which are crucial for understanding cellular mechanisms and developing therapeutic strategies. Data from different sources are uniformly pre-processed and subjected to rigorous quality control, and features are standardized across datasets for easy comparison and integration.

Task Description: Given cell-type-specific labels and a perturbation, predict the gene expression vector for that cell.

Dataset Statistics: 44 publicly available single-cell perturbation-response datasets. In most datasets, approximately 3,000 genes are measured per cell on average. 100,000+ perturbations.

Dataset Split: Random Split, Seen-unseen splits across cell line and perturbation

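from tdc.multi_pred.perturboutcome import PerturbOutcome
# Load the sci-Plex 2 drug perturbation dataset (Srivatsan et al., 2020) from scPerturb.
data = PerturbOutcome(name='scperturb_drug_SrivatsanTrapnell2020_sciplex2')
# Retrieve the dataset split.
split = data.get_split()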

References:

Peidli, S., Green, T. D., Shen, C., Gross, T., Min, J. K., Garda, S., Yuan, B., Schumacher, L., Taylor-King, J., Marks, D., Luna, A., Blüthgen, N., & Sander, C. (2023). scPerturb: Harmonized Single-Cell Perturbation Data. https://doi.org/10.1101/2022.08.20.504663