Protein-Protein Interaction Prediction Task Overview

Definition: Proteins are the fundamental functional units of human biology. They rarely act alone, however, and usually interact with one another to carry out their functions. Knowledge of protein-protein interactions (PPIs) is important for discovering new putative therapeutic targets to cure disease. Determining PPI activity usually requires expensive and time-consuming wet-lab experiments. PPI prediction aims to predict PPI activity from the amino acid sequences of a pair of proteins.

Impact: A vast number of human PPIs are unknown and untested. Filling in the missing parts of the PPI network can improve our understanding of diseases and potential disease targets. An accurate machine learning model can greatly facilitate this process. Because protein 3D structures are expensive to acquire, prediction based on sequence data is desirable.

Generalization: As the majority of PPIs are unknown, the model needs to extrapolate from a given gold-label training set to a diverse set of unseen proteins from various tissues and organisms.

Product: Small-molecule, macromolecule.

Pipeline: Basic biomedical research, target discovery, macromolecule discovery.

HuRI

Dataset Description: All pairwise combinations of human protein-coding genes are being systematically interrogated to identify which are involved in binary protein-protein interactions. In the most recent effort, 17,500 proteins have been tested and a first human reference interactome (HuRI) map has been generated. The dataset comes from the Center for Cancer Systems Biology at Dana-Farber Cancer Institute.

Task Description: Binary Classification. Given a pair of target amino acid sequences, predict whether the two proteins interact.
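To make the input format of this task concrete, the following is a minimal sketch of turning a pair of amino acid sequences into a fixed-length feature vector that a binary classifier could consume. The amino acid composition features, the function names `composition` and `pair_features`, and the two toy sequences are illustrative assumptions, not part of the dataset or of any particular library.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq):
    """Return the 20-dimensional amino acid frequency vector of a sequence."""
    counts = Counter(seq)
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        # No standard residues present: fall back to an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def pair_features(seq_a, seq_b):
    """Order-independent features for a protein pair.

    Concatenates the element-wise sum and the element-wise absolute
    difference of the two composition vectors, so swapping the two
    proteins yields the same features.
    """
    fa, fb = composition(seq_a), composition(seq_b)
    summed = [x + y for x, y in zip(fa, fb)]
    diff = [abs(x - y) for x, y in zip(fa, fb)]
    return summed + diff


# Toy sequences (hypothetical, for illustration only).
features = pair_features("MKTAYIAKQR", "MSDNGELEDK")
```

Symmetric features are used here because an interaction between proteins A and B is the same as one between B and A. Composition features discard residue order, so stronger sequence encoders would normally be preferred in practice.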

Dataset Statistics: 51,813 positive PPI pairs, 8,248 proteins

Dataset Split: Random Split, Cold Drug Split, Cold Protein Split
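The idea behind a cold protein split can be sketched as follows: a fraction of proteins is held out, and every pair that involves a held-out protein is placed in the test set, so that evaluation measures generalization to unseen proteins. This is only an illustration under assumed conventions. The function name `cold_protein_split`, its parameters, and the toy pairs are hypothetical, and the library's own split may differ in detail (for example, in how a validation set is formed).

```python
import random


def cold_protein_split(pairs, test_frac=0.2, seed=0):
    """Hold out a fraction of proteins; pairs touching them form the test set."""
    # Collect and sort the proteins so the shuffle is reproducible.
    proteins = sorted({p for pair in pairs for p in pair})
    rng = random.Random(seed)
    rng.shuffle(proteins)

    n_test = max(1, int(len(proteins) * test_frac))
    held_out = set(proteins[:n_test])

    # A pair is "cold" if at least one of its proteins is held out.
    train = [pair for pair in pairs if not (set(pair) & held_out)]
    test = [pair for pair in pairs if set(pair) & held_out]
    return train, test, held_out


# Toy protein identifiers (hypothetical).
toy_pairs = [("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P4", "P5"), ("P1", "P5")]
train, test, held_out = cold_protein_split(toy_pairs, test_frac=0.2)
```

Unlike a random split, no held-out protein appears in any training pair, which is closer to the generalization setting described above.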

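from tdc.multi_pred import PPI
# Load the HuRI protein-protein interaction dataset.
data = PPI(name = 'HuRI')
# Retrieve the train, validation, and test splits.
split = data.get_split()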

Note: (1) For genes associated with multiple protein sequences, the sequences are separated by the * symbol. (2) The dataset contains only positive pairs. All unobserved pairs are true negative PPIs that were tested experimentally. To get negative samples, you can call:

data = data.neg_sample(frac = 1)
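Conceptually, negative sampling draws protein pairs that do not appear among the observed positives. The sketch below illustrates that idea in plain Python. It is not the library's implementation: the function name `sample_negatives`, the interpretation of `frac` as the ratio of negatives to positives, and the toy identifiers are all assumptions made for illustration.

```python
import random


def sample_negatives(positive_pairs, frac=1.0, seed=0):
    """Sample unobserved protein pairs, about frac times the number of positives."""
    # Treat pairs as unordered so (A, B) and (B, A) count as the same pair.
    positives = {frozenset(p) for p in positive_pairs}
    proteins = sorted({p for pair in positive_pairs for p in pair})

    # Number of distinct-protein pairs that are not observed positives.
    n_all = len(proteins) * (len(proteins) - 1) // 2
    n_hetero_pos = sum(1 for p in positives if len(p) == 2)
    n_available = n_all - n_hetero_pos

    # Never request more negatives than exist, so the loop always terminates.
    n_target = min(int(len(positives) * frac), n_available)

    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < n_target:
        a, b = rng.sample(proteins, 2)  # two distinct proteins
        pair = frozenset((a, b))
        if pair not in positives:
            negatives.add(pair)
    return sorted(tuple(sorted(p)) for p in negatives)


# Toy positive pairs (hypothetical identifiers).
toy_positives = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
toy_negatives = sample_negatives(toy_positives, frac=1.0)
```

With `frac=1.0`, the sketch returns as many negatives as positives when enough unobserved pairs exist, giving a balanced set for binary classification.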

References:

[1] Luck, K., Kim, D., Lambourne, L. et al. A reference map of the human binary protein interactome. Nature 580, 402–408 (2020).

Dataset License: CC BY 4.0.